Monthly Archives: November 2011

The power of “idle” international scientific cooperation


What if you could convince people to help you do your research in their spare time? What if you could convince a million people to contribute to a specific scientific effort without having to recruit them yourself? Even better if you could get all these exo-collaborators without making a huge dent in your budget, which is sometimes impossible even if you are willing to spend. The world is an interconnected one these days; millions of voices sound through the web, which makes it hard for yours to stand out. Communicating with a large mass of people has become feasible, but it comes with a price: you need to be attractive! That’s right, you can get a million people to help your scientific efforts; it’s called crowdsourcing (think of Wikipedia, for instance), but you have to make it rewarding in some way, and since you are trying to convince them to work for you in their free time, you have to make it look like something they’d do in that free time. So why not make it a video game?

There are nowadays some serious games which are nothing more than internet-based platforms onto which tons of data are loaded and accessed by many users, who analyze them while being scored on their achievements according to the rules of each game. Remember that famous screensaver, SETI@home, run from the University of California, Berkeley, which used the idle time on your computer to crunch data for SETI (Search for Extraterrestrial Intelligence)? That is a more passive example of exo-collaboration, simply called distributed computing, since the user has to do nothing but allow data into their computer for further processing.

Fold.it is a videogame (that actually began as a distributed computing screensaver called Rosetta@home) that allows you to play around with a protein and fold it in many different ways while you score points according to the conformer’s plausibility. Obtaining the native tertiary (and even the secondary) structure of a protein from no other information than the primary structure is extremely difficult given the enormous number of available degrees of freedom. Molecular dynamics alone is unable to predict the native tertiary structure of the protein. To give just one example of the combinatorics involved, the number p of possible ways of pairing all the cysteine residues into disulfide bonds is p = n!/[(n/2)! 2^(n/2)], where n is the (even) number of cysteine residues available. Besides, computers know nothing about proteins or enzymatic catalysis, so a hand from us fellow humans and our chemical insight is badly needed. Our previous knowledge of chemistry, biochemistry and the nature of related proteins can therefore help those programs find the best possible answer to ‘what does this protein look like in 3D space?’, but since this human-helped process is slow and cumbersome, you need thousands of people working on it for a great deal of time, almost as if every person playing with the same structure were a single core in your computer. Fold.it is thus a sort of protein self-docking, if you will, in which players are ranked according to their skills and rewarded according to how well their structure complies with three simple rules: 1) lack of voids (packing), 2) keeping the orange hydrophobic chains unexposed to the aqueous exterior and 3) avoiding clashes. Scoring functions for these three concepts are calculated and yield a score for the player, who is then ranked against other players folding the same protein (or against other players in overall performance).
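To get a feeling for how fast the disulfide count grows, here is a minimal Python sketch of the formula quoted above. The function name disulfide_pairings is my own, and the sketch assumes an even number of cysteines, all of them paired.

```python
from math import factorial


def disulfide_pairings(n_cys: int) -> int:
    """Number of ways to pair n cysteines into n/2 disulfide bonds.

    Implements p = n! / [(n/2)! * 2^(n/2)]; assumes n is even and
    that every cysteine takes part in exactly one bond.
    """
    if n_cys < 0 or n_cys % 2:
        raise ValueError("the formula assumes an even, non-negative number of cysteines")
    half = n_cys // 2
    # Integer division is exact here because the result is always a whole number.
    return factorial(n_cys) // (factorial(half) * 2 ** half)


if __name__ == "__main__":
    for n in (2, 4, 6, 8):
        print(n, disulfide_pairings(n))
```

Four cysteines can already be paired in 3 different ways and eight in 105, and this is only one of the many sources of combinatorial growth in the folding problem.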

Image via Fold.it wiki

Fold.it has already collected some major success stories, such as the one published in Nature Structural & Molecular Biology by David Baker (founder of Fold.it) et al. in September 2011 (doi:10.1038/nsmb.2119), in which players helped solve the crystal structure of a protease from a retrovirus which causes AIDS in monkeys. The determination of this structure had already taken 15 years of work with only partial success, but the data had been available in Fold.it for only three weeks when the appropriate match to the diffraction experiments was found! This case alone has stirred a great deal of attention; for a beautifully written piece about it, you can check this article in Discover Magazine by Ed Yong.

Other such examples of crowdsourcing in science, more specifically in astronomy, are Galaxy Zoo and Moon Zoo, in which thousands of images from the Hubble telescope and numerous moon probes are made available for users to sort and classify. The aim of Moon Zoo is to study the number, shape and occurrence of craters which, unlike those on Earth, basically never erode. This analysis will let us know more about the origin of our natural satellite and ultimately about the origins of our solar system.

The term Citizen Science is applied to this specific kind of scientific crowdsourcing, and even publications such as Scientific American host a section where you can call out for volunteers for your projects: some sort of classified ads for lonely scientists in their labs in search of “idle” hands that can make a significant contribution to science. Some Citizen Science projects are intended for kids and teenagers as a way to get more people interested in scientific disciplines by engaging them directly in activities in which they can measure the progress of their own contributions. It is worth mentioning that projects like Fold.it, Moon Zoo and Galaxy Zoo are developed so that they can be used by people with no expertise in the field, in order to recruit as many people as possible to perform one very specific task, thus proving that the human brain is a powerful and beautiful machine whose insight isn’t equaled by any artificial system, yet.

Well, it is now time to go back to work before I’m deemed a permanent exo-collaborator by my bosses. Just a final thought: What were our mothers saying about us playing too much with our video games?

2011, International Year of Chemistry

As usual please share your thoughts in the comments section, rate this post and let me know that you are out there reading this.


The Gen keyword in Gaussian. Adding an external basis set.


I am frequently asked how to include an extra set of basis functions in a calculation or how to use an entirely external basis set. Sometimes this question also implies the explicit declaration of an external pseudopotential or Effective Core Potential (ECP).

New basis sets and ECPs are continuously published in specialized journals. The same happens with functionals for DFT calculations. There is no fixed format in which they are published; usually only a list of coefficients and exponents is shown, and one has to figure out how to introduce it into one’s calculation. The EMSL Basis Set Exchange site helps you get it right! It has a clickable periodic table and, on the left side, a list of many (not all) different basis sets. Below the periodic table there is a menu from which one can select the program the basis set is intended for; finally, clicking on “get basis set” opens a pop-up window that shows the result in the selected format along with the corresponding references for citation. A multiple query can be performed by selecting more than one element on the table, which generates a list that can almost surely be used as input without further manipulation. Dr. David Feller is to be thanked for leading the creation of this repository. More on the history and mission of the EMSL can be found on their About page. Because of my own experience, the rest of this post addresses the inclusion of external basis sets in Gaussian; other programs such as NWChem will be addressed in a different post soon.

The correct format for the inclusion of an external basis set is exemplified below with the 3-21G basis set for carbon as obtained from the EMSL Basis Set Exchange site (blank lines are marked explicitly just to emphasize their location):

charge and spin multiplicity
Molecular coordinates
- blank line -
C     0
S   3   1.00
    172.2560000              0.0617669
     25.9109000              0.3587940
      5.5333500              0.7007130
SP   2   1.00
      3.6649800             -0.3958970              0.2364600
      0.7705450              1.2158400              0.8606190
SP   1   1.00
      0.1958570              1.0000000              1.0000000
****
- blank line -

The use of four stars ‘****’ is mandatory to indicate the end of the basis set specification for any given atom. If a basis set is to be declared for a second atom, it should be included after the **** line without any blank line in between.
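The layout rules above (an "Element 0" line, shell headers giving the number of primitives, one row per primitive, a closing **** and no blank lines in between) are easy to check before submitting a job. The following Python sketch is only an illustration: check_gen_block is a hypothetical helper of my own, not part of Gaussian, and it assumes a single element per center line, which is a simplification.

```python
CARBON_3_21G = """\
C     0
S   3   1.00
    172.2560000              0.0617669
     25.9109000              0.3587940
      5.5333500              0.7007130
SP   2   1.00
      3.6649800             -0.3958970              0.2364600
      0.7705450              1.2158400              0.8606190
SP   1   1.00
      0.1958570              1.0000000              1.0000000
****
"""


def check_gen_block(text):
    """Return {element: [(shell_type, n_primitives), ...]} or raise ValueError."""
    lines = text.strip("\n").splitlines()
    summary = {}
    i = 0
    while i < len(lines):
        # Center line: element symbol followed by 0 (a blank line fails here too).
        header = lines[i].split()
        if len(header) != 2 or header[1] != "0":
            raise ValueError(f"line {i + 1}: expected 'Element 0', got {lines[i]!r}")
        element, shells = header[0], []
        i += 1
        while i < len(lines) and lines[i].strip() != "****":
            # Shell header: type, number of primitives, scale factor.
            fields = lines[i].split()
            if len(fields) != 3:
                raise ValueError(f"line {i + 1}: expected 'TYPE N SCALE', got {lines[i]!r}")
            shell_type, n_prim = fields[0].upper(), int(fields[1])
            # SP shells carry an exponent plus two coefficients per row.
            n_cols = 3 if shell_type == "SP" else 2
            prims = lines[i + 1 : i + 1 + n_prim]
            if len(prims) != n_prim or any(len(p.split()) != n_cols for p in prims):
                raise ValueError(
                    f"shell at line {i + 1}: expected {n_prim} rows of {n_cols} numbers"
                )
            shells.append((shell_type, n_prim))
            i += 1 + n_prim
        if i == len(lines):
            raise ValueError(f"basis for {element} is not closed by ****")
        summary[element] = shells
        i += 1  # skip the **** line
    return summary


if __name__ == "__main__":
    print(check_gen_block(CARBON_3_21G))
```

A stray blank line between two atoms, or a shell that declares three primitives but lists only two, raises a ValueError instead of producing a cryptic error from the program later on.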

WARNING! Sometimes more than one version of a basis set can be found in a single file; this is due to different representations, spherical (pure) or Cartesian. Pure functions use 5 functions for d-type orbitals and 7 for f-type orbitals (5D, 7F), whereas Cartesian functions use 6 and 10, respectively (6D, 10F). For basis sets read in through Gen, Gaussian defaults to pure functions (5D, 7F) unless told otherwise in the route section. Calculations must be consistent throughout, hence all basis functions should be either Cartesian or pure.
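The numbers behind the 5D/7F and 6D/10F labels follow directly from the angular momentum l of the shell: a pure (spherical) shell has 2l + 1 functions, while a Cartesian shell has (l + 1)(l + 2)/2. A short Python sketch (the function names are mine) makes the bookkeeping explicit:

```python
def n_pure(l: int) -> int:
    """Number of pure (spherical harmonic) functions in a shell of angular momentum l."""
    return 2 * l + 1


def n_cartesian(l: int) -> int:
    """Number of Cartesian functions x^a y^b z^c with a + b + c = l."""
    return (l + 1) * (l + 2) // 2


if __name__ == "__main__":
    for label, l in (("d", 2), ("f", 3)):
        print(f"{label}: pure = {n_pure(l)}, cartesian = {n_cartesian(l)}")
```

For s and p shells both counts coincide, which is why the distinction only matters from d functions onwards.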

Inclusion of a pseudopotential frees computational resources for the calculation of the electronic structure of the valence shell by replacing the inner electrons with a set of functions which simulate their presence and their effect (such as shielding) on the valence electrons. There are full core pseudopotentials, which replace the entire core (kernel). There are also medium core pseudopotentials, which replace only the kernel preceding the full one, so that the outermost core electrons are calculated explicitly. The correct inclusion of a pseudopotential is shown below, exemplified by the LANL2DZ ECP by Hay and Wadt for the chlorine atom.

charge and spin multiplicity
Molecular coordinates
- blank line -
basis set for atom1
****
basis set for atom2 (if there is any)
****
- blank line -
CL     0
CL-ECP     2     10
d   potential
  5
1     94.8130000            -10.0000000
2    165.6440000             66.2729170
2     30.8317000            -28.9685950
2     10.5841000            -12.8663370
2      3.7704000             -1.7102170
s-d potential
  5
0    128.8391000              3.0000000
1    120.3786000             12.8528510
2     63.5622000            275.6723980
2     18.0695000            115.6777120
2      3.8142000             35.0606090
p-d potential
  6
0    216.5263000              5.0000000
1     46.5723000              7.4794860
2    147.4685000            613.0320000
2     48.9869000            280.8006850
2     13.2096000            107.8788240
2      3.1831000             15.3439560

If a second ECP is to be introduced, it should be placed right after the first one without any blank line! If a blank line is detected then the program will assume it’s done reading all ECPs and Basis Sets.

Finally, here is an example of a combination of both keywords. If a second ECP were needed, we’d place it at the end of the first one without a blank line. The molecule is any given chlorinated hydrocarbon (H, C and Cl atoms exclusively).

#P B3LYP/gen pseudo=read ADDITIONAL-KEYWORDS
- blank line -
Title line
- blank line -
0 1
Molecular Coordinates
- blank line -
H     0
S   3   1.00
     19.2384000              0.0328280
      2.8987000              0.2312040
      0.6535000              0.8172260
S   1   1.00
      0.1776000              1.0000000
****
C     0
S   7   1.00
   4233.0000000              0.0012200
    634.9000000              0.0093420
    146.1000000              0.0454520
     42.5000000              0.1546570
     14.1900000              0.3588660
      5.1480000              0.4386320
      1.9670000              0.1459180
S   2   1.00
      5.1480000             -0.1683670
      0.4962000              1.0600910
S   1   1.00
      0.1533000              1.0000000
P   4   1.00
     18.1600000              0.0185390
      3.9860000              0.1154360
      1.1430000              0.3861880
      0.3594000              0.6401140
P   1   1.00
      0.1146000              1.0000000
****
Cl     0
S   2   1.00
      2.2310000             -0.4900589
      0.4720000              1.2542684
S   1   1.00
      0.1631000              1.0000000
P   2   1.00
      6.2960000             -0.0635641
      0.6333000              1.0141355
P   1   1.00
      0.1819000              1.0000000
****
- blank line -
CL     0
CL-ECP     2     10
d   potential
  5
1     94.8130000            -10.0000000
2    165.6440000             66.2729170
2     30.8317000            -28.9685950
2     10.5841000            -12.8663370
2      3.7704000             -1.7102170
s-d potential
  5
0    128.8391000              3.0000000
1    120.3786000             12.8528510
2     63.5622000            275.6723980
2     18.0695000            115.6777120
2      3.8142000             35.0606090
p-d potential
  6
0    216.5263000              5.0000000
1     46.5723000              7.4794860
2    147.4685000            613.0320000
2     48.9869000            280.8006850
2     13.2096000            107.8788240
2      3.1831000             15.3439560
- blank line -
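Since the whole trick lies in where the blank lines go, the assembly can be sketched as a small Python function. This is only an illustration under my own assumptions: build_gen_input and its arguments are hypothetical names, the H2 geometry is a placeholder, and the output should still be checked by eye before running it.

```python
def _close(block: str) -> str:
    """Make sure a basis block for one atom ends with the mandatory **** line."""
    block = block.strip()
    return block if block.endswith("****") else block + "\n****"


def build_gen_input(route, title, charge, multiplicity, coords,
                    basis_blocks, ecp_blocks=()):
    """Join the sections of a Gen (+ Pseudo=Read) input with single blank lines.

    Basis blocks are concatenated with no blank line between atoms,
    and the same is done for ECP blocks, as described in the post.
    """
    sections = [
        route.strip(),
        title.strip(),
        f"{charge} {multiplicity}\n" + "\n".join(c.strip() for c in coords),
        "\n".join(_close(b) for b in basis_blocks),
    ]
    if ecp_blocks:
        sections.append("\n".join(e.strip() for e in ecp_blocks))
    # One blank line between sections and one closing the file.
    return "\n\n".join(sections) + "\n\n"


# Hydrogen basis taken from the example above; the geometry is a placeholder.
H_BASIS = """\
H     0
S   3   1.00
     19.2384000              0.0328280
      2.8987000              0.2312040
      0.6535000              0.8172260
S   1   1.00
      0.1776000              1.0000000
"""

if __name__ == "__main__":
    print(build_gen_input(
        route="#P B3LYP/gen",
        title="H2 with an external basis set (illustrative geometry)",
        charge=0,
        multiplicity=1,
        coords=["H 0.0 0.0 0.0", "H 0.0 0.0 0.74"],
        basis_blocks=[H_BASIS],
    ))
```

When ECPs are passed through ecp_blocks, remember to add pseudo=read to the route string yourself; the sketch does not do it for you.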

If you like this post or found it useful, please leave a comment, share it or just give it a like. It is as much fun to find out people are reading as it is to find the answer to one’s questions in someone else’s blog 🙂

Peace out!
