“I sense a great disturbance in the Force”, Darth Vader says as Obi-Wan Kenobi boards the Super Star Destroyer. I am a believer in the force field that Jedi knights manipulate with miraculous effects. But I have grown wary of the force fields that computational biologists use in their simulations.
To simulate the behavior of a molecular system on a computer, say, the binding of a drug to a protein or the conformational changes in a signaling molecule, it is crucial to calculate the interactions between the atoms accurately. Although quantum mechanics can compute these interactions in principle, in practice empirical energy functions have to be used because computing power is limited. Many empirical energy functions have been developed, and they are affectionately referred to as force fields. In the beginning, there were only simple interaction parameters for inert gases, then carbon monoxide, then water. Then came the need to simulate proteins, and complex force fields were developed. Many different force fields emerged independently, each christened with its own acronym. There are AMBER, CHARMM, and GROMOS. There are OPLS-AA and MMFF. There are many more. Each is a cult with followers, and they are constantly at war.
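For readers who have never looked inside one, here is a minimal sketch of what the non-bonded part of an empirical energy function typically looks like: a Lennard-Jones term for van der Waals interactions plus a Coulomb term for point charges. It is not the functional form of any particular force field. The function names are my own, and the argon-like parameters (epsilon of about 0.238 kcal/mol, sigma of about 3.4 Å) are approximate textbook values used only for illustration.

```python
# Approximate Coulomb constant in kcal*angstrom/(mol*e^2); an assumed,
# commonly quoted value, not taken from a specific force field.
COULOMB_CONSTANT = 332.06


def lennard_jones(r, epsilon, sigma):
    """12-6 Lennard-Jones energy for two atoms separated by r (angstrom)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def coulomb(r, q1, q2):
    """Coulomb energy for two point charges (in units of e) separated by r."""
    return COULOMB_CONSTANT * q1 * q2 / r


def pair_energy(r, epsilon, sigma, q1, q2):
    """Return the energy components and their sum for one atom pair."""
    vdw = lennard_jones(r, epsilon, sigma)
    elec = coulomb(r, q1, q2)
    return {"vdw": vdw, "electrostatic": elec, "total": vdw + elec}


# Illustrative case: two neutral, argon-like atoms at the distance where
# the Lennard-Jones energy is at its minimum, r = 2^(1/6) * sigma.
epsilon, sigma = 0.238, 3.4
r_min = 2 ** (1 / 6) * sigma
energy = pair_energy(r_min, epsilon, sigma, 0.0, 0.0)
print(energy)
```

Even this toy version needs two parameters per atom pair plus a charge per atom. A protein force field multiplies that by every atom type it defines and adds bonded terms on top.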
After decades of development, however, computer simulation is still greeted with skepticism by biologists. Compared with experimental technologies of similar age, it is a curious exception. True, almost everybody uses simulation now. But whenever there is a disagreement between the simulation and the experiment, the simulation result becomes the primary suspect. The Journal of Molecular Biology recently rejected a computational paper with the comment that it “will only accept computational works that AGREE with experimental results.” This reflects the general attitude toward computer simulation: it is good only for verifying experiments, not for generating independent hypotheses.
Sadly for the computational biologists, this dismissal of their work is largely justified. Currently, computer simulations are riddled with problems, and many a simulation has produced results far from well-established experimental observations. The force fields bear much of the blame. Everyone agrees that they are not good enough. (At the same time, everyone contends that his or her force field is BETTER than the others.) Everyone agrees that something has to be done about them. But what is wrong with the force field?
The common opinion in the computational community is that the force field needs to be more accurate. The definition of accuracy, of course, is tricky, and depends on the specific problem to which the simulation is applied. So although everyone is talking about improving the force field, most people are doing just that: talking.
I think that the force field has a much more crippling problem than its inaccuracy. It is too complicated! Way too complicated! In the quest for accuracy, an ever-increasing number of parameters has been shoved into the energy functions, so that all present force fields are messy compilations of thousands of parameters. I doubt anyone, including the folks who parameterized the force fields, can correctly remember a tenth of the parameters, or even recollect why these parameters were chosen over others. For example, most force fields define many subtypes for each atomic element. Take the OPLS-AA force field. It contains 329 subtypes for carbon atoms, 165 for hydrogen, 76 for nitrogen, and 47 for oxygen. Mendeleev would turn in his grave.
The complexity of the force field is really the cause of its plight. Worse, it is complexity without underlying order or reason. In the blind pursuit of accuracy, parameters are fitted ad hoc, without justification. The consequence of the Machiavellian philosophy – “the end justifies the means” – is that only the end remains meaningful. The only useful result from a force field calculation is the total energy. The components, such as the van der Waals, hydrogen bond, or electrostatic energies, are believed only by the most faithful.
I was once in a group meeting where a graduate student talked about his calculations of the stability of a structural motif in proteins. He used his results to argue that hydrogen bonds were responsible for the stability. Yet the force field he used had no term describing hydrogen bonds. I proposed that the stability could come from pure electrostatic interactions and asked him about the charges the force field assigned to the atoms. Unsurprisingly, he had no clue. He can be forgiven. How can someone be expected to know these numbers when there are so many?
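The question I put to the student can be checked on the back of an envelope. Below is a rough sketch, under loudly stated assumptions, of how a force field with no hydrogen-bond term can still make an N-H...O=C contact attractive through point charges alone. The partial charges and the idealized linear geometry are illustrative placeholders that I chose for this example; they are not claimed to be the values of any specific force field, and the function name is hypothetical.

```python
import math

# Approximate Coulomb constant in kcal*angstrom/(mol*e^2); an assumed value.
COULOMB_CONSTANT = 332.06

# Each atom: (name, partial charge in e, position in angstrom).
# Charges and coordinates are illustrative assumptions: a linear
# N-H...O=C arrangement along the x axis with an N...O distance of 2.9.
donor = [("N", -0.5, (0.0, 0.0, 0.0)), ("H", 0.3, (1.0, 0.0, 0.0))]
acceptor = [("O", -0.5, (2.9, 0.0, 0.0)), ("C", 0.5, (4.13, 0.0, 0.0))]


def electrostatic_energy(group_a, group_b):
    """Sum the Coulomb energy over all atom pairs between two groups."""
    total = 0.0
    for _, charge_a, pos_a in group_a:
        for _, charge_b, pos_b in group_b:
            distance = math.dist(pos_a, pos_b)
            total += COULOMB_CONSTANT * charge_a * charge_b / distance
    return total


hbond_like_energy = electrostatic_energy(donor, acceptor)
print(f"N-H...O=C electrostatic energy: {hbond_like_energy:.2f} kcal/mol")
```

With these assumed numbers the sum comes out negative, that is, attractive, even though nothing in the calculation is labeled a hydrogen bond. The point is not the value but the fact that anyone who knows the charges can do this check in a minute.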
Most people use the force field as a black box. Unfortunately, people tend to abuse black boxes. When the simulation works, people over-interpret the results. When it fails, people simply sweep it under the rug. Few try to understand the reason behind the success or failure. To do so one has to open the black box. But it is too messy inside.
Contrast that with experimental techniques. An NMR spectroscopist usually has a good knowledge of the chemical shifts of protons in different environments, and a biologist doing fluorescence experiments knows the absorption band and photon efficiency of the labeling dye. NMR spectroscopy and fluorescence labeling are respectable experimental techniques not because they always give correct results, but because when they go wrong, the experimenters can understand why.
That is the problem with the force field. People do not know why their simulations succeed or fail. It is like a blind cat trying to catch a dead mouse. The cat may stumble upon the mouse, but the cat does not know why and how to do it again.
In science, the simple always supersedes the complicated. The clear is preferable to the confusing. Inaccuracy in the force field is tolerable as long as the applicability and limitations of the force field are understood and predictable. So instead of pushing for better accuracy by adding atom types and parameters, we should simplify the force fields first. Reduce the number of atom types, reduce the number of independent parameters, and minimize the set of standard data to which the parameters are fitted. Make it simple, make it comprehensible, and it will be respectable.
May the force be with us.