Benchmark proteins (eventually other biomolecules) using F@H
In broad terms, what are you trying to do?
Benchmark OpenFF force fields for protein structure using F@H. We would want to launch a bunch of jobs with just plain ol' proteins in water to gather a large aggregate simulation time. We will probably eventually need to do enhanced sampling, likely using Markov state modeling or another method that is appropriate to OpenFF, but just getting a lot of aggregate simulation is likely to help a lot in converging NMR observables.
We may need to run non-OpenFF force fields for comparison. Amber ff14SB would probably be trivial; anything else could be hard (and maybe we don't bother?)
It should not matter which simulation engine is used; GROMACS and OpenMM should give the same results.
How do you believe using this project would help you to do this?
We need to gather large amounts of aggregate simulation time to test NMR observables and other observables to compare to experiment. It is unlikely we will get enough aggregate simulation time without F@H to do statistically meaningful tests.
What problems do you anticipate with using this project to achieve the above?
- Gathering the large amount of data back to analyze with HDXER or other software.
- If doing enhanced sampling, deciding on what the best way is to do the enhanced sampling and manage that. We could presumably lean quite a bit on Bowman/Voelz efforts so far - definitely don't want to reinvent the wheel.
I'll provide some more detail about the anticipated simulation needs. Based on previous protein force field benchmarks, we will estimate NMR observables (chemical shifts, scalar couplings, and NOEs) from unbiased MD simulations for three sets of protein systems:
- 32 small peptides (2 to 5 residues), 500 ns trajectories
- 45 folded proteins (largest 216 residues), 10 μs trajectories
- 10 disordered proteins (largest 140 residues), 30 μs trajectories

Each of these systems should be run in triplicate for each force field studied. Force fields will be OpenFF with two water models (TIP3P and OPC) and Amber ff14SB with OPC.
This proposal would entail 6.9 ms of aggregate sampling from unbiased MD for all systems and force fields (2.3 ms per force field/water model). Going beyond the above will require enhanced sampling algorithms, and I agree with @mrshirts that we should rely on expertise from the Voelz and Bowman groups for this.
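For anyone who wants to check the totals, the arithmetic behind the 6.9 ms figure can be sketched as below. The system counts, trajectory lengths, replicate count, and force field list are taken from the comment above; the variable names are just illustrative.

```python
# Aggregate sampling implied by the three benchmark sets above.
# Each entry: (number of systems, trajectory length in microseconds).
SYSTEM_SETS = {
    "small peptides": (32, 0.5),        # 500 ns each
    "folded proteins": (45, 10.0),
    "disordered proteins": (10, 30.0),
}
REPLICATES = 3  # each system is run in triplicate
FORCE_FIELDS = [
    "OpenFF + TIP3P",
    "OpenFF + OPC",
    "Amber ff14SB + OPC",
]

# One replicate of every system, in microseconds.
per_replicate_us = sum(count * length for count, length in SYSTEM_SETS.values())
# All replicates for one force field/water model, in milliseconds.
per_force_field_ms = per_replicate_us * REPLICATES / 1000.0
# All force field/water model combinations.
total_ms = per_force_field_ms * len(FORCE_FIELDS)

print(f"per replicate:   {per_replicate_us:.0f} us")    # 766 us
print(f"per force field: {per_force_field_ms:.3f} ms")  # 2.298 ms
print(f"total:           {total_ms:.3f} ms")            # 6.894 ms
```

The folded and disordered protein sets dominate: together they account for 750 of the 766 μs in each replicate.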
I wonder if the requirements here are kind of a "zero" case for another workflow - Like #1 but with edge/transformation, or #6 but no ligand?
This may face the same criticism as #1 ("why do we need a giant supercomputer for this?"), and the answer is that (per Chapin's message above) the protein observable benchmarks add up to 6.9 ms of simulation per shot. We'll want to take at least one shot to just benchmark the Rosemary release, and it will be really helpful to be able to use the same infrastructure to expand comparisons to include additional FFs, more proteins, or to test improvements to Rosemary.
So the inputs would be:
- solvated (?) protein structure file (likely PDB)
- force field definition (OFFXML, OFFXML name, or openmmforcefields-supported option)
- simulation time, temperature, and other parameters
And the desired output would be the raw trajectory.
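To make the input list above concrete, here is a minimal sketch of what one run specification might look like. Everything in it is an assumption for illustration: the class name `EquilibriumRunSpec`, its field names, the file names, and the temperature are hypothetical and not part of any existing API.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class EquilibriumRunSpec:
    """Hypothetical input bundle for one plain (non-alchemical) MD run."""

    structure_file: Path        # solvated protein structure, likely PDB
    force_field: str            # OFFXML path or name, or an openmmforcefields-supported option
    simulation_time_ns: float   # target trajectory length
    temperature_kelvin: float
    replicate: int = 0          # index of the independent replicate

    def validate(self) -> None:
        """Reject obviously invalid parameters before submission."""
        if self.simulation_time_ns <= 0:
            raise ValueError("simulation_time_ns must be positive")
        if self.temperature_kelvin <= 0:
            raise ValueError("temperature_kelvin must be positive")
        if self.replicate < 0:
            raise ValueError("replicate must be non-negative")


# Placeholder values only; file and force field names are made up.
spec = EquilibriumRunSpec(
    structure_file=Path("protein_solvated.pdb"),
    force_field="example-force-field.offxml",
    simulation_time_ns=10_000.0,  # 10 us, as for the folded proteins
    temperature_kelvin=298.0,
)
spec.validate()
print(spec)
```

The only output tied to such a spec would be the raw trajectory, so nothing about analysis is encoded here.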
Raw notes from story review, shared here for visibility:
- need to be able to run equilibrium simulations, no alchemical transformations
- FFs other than OpenFF should also be runnable; engine doesn't matter, GROMACS or OpenMM
- probably ideal to be able to run both, to ensure result isn't fundamentally different between engines
- needed for testing NMR and other observables for comparison to experiment
- unlikely to get needed aggregate sim time with other compute resources
- in particular, need to be able to get large amounts of data back to analyze with HDXER and other software; need full trajectories?
- perhaps there are aggregate results that can be calculated within the system to make this less data-intensive downstream?
- want to avoid routine, massive downloads from S3 where possible
- anticipated simulation volume for this effort is 6.9 ms of aggregate sampling per run, with much of this composed of max 140 residue (disordered proteins) / 216 residue (ordered, folded proteins) systems
- will seriously need to consider observable calculation in-system, as results come back to the work server
My preference for this use case is to have full trajectories as output and do the analysis locally. You can use multiple forward models for each type of observable, and you can use the same set of trajectories for comparison with multiple experimental observables. If we decide later that we want to swap out the forward model or include an additional observable, it will be useful to have the raw trajectories on hand.
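As a rough illustration of why raw trajectories help, the sketch below applies two forward models to the same per-frame data. It uses the general Karplus form for scalar couplings, J = A cos²θ + B cos θ + C. The coefficient sets, phase offsets, model names, and dihedral values are placeholders invented for the example, not literature parameters or real simulation output.

```python
import math
from statistics import fmean

# Hypothetical forward models: (A, B, C, phase offset in degrees).
# Placeholder numbers for illustration only, NOT published Karplus parameters.
FORWARD_MODELS = {
    "placeholder_model_1": (7.0, -1.0, 1.0, -60.0),
    "placeholder_model_2": (6.0, -2.0, 2.0, -60.0),
}


def karplus_coupling(phi_deg, a, b, c, phase_deg=0.0):
    """Karplus relation J = A cos^2(theta) + B cos(theta) + C, theta = phi + phase."""
    theta = math.radians(phi_deg + phase_deg)
    return a * math.cos(theta) ** 2 + b * math.cos(theta) + c


def ensemble_average(phi_frames_deg, params):
    """Average the predicted coupling over all trajectory frames."""
    a, b, c, phase_deg = params
    return fmean(karplus_coupling(phi, a, b, c, phase_deg) for phi in phi_frames_deg)


# Placeholder backbone phi dihedrals (degrees), standing in for values
# extracted once from a stored trajectory.
phi_frames_deg = [-65.0, -70.0, -120.0, -135.0]

# Swapping or adding a forward model only re-runs this step, not the simulation.
predicted = {
    name: ensemble_average(phi_frames_deg, params)
    for name, params in FORWARD_MODELS.items()
}
for name, value in predicted.items():
    print(f"{name}: <J> = {value:.2f} Hz")
```

If only a precomputed average from one forward model were kept, changing the model would mean rerunning the simulations rather than this cheap post-processing step.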