QCEngine icon indicating copy to clipboard operation
QCEngine copied to clipboard

Generate `basis` if not specified in OpenMMHarness

Open dotsdl opened this issue 4 years ago • 18 comments

During #151, it was decided that the basis field in AtomicInput for the OpenMMHarness need not be specified. When not specified, it would be generated from the contents of url or offxml. This is partially implemented, but in a fragile way that is not desirable, in OpenMMHarness._generate_basis.

Ideally, the basis contents would be generated as:

  • f"{forcefield_name}-{hash(forcefield_schema)}" for non-versioned forcefields
  • f"{forcefield_name}-{forcefield_version}" for versioned forcefields

However, there does not appear to exist a way to pull forcefield_version from the XML contents of an e.g. SMIRNOFF force field. This will likely require this addition to future releases of SMIRNOFF force fields.

dotsdl avatar Dec 08 '19 18:12 dotsdl

Thanks for starting this discussion, @dotsdl.

@dgasmith brought up this Issue this morning, and we had a productive conversation. Basically, we thought a scheme like this would be good starting point for discussion

Use examples

Desired force field Program Method Basis Keywords
openff-1.0.0 openforcefield parsley 1.0.0 constraints:true
openff_unconstrained-1.0.0 openforcefield parsley 1.0.0 none
Parsley with some simple bugfix (1.0.1) openforcefield parsley 1.0.1 none
Sage openforcefield sage 1.1.0 None
Sage openforcefield sage Latest? None
Rdkit UFF rdkit uff None None
AMBER GAFF ambertools gaff 2.11 {ES/NB cutoff, etc}

The "openforcefield" program string would be used to mean "the openforcefield toolkit and all dependencies, including OpenMM and openforcefields (the conda package containing Parsley)"

The "constraints" keyword

Parsley's technical (actual) name is "openff-1.0.0.offxml", and it is intended for long-timestep MD. There is another "Parsley" that is intended for small molecule operations, which differs only by the removal of hydrogen bond length constraints, called `openff_unconstrainted-1.0.0.offxml". We understand that this is confusing, and it's caused by the fact that OpenFF and OpenMM consider the hbond constraints to be part of the FF (since it affects system energy) while most other comp chem packages don't. It's very confusing for everyone involved, but we have to do it this way unless we rewrite large portions of ParmEd.

We basically came to the conclusion that, if someone is running Parsley in QCEngine, they probably don't want hbond length constraints, so the default should be to use the unconstrained file, however the constrained one could be requested by adding keywords (like hbond_constraints: true

FF codenames and semantic version numbers

Another issue was the fact that Parsley is anticipated to be the whole 1.0.Z series of FFs. Since the main purpose of OpenFF is to show incremental, largely automated improvement in FFs, we left ourselves wiggle room to have "bugfix" version increments within the same FF name. Currently there is no openff-1.0.1, but there easily could be in the future. This has two consequences:

  • if we have "basis" be the X.Y.Z version, there will need to be some validation that occurs to ensure that "Parsley" lines up with "1.0.Z", and "Sage" lines up with "1.1.Z"
  • there is some danger in allowing Latest as a basis, since Parsley Latest could refer to different FFs on different days. We could consider these to be "significant" changes. For this reason, we think we should forbid "Latest" basis values on the global QCArchive, while QCEngine could support it for local runs.

j-wags avatar Jan 24 '20 01:01 j-wags

In row 3 of the table, is there a typo in the version?

mattwelborn avatar Jan 24 '20 08:01 mattwelborn

Thanks, @j-wags for the detailed post on the topic.

After thinking about the proposed scheme from the early discussion I would say that it looks good but would suggest some minor tweaks. I mainly think that the method should imply the X.Y of the version number (for example parsley-> 1.0.Z sage-> 1.1.Z ) and basis should just cover the minor bump number. In this case, I would not allow "latest" as an option to keep consistent with the rest of QCEngine where a basis should be defined or None. We could possibly allow None to default to Z=0 version of the force field which implies you just want initial parsley or sage with no extras.

jthorton avatar Jan 24 '20 13:01 jthorton

As a note we are considering the program=openforcefield rather than program=openmm for the following reasons:

  1. QCEngine wants to know a priori if it can run a task or not which has been chosen to have a single boolean at the program found level for simplicity. Since OpenMM may have many programs (or none) to type and generate the System object (openforcefield, RDKit, Antechamber, etc) this becomes highly parametrizable and the ability to run a task now depends on not just the program, but the route inside the harness the task will take.
  2. It simplifies the complexities of running MM computations. MM in QCEngine is always going to be a bit awkward, at the end of the day I think we will have two users here 1) I am a QC person and just want to run a forcefield (really any force field) to clean up my molecule. 2) I want to benchmark my forcefield within a QC framework.
  3. program=openforcefield would rely on the OpenMMHarness and would lock in System generation based on its own best practices and thus have a concrete dependancy tree.

Point 1) Is really a blocker from a Fractal usage point of view. This is solvable, but would take some moderate engineering time.

@loriab

dgasmith avatar Jan 24 '20 14:01 dgasmith

This splitting sounds pretty complicated, since we have to hard-code a translation table that allows us to go from a force field name to the JSON entries and vice-versa in order to be able to interpret anything. Something as simple as querying for "show me data with openff-1.0.0" becomes quite complicated and difficult to search for since complex queries are required.

I was going to propose we shift to using openmm-forcefields under the hood to shift the burden if resolving how to parameterize small molecules to the OpenMM devs, since we can easily pick up GAFF 1/2 support this way and add more support in the future through the SystemGenerator mechanism, which provides a mechanism for querying which force fields via SystemGenerator.SMALL_MOLECULE_FORCEFIELDS. However, this would only report those that are potentially available---what is actually able to be calculated will depend on what the qcfractal workers have installed.

The drawback is that there are a couple of things that have to happen before everything is released on conda, so we can continue to use the current proposed method as long as we standardize the program/method/basis/keywords choices.

jchodera avatar Jan 24 '20 16:01 jchodera

I don't quite understand, splitting would make this as easy if not easier. program='openforcefield', method='parsley', method=parsley, basis='1.0.2' would all return relevant queries.

Using openmm-forcefields as a program where it depends on other libraries installed is going to be a nightmare from a Fractal point of view (think hf3c errors, but more systemic).

dgasmith avatar Jan 24 '20 16:01 dgasmith

I don't quite understand, splitting would make this as easy if not easier. program='openforcefield', method='parsley', method=parsley, basis='1.0.2' would all return relevant queries.

Here's why this is confusing: The way this force field is instantiated in Python is:

from openforcefield.typing.engines.smirnoff import ForceField
forcefield = ForceField('openff_unconstrained-1.0.0.offxml')

Or, if using the constrained version (program='openforcefield', method='parsley', method=parsley, basis='1.0.2', keywords=['constraints:true']),

from openforcefield.typing.engines.smirnoff import ForceField
forcefield = ForceField('openff-1.0.0.offxml')

The mapping table to translate back and forth between the filename the force field refers to and the query you must use to find results for it (or, equivalently, the information contained in the JSON file) is so complicated that even our lead software engineer made at least one mistake in constructing it.

Unless I'm missing something that would easily automate the translation back and forth between the string used to invoke the force field through ForceField and the rendering of this in QCSchema, the proposed scheme is just too complicated to be practical.

Edit: This is so complicated I also made a mistake in the first version of this post.

jchodera avatar Jan 25 '20 04:01 jchodera

Using openmm-forcefields as a program where it depends on other libraries installed is going to be a nightmare from a Fractal point of view (think hf3c errors, but more systemic).

One other practical point here: The dependencies that openmm-forcefield requires still need to be installed in order to compute the energies on the QCFractal client, so we're going to run into this issue no matter what. Perhaps we need some way for QCFractal clients to advertise their capabilities?

jchodera avatar Jan 25 '20 04:01 jchodera

@jchodera Fair enough, can you build a table similar to @j-wags so that we can get something concrete to look at implementing here?

dgasmith avatar Jan 25 '20 16:01 dgasmith

The most important idea is to make sure we keep the prefixes whole (e.g. smirnoff99Frosst-1.1.0, openff-1.0.0) so that it is unambiguous how to both find the data and identify the force field that generated it.

Constraints are specified unambiguously in the .offxml file (formed by prefix + '.offxml' for SMIRNOFF force fields).

Keywords can instead reflect any special keywords that are used to instantiate system construction for GAFF force fields (added when we switch to openmm-forcefields in a later PR).

The "Program" can refer to the program that actually computed the energy.

The "Method" can refer to the typing engine that provided the parameters (e.g. smirnoff vs antechamber)

The "Basis" can refer to the force field specification string that uniquely identifies the file feeding the "Method" (e.g. openff-1.0.0, smirnoff99Frosst-1.1.0, gaff-1.81, gaff-2.11).

If we want, we can add "codename" to "keywords" for convenience, but these are just codenames.

Here's an example of how that would look when we expand support with openmm-forcefields and RDKit:

Desired force field Program Method Basis Keywords
smirnoff99Frosst-1.1.0 openmm smirnoff smirnoff99Frosst-1.1.0 none
openff-1.0.0 openmm smirnoff openff-1.0.0 none
openff_unconstrained-1.0.0 openmm smirnoff openff_unconstrained-1.0.0 none
openff-1.0.1 openmm smirnoff openff-1.0.1 none
openff-1.1.0 openmm smirnoff openff-1.1.0 none
AMBER gaff-1.81 openmm antechamber gaff-1.81 {'constraints:HBonds', 'nonbondedMethod:NoCutoff'}
AMBER gaff-2.11 openmm antechamber gaff-2.11 {'constraints:none', 'nonbondedMethod:NoCutoff'}
RDKit UFF rdkit ForceField UFF none
RDKit MMFF94 rdkit ForceField MMFF94 none
RDKit MMFF94s rdkit ForceField MMFF94s none
OpenEye MMFF94 oechem szybki MMFF94 none
OpenEye MMFF94s oechem szybki MMFF94s none

To me, this looks much simpler, and it also keeps the implementation simple and extensible (since there's a simple way to dispatch the basis field to calculations under the hood.

jchodera avatar Jan 25 '20 18:01 jchodera

Seems structurally good. One item is to consider switching Method/Basis, something like UFF/GAFF make much more sense in the QC world as a method while something like the typing engine refers to the basis.

dgasmith avatar Jan 25 '20 22:01 dgasmith

Using openmm-forcefields as a program where it depends on other libraries installed is going to be a nightmare from a Fractal point of view (think hf3c errors, but more systemic).

One other practical point here: The dependencies that openmm-forcefield requires still need to be installed in order to compute the energies on the QCFractal client, so we're going to run into this issue no matter what. Perhaps we need some way for QCFractal clients to advertise their capabilities?

I'm not sure how relevant this would be, but it is possible to modify the found method to additionally take method/driver (or AtomicInput containing) and judge whether additional deps are available to satisfy that method. This wouldn't help with the broad available methods list from qcng but could help with a quick exit at runtime since found is called at the start of compute.

loriab avatar Jan 25 '20 23:01 loriab

Seems structurally good. One item is to consider switching Method/Basis, something like UFF/GAFF make much more sense in the QC world as a method while something like the typing engine refers to the basis.

I'm happy to defer to your preferences here! To me, the important principles are simply to keep the prefix needed to form the offxml file on the search path (e.g. smirnoff99Frosst-1.1.0) or drive SystemGenerator (e.g. gaff-1.81) whole.

jchodera avatar Jan 26 '20 20:01 jchodera

@dotsdl @mattwelborn @j-wags @loriab

Any comments before I implement this?

dgasmith avatar Jan 29 '20 18:01 dgasmith

I’m happy with the current proposal

On Jan 29, 2020, at 13:44, Daniel Smith [email protected] wrote:

 @dotsdl @mattwelborn @j-wags @loriab

Any comments before I implement this?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

mattwelborn avatar Jan 29 '20 18:01 mattwelborn

This looks good to me. I think it's a good change to require the FF filename/prefix instead of the "media" name, Parsley.

j-wags avatar Jan 29 '20 19:01 j-wags

First pass through whole thread -- sorry I missed where I was pinged.

Proposal looks good (insofar as my limited MM usage extends). I think I agree with switching method/basis columns here. From the QM perspective, method is the guts and basis a detail, and if any is skipped or redundant it should be basis.

loriab avatar Jan 29 '20 19:01 loriab

I like where the proposal has landed. @dgasmith please let me know if you want help with implementation or review when the PR is up.

dotsdl avatar Jan 30 '20 05:01 dotsdl