qca-dataset-submission icon indicating copy to clipboard operation
qca-dataset-submission copied to clipboard

Run OpenMM Parsley torsion scans

Open ChayaSt opened this issue 5 years ago • 17 comments

I want to run torsion scans with Parsley and OpenMM on the OpenFF Substituted Phenyl Set 1. The JSON for the inputs are here:

https://github.com/openforcefield/qca-dataset-submission/blob/master/2019-07-25-phenyl-set/phenyl_set_input.tar.jz and https://github.com/openforcefield/qca-dataset-submission/blob/master/2019-07-25-phenyl-set/biphenyls_set_input.tar.gz

The second JSON file is higher priority than the first so those should be run first.

Thanks!

ChayaSt avatar Jan 21 '20 21:01 ChayaSt

Can we just run this on all of OpenFF Substituted Phenyl Set 1?

dgasmith avatar Jan 23 '20 18:01 dgasmith

Yes - All the molecules in the linked JSONs are in the OpenF Substituted Phenyl Set 1

ChayaSt avatar Jan 23 '20 18:01 ChayaSt

Still some issues in running this through QCEngine:

import qcfractal.interface as ptl
import qcengine as qcng

client = ptl.FractalClient()

ds = client.get_collection("torsiondrivedataset", "OpenFF Substituted Phenyl Set 1")
ds.query("default")

good = []
for idx, row in ds.df["default"].items():

    mol = client.query_molecules(id=row.initial_molecule[0])[0]

    inp = {
    "molecule": mol,
    "driver": "energy",
    "model": {"method": "openff-1.0.0", "basis": "smirnoff"}
    }

    ret = qcng.compute(inp, "openmm")
    if ret.success:
        good.append(idx)

print(good)
print(len(good))
print(ds.df.shape)

78/154 succeed at the moment.

dgasmith avatar Feb 08 '20 14:02 dgasmith

I didn't run the whole set, but many of the errors seemed to be RDKit's "a neutral nitrogen has four bonds" complaint. I see that QCEngine screens to remove molecules with net charge, but in this case, zwitterions like a nitro group (with N+ and O-) are causing the trouble. The relevant code is here: https://github.com/MolSSI/QCEngine/blob/c8ae155267a69af0be2a63034ea9224bbebd9d34/qcengine/programs/rdkit.py#L33-L74

The next release of OFFTK (mid-March, I think) will also include @jthorton's OFFMol.from_qcschema and OFFMol.from_tagged_smiles functions, either of which should be able to solve this as long as there's a (S/C)MILES around.

Another option (and I haven't thought this fully through) is that knowing net charge, elements, connectivity (with bond orders), and explicit Hs, MAY let us guess suitable formal charges for a graph mol. I did this in a notebook to see if it would work in the RDKit Harness and it may be dangerous, but it works. Here's a code snippet showing how the RDKit Harness code could be modified to work for the example above.

from rdkit import Chem
from qcengine.units import ureg
import qcportal as ptl
import qcengine as qcng

client = ptl.FractalClient()

ds = client.get_collection("torsiondrivedataset", "OpenFF Substituted Phenyl Set 1")
ds.query("default")

good = []
items = [i for i in ds.df["default"].items()]
idx, row = items[7]
mol = client.query_molecules(id=row.initial_molecule[0])[0]

### This part is in RDKitHarness._process_molecule_rdkit

# Handle errors
if abs(jmol.molecular_charge) > 1.0e-6:
    raise InputError("RDKit does not currently support charged molecules.")

if not jmol.connectivity:  # Check for empty list
    raise InputError("RDKit requires molecules to have a connectivity graph.")

# Build out the base molecule
base_mol = Chem.Mol()
rw_mol = Chem.RWMol(base_mol)
for sym in jmol.symbols:
    rw_mol.AddAtom(Chem.Atom(sym.title()))

# Add in connectivity
bond_types = {1: Chem.BondType.SINGLE, 2: Chem.BondType.DOUBLE, 3: Chem.BondType.TRIPLE}
for atom1, atom2, bo in jmol.connectivity:
    rw_mol.AddBond(atom1, atom2, bond_types[bo])

# Predict formal charges on atoms (needed to construct valid graph molecule)
# NOTE: This is only sufficient if 
# 1) all hydrogens are explicit
# 2) bond orders are known
# 3) The net charge on the molecule is known (possible valences of P and S can introduce ambiguity)
neutral_valences = {'H': (1,), 'C': (4,), 'O': (2,), 
                    'N': (3,), 'P': (4, 6), 'S': (2, 6), 
                    'F': (1,), 'Cl': (1,), 'Br': (1,), 
                    'I': (1,)}
net_charge = 0
for atom in rw_mol.GetAtoms():
    # Get the sum of the bond orders involving this atom
    valence_sum = sum([bond.GetBondType() for bond in atom.GetBonds()])
    element_symbol = qcelemental.periodictable.E[atom.GetAtomicNum()]
    if element_symbol not in neutral_valences:
        raise InputError(f"RDKit harness does not currently support element {element_symbol}")
    possible_charges = [valence_sum - neutral_valence for neutral_valence in neutral_valences[element_symbol]]
    # Sort to put the lowest-magnitude charge first
    possible_charges.sort(key=lambda x: abs(x))
    atom_charge = possible_charges[0]
    atom.SetFormalCharge(atom_charge)
    net_charge += atom_charge
if net_charge != jmol.molecular_charge:
    raise InputError(f"RDKit harness is unable to deduce formal charges on molecule {jmol}.")

mol = rw_mol.GetMol()


# Write out the conformer
natom = len(jmol.symbols)
conf = Chem.Conformer(natom)
bohr2ang = ureg.conversion_factor("bohr", "angstrom")
for line in range(natom):
    conf.SetAtomPosition(
        line,
        (
            bohr2ang * jmol.geometry[line, 0],
            bohr2ang * jmol.geometry[line, 1],
            bohr2ang * jmol.geometry[line, 2],
        ),
    )

mol.AddConformer(conf)

Chem.rdmolops.SanitizeMol(mol)

print(Chem.MolToSmiles(mol))

[H]c1c([H])c(C(=O)OC([H])([H])C([H])([H])[H])c([H])c(N+[O-])c1[H]

j-wags avatar Feb 09 '20 07:02 j-wags

Any thoughts on waiting for the OFFTK release or changing the current scheme to reflect above?

Pining @jchodera as well.

dgasmith avatar Feb 10 '20 16:02 dgasmith

Talked to @jthorton a bit about this today. Here's the our take on this:

  • bad option If we don't have access to a CMILES, but we want to OFF-parameterize a QCA molecule (which we assume contains connectivity with bond order, element, and total charge), we have to guess formal charge on the atoms. Guessing formal charge (like I do in the snippet above) is dangerous and runs the risk of misinterpreting the graph molecule. So if we choose to do the "freshman-chemistry-guess-the-formal-charge" approach above, we should version it so the guessing algorithm can be updated as we find new common valences or include new elements.
  • medium option Josh's new function (coming out in the next OFFTK release) can load QCSchema mols that have CMILES (so the "initial molecules"), but not the "final molecules" by themselves (since they don't have CMILES). It is possible to load the "initial molecule" and then apply coordinates from the "final molecule", but this will give really high energies if the bonds are rearranged. So if we could make the QCEngine call require both the initial molecule and final geometry, then we can run the calculations, they'll just return very high energy if bonds are rearranged.
  • best option Same as above, but with connectivity rearrangement checks (inferring connectivity from interatomic distance). We've sketched out the functionality for checking this connectivity, but I don't think it's implemented currently (@jthorton can correct if I'm wrong). With that ability, we could throw an informative error if the connectivity changed.

@jthorton pointed out that we wouldn't need to make the QCEngine call take both the initial and final molecules if we could sneak CMILES into the extras field of final molecules (as extras['initial_canonical_isomeric_explicit_hydrogen_mapped_smiles'])

j-wags avatar Feb 10 '20 21:02 j-wags

@dgasmith and I chatted about this recently. I still have to catch up on the details, but is it possible this is just a case where we need to harmonize the bond order scheme used by QCElemental with the bond order scheme we use within openforcefield.topology.Molecule to ensure we can interconvert with RDKit/OpenEye?

We should also be using the openforcefield.topology.Molecule object for all of the instantiation of toolkit molecules, simplifying the above conversion examples to something like

from openforcefield.topology import Molecule
molecule = Molecule.from_qcarchive(qcarchive_json)
rdmol = molecule.to_rdkit()

jchodera avatar Feb 11 '20 00:02 jchodera

is it possible this is just a case where we need to harmonize the bond order scheme used by QCElemental with the bond order scheme we use within openforcefield.topology.Molecule to ensure we can interconvert with RDKit/OpenEye?

No, the issue is that graph molecules need atomic formal charges, which it isn't trivial to unambiguously deduce from the QCElemental molecule's information (connectivity + bond orders + elements + total charge). This is because since some atoms can have multiple accepted valence states, for example sulfur can have two bonds (thiol) or six bonds (sulfonamide).

If we hammer on the code for deducing the formal charge a bit, we could come up with a pretty good solution. But, like all code, it would make assumptions and have bugs, so we will want to keep a versioning scheme for it handy.

j-wags avatar Feb 11 '20 03:02 j-wags

Thanks for clarifying!

I guess the important questions we need to ask are:

  1. Do we actually need to set the formal charges when creating a toolkit molecule?
  2. If so, if we use a simple heuristic to assign formal charges based on valence and net charge, would the cheminformatics toolkits correct the formal charges if they are incorrect?
  3. Even if the toolkits don't correct the formal charges, are there any consequences for any of the operations we carry out (such as assigning AM1-BCC charges or generating conformers) if the formal charges are incorrect but the valence and net charge are correct?

If we do need a simple heuristic, iterating over combinations of valid formal charge combinations in a manner that places negative charge on more electronegative molecules should work just fine.

jchodera avatar Feb 12 '20 13:02 jchodera

This paper could be useful in the future: [PDF]

jchodera avatar Feb 12 '20 13:02 jchodera

Do we actually need to set the formal charges when creating a toolkit molecule?

As far as I know, RDKit needs the formal charge. Otherwise it will throw an error that the valance is wrong.

ChayaSt avatar Feb 12 '20 13:02 ChayaSt

As far as I know, RDKit needs the formal charge. Otherwise it will throw an error that the valance is wrong.

@ChayaSt : What happens if we have one or more sites that have multiple possible valences, and we provide valid (but suboptimal) formal charges? Does it have any consequence for generating conformers or partial charges?

jchodera avatar Feb 17 '20 17:02 jchodera

Do we actually need to set the formal charges when creating a toolkit molecule?

Yes. We have some SMARTS that get matched to specific formal charges https://github.com/openforcefield/openforcefields/blob/master/openforcefields/offxml/openff-1.0.0.offxml#L24

If so, if we use a simple heuristic to assign formal charges based on valence and net charge, would the cheminformatics toolkits correct the formal charges if they are incorrect?

Not sure. This issue originally came up because RDKit was being strict about nitrogens with too many bonds, so we'd need to look into to sanitization options that aren't part of our normal workflow. However there is probably a way to make it guess... Chem.MolFromPDB must do something for this problem.

Even if the toolkits don't correct the formal charges, are there any consequences for any of the operations we carry out (such as assigning AM1-BCC charges or generating conformers) if the formal charges are incorrect but the valence and net charge are correct?

It shouldn't matter for QM -- the methods I know of don't WANT to know the formal charges. Just the total charge on the molecule. Conf gen might need to know -- AFAIK they'll use MMFF94 which probably needs knowledge of charges. And, per above, our SMARTS matching requires knowledge of formal charges.

j-wags avatar Feb 17 '20 19:02 j-wags

@mattwthompson This may be largely solved by #472, though it would be good to test that solution. Further extensions may try to detect proton migrations/connectivity rearrangements by comparing the connectivity derived from geometry or Wiberg Bond Orders to that described in the CMILES.

j-wags avatar Mar 30 '20 17:03 j-wags

@j-wags @bennybp, what is the status of the openmm torsion scans? I tried checking the status and I'm getting an error:

import qcfractal.interface as ptl
client = ptl.FractalClient()
dataset = client.get_collection('TorsionDriveDataset', 'OpenFF Substituted Phenyl Set 1')
dataset.status('openmm')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/src/molssi/QCFractal/qcfractal/interface/collections/collection.py in get_specification(self, name)
    372         try:
--> 373             return self.data.specs[name.lower()].copy()
    374         except KeyError:

KeyError: 'openmm'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-1-73c09af38155> in <module>
      2 client = ptl.FractalClient()
      3 dataset = client.get_collection('TorsionDriveDataset', 'OpenFF Substituted Phenyl Set 1')
----> 4 dataset.status('openmm')

~/src/molssi/QCFractal/qcfractal/interface/collections/collection.py in status(self, specs, collapse, status, detail)
    580                 list_specs = []
    581                 for spec in specs:
--> 582                     list_specs.append(self.query(spec))
    583 
    584             def get_status(item):

~/src/molssi/QCFractal/qcfractal/interface/collections/collection.py in query(self, specification, force)
    513         """
    514         # Try to get the specification, will throw if not found.
--> 515         spec = self.get_specification(specification)
    516 
    517         if not force and (spec.name in self.df):

~/src/molssi/QCFractal/qcfractal/interface/collections/collection.py in get_specification(self, name)
    373             return self.data.specs[name.lower()].copy()
    374         except KeyError:
--> 375             raise KeyError(f"Specification '{name}' not found.")
    376 
    377     def list_specifications(self, description=True) -> Union[List[str], pd.DataFrame]:

KeyError: "Specification 'openmm' not found."

ChayaSt avatar May 05 '20 18:05 ChayaSt

@ChayaSt The name of the specification (first argument to status) is default

dataset.status('default') gives me 151 complete, 3 errors

(see dataset.list_specifications())

bennybp avatar May 05 '20 19:05 bennybp

@bennybp, that is for the B3lyp torsion scans. I'm trying to access the MM torsion drive.

On second thought, this might have not been fully deployed in QCEngine.

ChayaSt avatar May 05 '20 20:05 ChayaSt