openff-toolkit
openff-toolkit copied to clipboard
thiophene gets only 1 C atom type, but 2 different C-C bond lengths
Hi folks, I am working towards a python-based workflow for the parameterisation of a library of aromatic polymer fragments. In the past, I used exclusively ambertools/tleap but I would like to speed up and automatise my workflow, hence my interest in openff-toolkit.
Describe the bug
In the example below, I try to get a topology file for thiophene starting from SMILES. However the resulting file (attached) contains only 1 carbon type, C1. To my surprise I get 2 different H types, H1 and H2:
; name at.num mass charge ptype sigma epsilon
C1 6 12.010780 0.00000000 A 0.34806469 0.36350306
S1 16 32.065500 0.00000000 A 0.35635949 1.046
H1 1 1.007947 0.00000000 A 0.24536265 0.054839681
H2 1 1.007947 0.00000000 A 0.25725815 0.06531786
on the other hand, the topology correctly differentiates between the 2 kinds of C-C bonds:
[ bonds ]
; ai aj funct c0 c1 c2 c3
1 2 1 0.13717 334016.498332
2 3 1 0.14546 233580.248618
3 4 1 0.13717 334016.498332
4 5 1 0.17602 225745.023800
5 1 1 0.17602 225745.023800
1 6 1 0.10854 332422.631675
2 7 1 0.10854 332422.631675
3 8 1 0.10854 332422.631675
4 9 1 0.10854 332422.631675
To Reproduce
import parmed
from openff.toolkit.topology import Molecule, Topology
from openff.toolkit.typing.engines.smirnoff import ForceField
forcefield = ForceField("openff_unconstrained-2.0.0.offxml")
mol = Molecule.from_smiles('c1cccs1')
top = Topology.from_molecules(mol)
sys = forcefield.create_openmm_system(top)
structure = parmed.openmm.load_topology(top.to_openmm(), system=sys)
structure.residues[0].name = "RES"
structure.save('test_thio.top')
This ambiguity in the topology results in an error when I try and write amber files with parmed:
structure.save('test_thio.prmtop')
test = parmed.load_file('test_thio.prmtop')
parmed.tools.actions.writeFrcmod(test, 'test_thio.frcmod').execute()
Output
---------------------------------------------------------------------------
ParameterError Traceback (most recent call last)
Input In [87], in <cell line: 1>()
----> 1 parmed.tools.actions.writeFrcmod(test, 'test_thio.frcmod').execute()
File ~/anaconda3/envs/mamba/envs/mdapsikit/lib/python3.9/site-packages/parmed/tools/actions.py:344, in writeFrcmod.execute(self)
342 if not Action.overwrite and os.path.exists(self.frcmod_name):
343 raise FileExists('%s exists; not overwriting' % self.frcmod_name)
--> 344 parmset = AmberParameterSet.from_structure(self.parm)
345 title = 'Force field parameters from %s' % os.path.split(str(self.parm))[1]
346 parmset.write(self.frcmod_name, title=title)
File ~/anaconda3/envs/mamba/envs/mdapsikit/lib/python3.9/site-packages/parmed/amber/parameters.py:364, in AmberParameterSet.from_structure(cls, struct)
349 @classmethod
350 def from_structure(cls, struct):
351 """ Extracts known parameters from a Structure instance
352
353 Parameters
(...)
362 The parameter set with all parameters defined in the Structure
363 """
--> 364 return super(AmberParameterSet, cls).from_structure(struct, allow_unequal_duplicates=False)
File ~/anaconda3/envs/mamba/envs/mdapsikit/lib/python3.9/site-packages/parmed/parameters.py:228, in ParameterSet.from_structure(cls, struct, allow_unequal_duplicates)
226 if key in params.bond_types:
227 if not allow_unequal_duplicates and params.bond_types[key] != bond.type:
--> 228 raise ParameterError('Unequal bond types defined between %s and %s' % key)
229 continue # pragma: no cover
230 typ = copy(bond.type)
ParameterError: Unequal bond types defined between C1 and C1
Copying the parmed error is a bit of a tangent, but I hope you don't mind. My overall goal is to save each small fragment as a library file and build polymers by mixing and matching - not sure whether this could be achieved entirely through openff/openmm data structures?
Computing environment:
- Mac OS 11.4
- ambertools 21.11
- openbabel 3.1.1
- openff-bespokefit 0.1.1
- openff-forcefields 2.0.0
- openff-fragmenter-base 0.1.2
- openff-qcsubmit 0.3.1
- openff-toolkit 0.10.4
- openff-toolkit-base 0.10.4
- openff-utilities 0.1.3
- openmm 7.7.0
Additional context
Thanks for the thorough report - I'm pretty sure this is a bug in ParmEd and/or this interface to it, but I haven't looked closely enough to say for sure. I'll have another look tomorrow and update you then.
My overall goal is to save each small fragment as a library file and build polymers by mixing and matching - not sure whether this could be achieved entirely through openff/openmm data structures?
I assume your fragments here are monomers of some sort within a larger, connected polymer - i.e. bonded to neighboring fragments (/residues/monomers)?
I'm pretty sure this is a bug in ParmEd and/or this interface to it
I saved the OpenMM system into an xml file and the treatment of Cs and Hs is consistent with what I see in ParmEd-saved topology files:
<Particle eps=".3635030558377792" q="-.1688777777777778" sig=".3480646886945065"/> #--> assigned type C1
<Particle eps=".3635030558377792" q="-.1567777777777778" sig=".3480646886945065"/> # --> C1 but I'd expect it to be C2
<Particle eps=".3635030558377792" q="-.1567777777777778" sig=".3480646886945065"/> # --> C1 but I'd expect it to be C2
<Particle eps=".3635030558377792" q="-.1688777777777778" sig=".3480646886945065"/> #--> assigned type C1
<Particle eps="1.046" q=".008422222222222216" sig=".35635948725613575"/> #--> S1
<Particle eps=".05483968129296432" q=".16922222222222222" sig=".2453626527729973"/> #--> assigned type H1
<Particle eps=".06531785996356952" q=".1522222222222222" sig=".25725815350632797"/> #--> assigned type H2 (why?)
<Particle eps=".06531785996356952" q=".1522222222222222" sig=".25725815350632797"/>
<Particle eps=".05483968129296432" q=".16922222222222222" sig=".2453626527729973"/>
I assume your fragments here are monomers of some sort within a larger, connected polymer - i.e. bonded to neighboring fragments (/residues/monomers)?
exactly. I know this is currently not in scope, but I'm trying to see what I can retool from existing libraries/projects.
A coarse solution is to add this flag (https://github.com/ParmEd/ParmEd/pull/1087/files) when saving the files in the first place:
import parmed
from openff.toolkit.topology import Molecule, Topology
from openff.toolkit.typing.engines.smirnoff import ForceField
forcefield = ForceField("openff_unconstrained-2.0.0.offxml")
mol = Molecule.from_smiles('c1cccs1')
top = Topology.from_molecules(mol)
sys = forcefield.create_openmm_system(top)
-structure = parmed.openmm.load_topology(top.to_openmm(), system=sys)
+structure = parmed.openmm.load_topology(top.to_openmm(), system=sys, condense_atom_types=False)
structure.residues[0].name = "RES"
structure.save('test_thio.top')
This at least gets the next block of code running and prevents the two carbon types . It can have the downside you'd expect - having too many atom types:
$ grep "C[0-9]" test_thio.top
C1 6 12.010780 0.00000000 A 0.34806469 0.36350306
C2 6 12.010780 0.00000000 A 0.34806469 0.36350306
C3 6 12.010780 0.00000000 A 0.34806469 0.36350306
C4 6 12.010780 0.00000000 A 0.34806469 0.36350306
1 C1 1 RES C1x 1 -0.16887778 12.010780 ; qtot -0.168878
2 C2 1 RES C2x 2 -0.15677778 12.010780 ; qtot -0.325656
3 C3 1 RES C3x 3 -0.15677778 12.010780 ; qtot -0.482433
4 C4 1 RES C4x 4 -0.16887778 12.010780 ; qtot -0.651311
where you'd expect 2, this gives you 4, and you can see how that might not be workable for production systems. This is fine in OpenMM's eyes since it doesn't distinguish them to begin with, but IIRC some engines don't want to build the non-bonded interaction matrix with hundreds or thousands of atom types.
I'm not sure why the carbon atoms with different charges are condensed onto the same types. The piece of ParmEd code should be considering charges when asking itself if two OpenMM atoms should be assigned the same atom type.
The next release of the toolkit (0.11.0, RC package hopefully online within the month, full release a month after that) adds support for (bio)polymers and switches OpenMM system creation to our pseudo replacement for Parmed, Interchange. Non-biological polymers are certainly something we want to support better throughout our infrastructure, but if your work is timely, it might not be worth changing course from how you're currently wrestling existing tools.
To one of your original questions:
To my surprise I get 2 different H types, H1 and H2:
I'm not sure I have a great explanation, but the vdW terms match to unique parameters for hydrogens and only one for carbon:
In [70]: [v.parameter_type for v in forcefield['vdW'].find_matches(top).values()]
Out[70]:
[<vdWType with smirks: [#6:1] epsilon: 0.0868793154488 kcal/mol id: n14 rmin_half: 1.953447017081 A >,
<vdWType with smirks: [#6:1] epsilon: 0.0868793154488 kcal/mol id: n14 rmin_half: 1.953447017081 A >,
<vdWType with smirks: [#6:1] epsilon: 0.0868793154488 kcal/mol id: n14 rmin_half: 1.953447017081 A >,
<vdWType with smirks: [#6:1] epsilon: 0.0868793154488 kcal/mol id: n14 rmin_half: 1.953447017081 A >,
<vdWType with smirks: [#16:1] epsilon: 0.25 kcal/mol id: n21 rmin_half: 2.0 A >,
<vdWType with smirks: [#1:1]-[#6X3]~[#7,#8,#9,#16,#17,#35] epsilon: 0.01310699839698 kcal/mol id: n8 rmin_half: 1.377051329051 A >, # H on C right next to S
<vdWType with smirks: [#1:1]-[#6X3] epsilon: 0.01561134320353 kcal/mol id: n7 rmin_half: 1.443812569645 A >, # H on C farther away from S
<vdWType with smirks: [#1:1]-[#6X3] epsilon: 0.01561134320353 kcal/mol id: n7 rmin_half: 1.443812569645 A >,
<vdWType with smirks: [#1:1]-[#6X3]~[#7,#8,#9,#16,#17,#35] epsilon: 0.01310699839698 kcal/mol id: n8 rmin_half: 1.377051329051 A >]
but both "types" of carbons and hydrogens are assigned different partial charges in ways you'd expect:
In [71]: mol.assign_partial_charges(partial_charge_method="am1bcc")
In [72]: [(a.atomic_number, a.partial_charge._value) for a in mol.atoms]
Out[72]:
[(6, -0.168909997265372),
(6, -0.15659000031236145),
(6, -0.15659000031236145),
(6, -0.16886000386956665),
(16, 0.00702999946143892),
(1, 0.16964000411745575),
(1, 0.15231999703165558),
(1, 0.15231999703165558),
(1, 0.16964000411745575)]
something that, again, I would have expected ParmEd to pick up on. Adding the flag fixes it
In [91]: structure = parmed.openmm.load_topology(top.to_openmm(), system=sys)
In [92]: [a.atom_type.name for a in structure.atoms]
Out[92]: ['C1', 'C1', 'C1', 'C1', 'S1', 'H1', 'H2', 'H2', 'H1']
In [93]: structure.atoms[0].atom_type == structure.atoms[1].atom_type == structure.atoms[2].atom_type == structure.atoms[3].atom_type
Out[93]: True
In [94]: structure = parmed.openmm.load_topology(top.to_openmm(), system=sys, condense_atom_types=False)
In [95]: [a.atom_type.name for a in structure.atoms]
Out[95]: ['C1', 'C2', 'C3', 'C4', 'S1', 'H1', 'H2', 'H3', 'H4']
In [96]: structure.atoms[0].atom_type == structure.atoms[1].atom_type == structure.atoms[2].atom_type == structure.atoms[3].atom_type
Out[96]: False
but I wouldn't expect that to have been necessary since ParmEd should noticed the different charges from OpenMM.
At first I thought this might have something to do with aromaticity, I think an organic chemistry textbook would say thiophene is aromatic, but the MDL aromaticity model (OpenEye/RDKit) does not:
In [62]: oemol = oechem.OEGraphMol()
In [63]: oechem.OESmilesToMol(oemol, "c1cccs1")
Out[63]: True
In [64]: oechem.OEAssignAromaticFlags(oemol, oechem.OEAroModel_MDL)
In [65]: [atom.IsAromatic() for atom in oemol.GetAtoms()]
Out[65]: [False, False, False, False, False]
In [66]: [atom.is_aromatic for atom in Molecule.from_smiles("c1cccs1").atoms]
Out[66]: [False, False, False, False, False, False, False, False, False]
so while intuitively I'd think that all carbons and hydrogens in thiophene should have similar physics, MDL tagging them as non-aromatic means they're assigned different parameters. (One could just as easily write a force field that has SMIRKS patterns matching the two types of carbons here differently, but Sage doesn't do that, my guess is that their physics is not unique enough to justify that and/or thiophenes are already modeled better than other chemistries that got more focus.) I'm not sure how much that helps but hopefully it adds some context.
The next release of the toolkit (0.11.0, RC package hopefully online within the month, full release a month after that) adds support for (bio)polymers and switches OpenMM system creation to our pseudo replacement for Parmed, Interchange. Non-biological polymers are certainly something we want to support better throughout our infrastructure, but if your work is timely, it might not be worth changing course from how you're currently wrestling existing tools.
This sounds exciting. I am working on a new workflow for my students and I'm hitting a few walls similar to this one, so I can afford to wait a few months for something that can potentially simplify my life (and eventually theirs).
The piece of ParmEd code should be considering charges when asking itself if two OpenMM atoms should be assigned the same atom type.
👀
I tried passing my system throughbespokefit just to see, but I get the same result with 1 C type and 2 H types.
I will try my luck with gaff and GAFFTemplateGenerator where the thiophene atoms are assigned as cc and cd types.
Thanks for looking into this @mattwthompson
I wonder if these files are closer to what you'd expect? I didn't run a simulation with them but they pass a quick eye test and the C-C bond parameters look correct (or at least different).
These are generated using the latest development versions of Interchange and the toolkit, which likely look similar to what will be in the next release. This snippet bypasses ParmEd completely. Small molecules with well-defined chemistry (i.e. from SMILES or an SDF file) should work well, but support for (non-biological) polymers is not pretty at the moment.
from openff.toolkit.topology import Molecule, Topology
from openff.toolkit.typing.engines.smirnoff import ForceField
forcefield = ForceField("openff_unconstrained-2.0.0.offxml")
mol = Molecule.from_smiles('c1cccs1')
mol.generate_conformers(n_conformers=1)
top = Topology.from_molecules(mol)
sys = forcefield.create_interchange(top)
sys.to_top("out.top")
sys.to_gro("out.gro")
Thanks for following up. I am from mobile but from a quick look the topology shows still 2 H atom types and only one C type. Have you tried to write an amber topology file? That’s where I got stuck with a similar file last time.
Not recently but Interchange has Amber exports ready (to_prmtop / to_inpcrd) when its new version is released
As far as I can see these files have the same problem as the ones in my original issue. There are 2 kinds of C-C bonds and only one atom type, so parmed would refuse to write a prmtop file because of this ambiguity and grompp would likely also abort the run.
In the meantime I have gotten an openeye license so I will try again with that, but I believe this problem might be related to issue openforcefield/discussions#5 regarding aromaticity.