openff-toolkit thiophene gets only 1 C atom type, but 2 different C-C bond lengths

Hi folks, I am working towards a python-based workflow for the parameterisation of a library of aromatic polymer fragments. In the past, I used exclusively ambertools/tleap but I would like to speed up and automatise my workflow, hence my interest in openff-toolkit.

Describe the bug

In the example below, I try to get a topology file for thiophene starting from SMILES. However the resulting file (attached) contains only 1 carbon type, C1. To my surprise I get 2 different H types, H1 and H2:

; name    at.num    mass    charge ptype  sigma      epsilon
C1             6  12.010780  0.00000000  A     0.34806469     0.36350306
S1            16  32.065500  0.00000000  A     0.35635949          1.046
H1             1   1.007947  0.00000000  A     0.24536265    0.054839681
H2             1   1.007947  0.00000000  A     0.25725815     0.06531786

on the other hand, the topology correctly differentiates between the 2 kinds of C-C bonds:

[ bonds ]
;    ai     aj funct         c0         c1         c2         c3
      1      2     1   0.13717 334016.498332
      2      3     1   0.14546 233580.248618
      3      4     1   0.13717 334016.498332
      4      5     1   0.17602 225745.023800
      5      1     1   0.17602 225745.023800
      1      6     1   0.10854 332422.631675
      2      7     1   0.10854 332422.631675
      3      8     1   0.10854 332422.631675
      4      9     1   0.10854 332422.631675

To Reproduce

import parmed
from openff.toolkit.topology import Molecule, Topology
from openff.toolkit.typing.engines.smirnoff import ForceField
forcefield = ForceField("openff_unconstrained-2.0.0.offxml")
mol = Molecule.from_smiles('c1cccs1')

top = Topology.from_molecules(mol)

sys = forcefield.create_openmm_system(top)

structure = parmed.openmm.load_topology(top.to_openmm(), system=sys)

structure.residues[0].name = "RES"
structure.save('test_thio.top')

This ambiguity in the topology results in an error when I try and write amber files with parmed:

structure.save('test_thio.prmtop') 
test = parmed.load_file('test_thio.prmtop')
parmed.tools.actions.writeFrcmod(test, 'test_thio.frcmod').execute()

Output

---------------------------------------------------------------------------
ParameterError                            Traceback (most recent call last)
Input In [87], in <cell line: 1>()
----> 1 parmed.tools.actions.writeFrcmod(test, 'test_thio.frcmod').execute()

File ~/anaconda3/envs/mamba/envs/mdapsikit/lib/python3.9/site-packages/parmed/tools/actions.py:344, in writeFrcmod.execute(self)
    342 if not Action.overwrite and os.path.exists(self.frcmod_name):
    343     raise FileExists('%s exists; not overwriting' % self.frcmod_name)
--> 344 parmset = AmberParameterSet.from_structure(self.parm)
    345 title = 'Force field parameters from %s' % os.path.split(str(self.parm))[1]
    346 parmset.write(self.frcmod_name, title=title)

File ~/anaconda3/envs/mamba/envs/mdapsikit/lib/python3.9/site-packages/parmed/amber/parameters.py:364, in AmberParameterSet.from_structure(cls, struct)
    349 @classmethod
    350 def from_structure(cls, struct):
    351     """ Extracts known parameters from a Structure instance
    352 
    353     Parameters
   (...)
    362         The parameter set with all parameters defined in the Structure
    363     """
--> 364     return super(AmberParameterSet, cls).from_structure(struct, allow_unequal_duplicates=False)

File ~/anaconda3/envs/mamba/envs/mdapsikit/lib/python3.9/site-packages/parmed/parameters.py:228, in ParameterSet.from_structure(cls, struct, allow_unequal_duplicates)
    226 if key in params.bond_types:
    227     if not allow_unequal_duplicates and params.bond_types[key] != bond.type:
--> 228         raise ParameterError('Unequal bond types defined between %s and %s' % key)
    229     continue # pragma: no cover
    230 typ = copy(bond.type)

ParameterError: Unequal bond types defined between C1 and C1

Copying the parmed error is a bit of a tangent, but I hope you don't mind. My overall goal is to save each small fragment as a library file and build polymers by mixing and matching - not sure whether this could be achieved entirely through openff/openmm data structures?

Computing environment:

Mac OS 11.4
ambertools 21.11
openbabel 3.1.1
openff-bespokefit 0.1.1
openff-forcefields 2.0.0
openff-fragmenter-base 0.1.2
openff-qcsubmit 0.3.1
openff-toolkit 0.10.4
openff-toolkit-base 0.10.4
openff-utilities 0.1.3
openmm 7.7.0

Additional context

test_thio.top.zip

Apr 18 '22 22:04 micaela-matta

Thanks for the thorough report - I'm pretty sure this is a bug in ParmEd and/or this interface to it, but I haven't looked closely enough to say for sure. I'll have another look tomorrow and update you then.

My overall goal is to save each small fragment as a library file and build polymers by mixing and matching - not sure whether this could be achieved entirely through openff/openmm data structures?

I assume your fragments here are monomers of some sort within a larger, connected polymer - i.e. bonded to neighboring fragments (/residues/monomers)?

Apr 19 '22 02:04 mattwthompson

I'm pretty sure this is a bug in ParmEd and/or this interface to it

I saved the OpenMM system into an xml file and the treatment of Cs and Hs is consistent with what I see in ParmEd-saved topology files:

<Particle eps=".3635030558377792" q="-.1688777777777778" sig=".3480646886945065"/> #--> assigned type C1
<Particle eps=".3635030558377792" q="-.1567777777777778" sig=".3480646886945065"/> # --> C1 but I'd expect it to be C2
<Particle eps=".3635030558377792" q="-.1567777777777778" sig=".3480646886945065"/> # --> C1 but I'd expect it to be C2
<Particle eps=".3635030558377792" q="-.1688777777777778" sig=".3480646886945065"/> #--> assigned type C1
<Particle eps="1.046" q=".008422222222222216" sig=".35635948725613575"/> #--> S1
<Particle eps=".05483968129296432" q=".16922222222222222" sig=".2453626527729973"/> #--> assigned type H1
<Particle eps=".06531785996356952" q=".1522222222222222" sig=".25725815350632797"/>  #--> assigned type H2 (why?)
<Particle eps=".06531785996356952" q=".1522222222222222" sig=".25725815350632797"/>
<Particle eps=".05483968129296432" q=".16922222222222222" sig=".2453626527729973"/>

I assume your fragments here are monomers of some sort within a larger, connected polymer - i.e. bonded to neighboring fragments (/residues/monomers)?

exactly. I know this is currently not in scope, but I'm trying to see what I can retool from existing libraries/projects.

Apr 19 '22 08:04 micaela-matta

A coarse solution is to add this flag (https://github.com/ParmEd/ParmEd/pull/1087/files) when saving the files in the first place:

import parmed
from openff.toolkit.topology import Molecule, Topology
from openff.toolkit.typing.engines.smirnoff import ForceField
forcefield = ForceField("openff_unconstrained-2.0.0.offxml")
mol = Molecule.from_smiles('c1cccs1')

top = Topology.from_molecules(mol)

sys = forcefield.create_openmm_system(top)

-structure = parmed.openmm.load_topology(top.to_openmm(), system=sys)
+structure = parmed.openmm.load_topology(top.to_openmm(), system=sys, condense_atom_types=False)

structure.residues[0].name = "RES"
structure.save('test_thio.top')

This at least gets the next block of code running and prevents the two carbon types . It can have the downside you'd expect - having too many atom types:

$ grep "C[0-9]" test_thio.top
C1             6  12.010780  0.00000000  A     0.34806469     0.36350306
C2             6  12.010780  0.00000000  A     0.34806469     0.36350306
C3             6  12.010780  0.00000000  A     0.34806469     0.36350306
C4             6  12.010780  0.00000000  A     0.34806469     0.36350306
    1         C1      1    RES    C1x      1 -0.16887778  12.010780   ; qtot -0.168878
    2         C2      1    RES    C2x      2 -0.15677778  12.010780   ; qtot -0.325656
    3         C3      1    RES    C3x      3 -0.15677778  12.010780   ; qtot -0.482433
    4         C4      1    RES    C4x      4 -0.16887778  12.010780   ; qtot -0.651311

where you'd expect 2, this gives you 4, and you can see how that might not be workable for production systems. This is fine in OpenMM's eyes since it doesn't distinguish them to begin with, but IIRC some engines don't want to build the non-bonded interaction matrix with hundreds or thousands of atom types.

I'm not sure why the carbon atoms with different charges are condensed onto the same types. The piece of ParmEd code should be considering charges when asking itself if two OpenMM atoms should be assigned the same atom type.

The next release of the toolkit (0.11.0, RC package hopefully online within the month, full release a month after that) adds support for (bio)polymers and switches OpenMM system creation to our pseudo replacement for Parmed, Interchange. Non-biological polymers are certainly something we want to support better throughout our infrastructure, but if your work is timely, it might not be worth changing course from how you're currently wrestling existing tools.

Apr 19 '22 13:04 mattwthompson

To one of your original questions:

To my surprise I get 2 different H types, H1 and H2:

I'm not sure I have a great explanation, but the vdW terms match to unique parameters for hydrogens and only one for carbon:

In [70]: [v.parameter_type for v in forcefield['vdW'].find_matches(top).values()]
Out[70]:
[<vdWType with smirks: [#6:1]  epsilon: 0.0868793154488 kcal/mol  id: n14  rmin_half: 1.953447017081 A  >,
 <vdWType with smirks: [#6:1]  epsilon: 0.0868793154488 kcal/mol  id: n14  rmin_half: 1.953447017081 A  >,
 <vdWType with smirks: [#6:1]  epsilon: 0.0868793154488 kcal/mol  id: n14  rmin_half: 1.953447017081 A  >,
 <vdWType with smirks: [#6:1]  epsilon: 0.0868793154488 kcal/mol  id: n14  rmin_half: 1.953447017081 A  >,
 <vdWType with smirks: [#16:1]  epsilon: 0.25 kcal/mol  id: n21  rmin_half: 2.0 A  >,
 <vdWType with smirks: [#1:1]-[#6X3]~[#7,#8,#9,#16,#17,#35]  epsilon: 0.01310699839698 kcal/mol  id: n8  rmin_half: 1.377051329051 A  >,  # H on C right next to S
 <vdWType with smirks: [#1:1]-[#6X3]  epsilon: 0.01561134320353 kcal/mol  id: n7  rmin_half: 1.443812569645 A  >,  # H on C farther away from S
 <vdWType with smirks: [#1:1]-[#6X3]  epsilon: 0.01561134320353 kcal/mol  id: n7  rmin_half: 1.443812569645 A  >,
 <vdWType with smirks: [#1:1]-[#6X3]~[#7,#8,#9,#16,#17,#35]  epsilon: 0.01310699839698 kcal/mol  id: n8  rmin_half: 1.377051329051 A  >]

but both "types" of carbons and hydrogens are assigned different partial charges in ways you'd expect:

In [71]: mol.assign_partial_charges(partial_charge_method="am1bcc")

In [72]: [(a.atomic_number, a.partial_charge._value) for a in mol.atoms]
Out[72]:
[(6, -0.168909997265372),
 (6, -0.15659000031236145),
 (6, -0.15659000031236145),
 (6, -0.16886000386956665),
 (16, 0.00702999946143892),
 (1, 0.16964000411745575),
 (1, 0.15231999703165558),
 (1, 0.15231999703165558),
 (1, 0.16964000411745575)]

something that, again, I would have expected ParmEd to pick up on. Adding the flag fixes it

In [91]: structure = parmed.openmm.load_topology(top.to_openmm(), system=sys)

In [92]: [a.atom_type.name for a in structure.atoms]
Out[92]: ['C1', 'C1', 'C1', 'C1', 'S1', 'H1', 'H2', 'H2', 'H1']

In [93]: structure.atoms[0].atom_type == structure.atoms[1].atom_type == structure.atoms[2].atom_type == structure.atoms[3].atom_type
Out[93]: True

In [94]: structure = parmed.openmm.load_topology(top.to_openmm(), system=sys, condense_atom_types=False)

In [95]: [a.atom_type.name for a in structure.atoms]
Out[95]: ['C1', 'C2', 'C3', 'C4', 'S1', 'H1', 'H2', 'H3', 'H4']

In [96]: structure.atoms[0].atom_type == structure.atoms[1].atom_type == structure.atoms[2].atom_type == structure.atoms[3].atom_type
Out[96]: False

but I wouldn't expect that to have been necessary since ParmEd should noticed the different charges from OpenMM.

At first I thought this might have something to do with aromaticity, I think an organic chemistry textbook would say thiophene is aromatic, but the MDL aromaticity model (OpenEye/RDKit) does not:

In [62]: oemol = oechem.OEGraphMol()

In [63]: oechem.OESmilesToMol(oemol, "c1cccs1")
Out[63]: True

In [64]: oechem.OEAssignAromaticFlags(oemol, oechem.OEAroModel_MDL)

In [65]: [atom.IsAromatic() for atom in oemol.GetAtoms()]
Out[65]: [False, False, False, False, False]

In [66]: [atom.is_aromatic for atom in Molecule.from_smiles("c1cccs1").atoms]
Out[66]: [False, False, False, False, False, False, False, False, False]

so while intuitively I'd think that all carbons and hydrogens in thiophene should have similar physics, MDL tagging them as non-aromatic means they're assigned different parameters. (One could just as easily write a force field that has SMIRKS patterns matching the two types of carbons here differently, but Sage doesn't do that, my guess is that their physics is not unique enough to justify that and/or thiophenes are already modeled better than other chemistries that got more focus.) I'm not sure how much that helps but hopefully it adds some context.

Apr 19 '22 13:04 mattwthompson

The next release of the toolkit (0.11.0, RC package hopefully online within the month, full release a month after that) adds support for (bio)polymers and switches OpenMM system creation to our pseudo replacement for Parmed, Interchange. Non-biological polymers are certainly something we want to support better throughout our infrastructure, but if your work is timely, it might not be worth changing course from how you're currently wrestling existing tools.

This sounds exciting. I am working on a new workflow for my students and I'm hitting a few walls similar to this one, so I can afford to wait a few months for something that can potentially simplify my life (and eventually theirs).

The piece of ParmEd code should be considering charges when asking itself if two OpenMM atoms should be assigned the same atom type.

👀 I tried passing my system throughbespokefit just to see, but I get the same result with 1 C type and 2 H types. I will try my luck with gaff and GAFFTemplateGenerator where the thiophene atoms are assigned as cc and cd types.

Thanks for looking into this @mattwthompson

Apr 19 '22 15:04 micaela-matta

I wonder if these files are closer to what you'd expect? I didn't run a simulation with them but they pass a quick eye test and the C-C bond parameters look correct (or at least different).

out.gro.txt out.top.txt

These are generated using the latest development versions of Interchange and the toolkit, which likely look similar to what will be in the next release. This snippet bypasses ParmEd completely. Small molecules with well-defined chemistry (i.e. from SMILES or an SDF file) should work well, but support for (non-biological) polymers is not pretty at the moment.

from openff.toolkit.topology import Molecule, Topology
from openff.toolkit.typing.engines.smirnoff import ForceField

forcefield = ForceField("openff_unconstrained-2.0.0.offxml")

mol = Molecule.from_smiles('c1cccs1')
mol.generate_conformers(n_conformers=1)
top = Topology.from_molecules(mol)

sys = forcefield.create_interchange(top)

sys.to_top("out.top")
sys.to_gro("out.gro")

Jun 29 '22 20:06 mattwthompson

Thanks for following up. I am from mobile but from a quick look the topology shows still 2 H atom types and only one C type. Have you tried to write an amber topology file? That’s where I got stuck with a similar file last time.

Jun 30 '22 07:06 micaela-matta

Not recently but Interchange has Amber exports ready (to_prmtop / to_inpcrd) when its new version is released

Jun 30 '22 13:06 mattwthompson

As far as I can see these files have the same problem as the ones in my original issue. There are 2 kinds of C-C bonds and only one atom type, so parmed would refuse to write a prmtop file because of this ambiguity and grompp would likely also abort the run.

In the meantime I have gotten an openeye license so I will try again with that, but I believe this problem might be related to issue openforcefield/discussions#5 regarding aromaticity.

Jun 30 '22 17:06 micaela-matta

openff-toolkit openff-toolkit copied to clipboard

thiophene gets only 1 C atom type, but 2 different C-C bond lengths

openff-toolkit
openff-toolkit copied to clipboard