openff-toolkit
openff-toolkit copied to clipboard
`OpenEyeToolkitWrapper.to_file` writes PDB files with no CONECT records
Describe the bug
A Molecule written to a PDB file includes CONECT records when OpenEye Toolkits are not available but excludes them when they are. This can lead to surprising results - or at least I did not expect the choice of toolkit backend to have this effect.
To Reproduce
from openff.toolkit import Molecule
from openff.toolkit.utils.toolkits import (
OpenEyeToolkitWrapper,
RDKitToolkitWrapper,
ToolkitRegistry,
)
molecule = Molecule.from_smiles("CC(=O)NC1=CC=C(C=C1)O")
molecule.generate_conformers(n_conformers=1)
molecule.to_file("via_openeye.pdb", "PDB", ToolkitRegistry([OpenEyeToolkitWrapper()]))
molecule.to_file("via_rdkit.pdb", "PDB", ToolkitRegistry([RDKitToolkitWrapper()]))
Output
$ python pdb_issue.py
$ wc -l *pdb
22 via_openeye.pdb
31 via_rdkit.pdb
53 total
$ cat via_openeye.pdb
ATOM 0 C UNL 1 1.405 -0.654 -3.811 1.00 20.00
ATOM 0 C UNL 1 0.989 -0.470 -2.373 1.00 20.00
ATOM 0 O UNL 1 -0.059 0.102 -2.091 1.00 20.00
ATOM 0 N UNL 1 1.928 -1.015 -1.507 1.00 20.00
ATOM 0 C UNL 1 1.885 -1.036 -0.112 1.00 20.00
ATOM 0 C UNL 1 2.921 -1.631 0.608 1.00 20.00
ATOM 0 C UNL 1 2.878 -1.652 2.002 1.00 20.00
ATOM 0 C UNL 1 1.800 -1.079 2.676 1.00 20.00
ATOM 0 C UNL 1 0.764 -0.484 1.955 1.00 20.00
ATOM 0 C UNL 1 0.807 -0.463 0.561 1.00 20.00
ATOM 0 O UNL 1 1.756 -1.102 4.036 1.00 20.00
ATOM 0 H UNL 1 2.007 0.201 -4.134 1.00 20.00
ATOM 0 H UNL 1 0.518 -0.724 -4.447 1.00 20.00
ATOM 0 H UNL 1 1.989 -1.573 -3.916 1.00 20.00
ATOM 0 H UNL 1 2.741 -1.450 -1.939 1.00 20.00
ATOM 0 H UNL 1 3.767 -2.080 0.093 1.00 20.00
ATOM 0 H UNL 1 3.688 -2.117 2.558 1.00 20.00
ATOM 0 H UNL 1 -0.079 -0.036 2.474 1.00 20.00
ATOM 0 H UNL 1 -0.006 0.003 0.010 1.00 20.00
ATOM 0 H UNL 1 2.542 -1.549 4.390 1.00 20.00
TER 1 UNL 1
END
$ cat via_rdkit.pdb
HETATM 1 C1 UNL 1 1.405 -0.654 -3.811 1.00 0.00 C
HETATM 2 C2 UNL 1 0.989 -0.470 -2.373 1.00 0.00 C
HETATM 3 O1 UNL 1 -0.059 0.102 -2.091 1.00 0.00 O
HETATM 4 N1 UNL 1 1.928 -1.015 -1.507 1.00 0.00 N
HETATM 5 C3 UNL 1 1.885 -1.036 -0.112 1.00 0.00 C
HETATM 6 C4 UNL 1 2.921 -1.631 0.608 1.00 0.00 C
HETATM 7 C5 UNL 1 2.878 -1.652 2.002 1.00 0.00 C
HETATM 8 C6 UNL 1 1.800 -1.079 2.676 1.00 0.00 C
HETATM 9 C7 UNL 1 0.764 -0.484 1.955 1.00 0.00 C
HETATM 10 C8 UNL 1 0.807 -0.463 0.561 1.00 0.00 C
HETATM 11 O2 UNL 1 1.756 -1.102 4.036 1.00 0.00 O
HETATM 12 H1 UNL 1 2.007 0.201 -4.134 1.00 0.00 H
HETATM 13 H2 UNL 1 0.518 -0.724 -4.447 1.00 0.00 H
HETATM 14 H3 UNL 1 1.989 -1.573 -3.916 1.00 0.00 H
HETATM 15 H4 UNL 1 2.741 -1.450 -1.939 1.00 0.00 H
HETATM 16 H5 UNL 1 3.767 -2.080 0.093 1.00 0.00 H
HETATM 17 H6 UNL 1 3.688 -2.117 2.558 1.00 0.00 H
HETATM 18 H7 UNL 1 -0.079 -0.036 2.474 1.00 0.00 H
HETATM 19 H8 UNL 1 -0.006 0.003 0.010 1.00 0.00 H
HETATM 20 H9 UNL 1 2.542 -1.549 4.390 1.00 0.00 H
CONECT 1 2 12 13 14
CONECT 2 3 3 4
CONECT 4 5 15
CONECT 5 6 10 10
CONECT 6 7 7 16
CONECT 7 8 17
CONECT 8 9 9 11
CONECT 9 10 18
CONECT 10 19
CONECT 11 20
END
Computing environment (please complete the following information):
pip install git+https://github.com/openforcefield/openff-toolkit.git@topology-biopolymer-refactor
probably pointing to commit 57a442b5c4403016df8770ca3dca64c5cb8bbc0a. I originally found this using #1276 but reproduced it on each branch.
Additional context
I didn't look too deeply into the root causes, test other molecules, etc., it can surely be reproduced with something simpler. I'm ripping this almost directly from an Evaluator unit test that is failing because of this (later, an MDTraj Topology lacks bonds where it's expected to have them, and I was able to trace this back to the toolkit writing the PDB file that MDTraj read back.)
I can never keep track of what the expected behavior with CONECT records is in the cacophony of ways they can come into existence. If there are any reliable expectations we do have, I'd be happy to write them up in documentation and and add some unit tests to ensure the right behavior (and that the behavior does not surprisingly vary with backend toolkit).
Thanks for reporting this. I think this is a legitimate bug.
Big-picture, this molecule isn't a protein or a correctly labeled polymer and shouldn't be getting written out using ATOM lines at all - HETATM is the correct way to indicate that this is an arbitrary molecule that may not conform to known residue templates. I vaguely recall that, if OE thinks that something is made of ATOMs, then it doesn't get CONECT records. But if it's made of HETATMs, it does.
Putting on my thinking cap, the root cause of this is probably the same as the stalled PR https://github.com/openforcefield/openff-toolkit/pull/848 - the exchange from OE in the comments there may have the key to fixing this.
Per our meeting today - @mattwthompson will take over #848 IF it seems like the atom-name and ATOM vs. HETATM problems can be solved in the same way (though it's probably best at this point to close #848 and just copy the changes to something more recent)