openff-toolkit icon indicating copy to clipboard operation
openff-toolkit copied to clipboard

`OpenEyeToolkitWrapper.to_file` writes PDB files with no CONECT records

Open mattwthompson opened this issue 3 years ago • 2 comments

Describe the bug A Molecule written to a PDB file includes CONECT records when OpenEye Toolkits are not available but excludes them when they are. This can lead to surprising results - or at least I did not expect the choice of toolkit backend to have this effect.

To Reproduce

from openff.toolkit import Molecule
from openff.toolkit.utils.toolkits import (
    OpenEyeToolkitWrapper,
    RDKitToolkitWrapper,
    ToolkitRegistry,
)

molecule = Molecule.from_smiles("CC(=O)NC1=CC=C(C=C1)O")
molecule.generate_conformers(n_conformers=1)

molecule.to_file("via_openeye.pdb", "PDB", ToolkitRegistry([OpenEyeToolkitWrapper()]))
molecule.to_file("via_rdkit.pdb", "PDB", ToolkitRegistry([RDKitToolkitWrapper()]))

Output

$ python pdb_issue.py
$ wc -l *pdb
      22 via_openeye.pdb
      31 via_rdkit.pdb
      53 total
$ cat via_openeye.pdb
ATOM      0  C   UNL     1       1.405  -0.654  -3.811  1.00 20.00
ATOM      0  C   UNL     1       0.989  -0.470  -2.373  1.00 20.00
ATOM      0  O   UNL     1      -0.059   0.102  -2.091  1.00 20.00
ATOM      0  N   UNL     1       1.928  -1.015  -1.507  1.00 20.00
ATOM      0  C   UNL     1       1.885  -1.036  -0.112  1.00 20.00
ATOM      0  C   UNL     1       2.921  -1.631   0.608  1.00 20.00
ATOM      0  C   UNL     1       2.878  -1.652   2.002  1.00 20.00
ATOM      0  C   UNL     1       1.800  -1.079   2.676  1.00 20.00
ATOM      0  C   UNL     1       0.764  -0.484   1.955  1.00 20.00
ATOM      0  C   UNL     1       0.807  -0.463   0.561  1.00 20.00
ATOM      0  O   UNL     1       1.756  -1.102   4.036  1.00 20.00
ATOM      0  H   UNL     1       2.007   0.201  -4.134  1.00 20.00
ATOM      0  H   UNL     1       0.518  -0.724  -4.447  1.00 20.00
ATOM      0  H   UNL     1       1.989  -1.573  -3.916  1.00 20.00
ATOM      0  H   UNL     1       2.741  -1.450  -1.939  1.00 20.00
ATOM      0  H   UNL     1       3.767  -2.080   0.093  1.00 20.00
ATOM      0  H   UNL     1       3.688  -2.117   2.558  1.00 20.00
ATOM      0  H   UNL     1      -0.079  -0.036   2.474  1.00 20.00
ATOM      0  H   UNL     1      -0.006   0.003   0.010  1.00 20.00
ATOM      0  H   UNL     1       2.542  -1.549   4.390  1.00 20.00
TER       1      UNL     1
END
$ cat via_rdkit.pdb
HETATM    1  C1  UNL     1       1.405  -0.654  -3.811  1.00  0.00           C
HETATM    2  C2  UNL     1       0.989  -0.470  -2.373  1.00  0.00           C
HETATM    3  O1  UNL     1      -0.059   0.102  -2.091  1.00  0.00           O
HETATM    4  N1  UNL     1       1.928  -1.015  -1.507  1.00  0.00           N
HETATM    5  C3  UNL     1       1.885  -1.036  -0.112  1.00  0.00           C
HETATM    6  C4  UNL     1       2.921  -1.631   0.608  1.00  0.00           C
HETATM    7  C5  UNL     1       2.878  -1.652   2.002  1.00  0.00           C
HETATM    8  C6  UNL     1       1.800  -1.079   2.676  1.00  0.00           C
HETATM    9  C7  UNL     1       0.764  -0.484   1.955  1.00  0.00           C
HETATM   10  C8  UNL     1       0.807  -0.463   0.561  1.00  0.00           C
HETATM   11  O2  UNL     1       1.756  -1.102   4.036  1.00  0.00           O
HETATM   12  H1  UNL     1       2.007   0.201  -4.134  1.00  0.00           H
HETATM   13  H2  UNL     1       0.518  -0.724  -4.447  1.00  0.00           H
HETATM   14  H3  UNL     1       1.989  -1.573  -3.916  1.00  0.00           H
HETATM   15  H4  UNL     1       2.741  -1.450  -1.939  1.00  0.00           H
HETATM   16  H5  UNL     1       3.767  -2.080   0.093  1.00  0.00           H
HETATM   17  H6  UNL     1       3.688  -2.117   2.558  1.00  0.00           H
HETATM   18  H7  UNL     1      -0.079  -0.036   2.474  1.00  0.00           H
HETATM   19  H8  UNL     1      -0.006   0.003   0.010  1.00  0.00           H
HETATM   20  H9  UNL     1       2.542  -1.549   4.390  1.00  0.00           H
CONECT    1    2   12   13   14
CONECT    2    3    3    4
CONECT    4    5   15
CONECT    5    6   10   10
CONECT    6    7    7   16
CONECT    7    8   17
CONECT    8    9    9   11
CONECT    9   10   18
CONECT   10   19
CONECT   11   20
END

Computing environment (please complete the following information):

pip install git+https://github.com/openforcefield/openff-toolkit.git@topology-biopolymer-refactor

probably pointing to commit 57a442b5c4403016df8770ca3dca64c5cb8bbc0a. I originally found this using #1276 but reproduced it on each branch.

Additional context

I didn't look too deeply into the root causes, test other molecules, etc., it can surely be reproduced with something simpler. I'm ripping this almost directly from an Evaluator unit test that is failing because of this (later, an MDTraj Topology lacks bonds where it's expected to have them, and I was able to trace this back to the toolkit writing the PDB file that MDTraj read back.)

I can never keep track of what the expected behavior with CONECT records is in the cacophony of ways they can come into existence. If there are any reliable expectations we do have, I'd be happy to write them up in documentation and and add some unit tests to ensure the right behavior (and that the behavior does not surprisingly vary with backend toolkit).

mattwthompson avatar May 31 '22 22:05 mattwthompson

Thanks for reporting this. I think this is a legitimate bug.

Big-picture, this molecule isn't a protein or a correctly labeled polymer and shouldn't be getting written out using ATOM lines at all - HETATM is the correct way to indicate that this is an arbitrary molecule that may not conform to known residue templates. I vaguely recall that, if OE thinks that something is made of ATOMs, then it doesn't get CONECT records. But if it's made of HETATMs, it does.

Putting on my thinking cap, the root cause of this is probably the same as the stalled PR https://github.com/openforcefield/openff-toolkit/pull/848 - the exchange from OE in the comments there may have the key to fixing this.

j-wags avatar Jun 01 '22 16:06 j-wags

Per our meeting today - @mattwthompson will take over #848 IF it seems like the atom-name and ATOM vs. HETATM problems can be solved in the same way (though it's probably best at this point to close #848 and just copy the changes to something more recent)

j-wags avatar Jun 01 '22 21:06 j-wags