openff-toolkit
openff-toolkit copied to clipboard
PDBs written using OpenEyeToolkitWrapper don't contain atom names from OFFMol
Describe the bug
from openforcefield.topology import Molecule
from openforcefield.utils.toolkits import RDKitToolkitWrapper, OpenEyeToolkitWrapper
mol = Molecule.from_smiles('CCO')
mol.generate_unique_atom_names()
mol.to_file('test_oetkw.pdb', file_format='pdb', toolkit_registry=OpenEyeToolkitWrapper())
yields a pdb file without unique atom names
ATOM 1 C UNL 1 0.000 0.000 0.000 0.00 0.00
ATOM 2 C UNL 1 0.000 0.000 0.000 0.00 0.00
ATOM 3 O UNL 1 0.000 0.000 0.000 0.00 0.00
ATOM 4 H UNL 1 0.000 0.000 0.000 0.00 0.00
ATOM 5 H UNL 1 0.000 0.000 0.000 0.00 0.00
ATOM 6 H UNL 1 0.000 0.000 0.000 0.00 0.00
ATOM 7 H UNL 1 0.000 0.000 0.000 0.00 0.00
ATOM 8 H UNL 1 0.000 0.000 0.000 0.00 0.00
ATOM 9 H UNL 1 0.000 0.000 0.000 0.00 0.00
CONECT 1 2 4 5 6
CONECT 2 3 7 8
CONECT 3 9
END
If we use RDKitToolkitWrapper,
mol.to_file('test_rdtkwk.pdb', file_format='pdb', toolkit_registry=RDKitToolkitWrapper())
we DO get a file with unique atom names
HETATM 1 C1 UNL 1 0.000 0.000 0.000 1.00 0.00 C
HETATM 2 C2 UNL 1 0.000 0.000 0.000 1.00 0.00 C
HETATM 3 O1 UNL 1 0.000 0.000 0.000 1.00 0.00 O
HETATM 4 H1 UNL 1 0.000 0.000 0.000 1.00 0.00 H
HETATM 5 H2 UNL 1 0.000 0.000 0.000 1.00 0.00 H
HETATM 6 H3 UNL 1 0.000 0.000 0.000 1.00 0.00 H
HETATM 7 H4 UNL 1 0.000 0.000 0.000 1.00 0.00 H
HETATM 8 H5 UNL 1 0.000 0.000 0.000 1.00 0.00 H
HETATM 9 H6 UNL 1 0.000 0.000 0.000 1.00 0.00 H
CONECT 1 2 4 5 6
CONECT 2 3 7 8
CONECT 3 9
END
Computing environment (please complete the following information):
- Mac OSX
openeye-toolkits 2020.1.0 py37_0 openeye
openforcefield 0.8.0+129.g3c1663f1.dirty dev_0 <develop>
Additional context I've emailed OpenEye about this, will update the issue when I hear back.
The other issue is that the PDB file should also use HETATM records and not ATOM records.
For either of these PDB files to be compliant with the PDB spec, further records are required:
"For each het group that appears in the entry, the wwPDB checks that the corresponding HET, HETNAM, HETSYN, FORMUL, HETATM, and CONECT records appear, if applicable."
I believe this is one of the reasons we elected not to support PDB initially.
Response from OE (This should unblock the PR above)
We are still working to get to the bottom of some of the behaviors that we are seeing but we at least have found a way to get the modified atom names reflected in the output file. So you are using OEWritePDBFile, which is the low-level writer as compared to OEWriteMolecule. OEWriteMolecule, as you may be aware of, detects that the output is PDB and calls OEWritePDBFile along with some additional perception routines to ensure a "properly" output molecule. When you using the low-level writer, we basically put the onus on the user to determine if the molecule is "suitable" for output and whether or not those additional perception routines need to be called. We noticed the formatting of the PDB file was different when using the low-level vs the high-level readers. The low-level reader writes methane as a protein, which is why the atom block starts with ATOM. Under these circumstances (and I'm still trying to get to the bottom of this), it appears to be blowing away the modified atom names.
COMPND methane ATOM 1 C UNL 1 0.000 0.000 0.000 0.00 0.00 C ATOM 2 H UNL 1 0.000 0.000 0.000 0.00 0.00 H ATOM 3 H UNL 1 0.000 0.000 0.000 0.00 0.00 H ATOM 4 H UNL 1 0.000 0.000 0.000 0.00 0.00 H ATOM 5 H UNL 1 0.000 0.000 0.000 0.00 0.00 H END
If you add in a call to
OEPerceiveResidues()
PRIOR to modifying the atom names, then methane is written out as a small molecule, which is why the atom block begins with HETATM. Under these circumstances, the modified atom names are properly reflected in the output file.
COMPND methane HETATM 1 X0 CH4 1 0.000 0.000 0.000 1.00 20.00 C HETATM 2 X1 CH4 1 0.000 0.000 0.000 1.00 20.00 H HETATM 3 X2 CH4 1 0.000 0.000 0.000 1.00 20.00 H HETATM 4 X3 CH4 1 0.000 0.000 0.000 1.00 20.00 H HETATM 5 X4 CH4 1 0.000 0.000 0.000 1.00 20.00 H TER 6 CH4 1 CONECT 1 2 3 4 5 CONECT 2 1 CONECT 3 1 CONECT 4 1 CONECT 5 1 END
When using the high-level writer instead of the low-level writer (without a call to OEPerceiveResidues, methane is detected as a small molecule but curiously, only the heavy atom carbon has the modified atom name reflected in the output file.
COMPND methane HETATM 1 X0 CH4 1 0.000 0.000 0.000 1.00 20.00 C HETATM 2 H 1 CH4 1 0.000 0.000 0.000 1.00 20.00 H HETATM 3 H 2 CH4 1 0.000 0.000 0.000 1.00 20.00 H HETATM 4 H 3 CH4 1 0.000 0.000 0.000 1.00 20.00 H HETATM 5 H 4 CH4 1 0.000 0.000 0.000 1.00 20.00 H TER 6 CH4 1 CONECT 1 2 3 4 5 CONECT 2 1 CONECT 3 1 CONECT 4 1 CONECT 5 1 END
If we put the OEPerceiveResidues call back in and use OEWriteMolecule, then we get an output file with the fully modified atom names. Now here is where it gets weirder. If we use another small molecule like aspirin, we see the same behavior when using the low-level writer without using OEPerceiveResidues. The molecule is written as a protein and none of the atom names are modified. However, if we use the high-level writer, then all of the atom names are modified, not just the heavy atom carbon. We believe this has something to do with methane being detected as a valid residue (i.e could be a protein) whereas aspirin is obviously detected as a small molecule. Again though, still working to get to the bottom of this.
[H]ere is the bottom line. For any small molecules, if you want to use the low-level writer, OEWritePDBFile(), you will need to call OEPerceiveResidues() in order to detect it as a small molecule instead of a protein. This way, the modified atom names will be reflected in the output file. If you choose to use the high-level writer, OEWriteMolecule(), you may or may not have to call OEPerceiveResidues, depending on the molecule (probably for other amino acids, cofactors, excipients, etc).
https://docs.eyesopen.com/toolkits/python/oechemtk/OEChemFunctions/OEPerceiveResidues.html