openff-toolkit icon indicating copy to clipboard operation
openff-toolkit copied to clipboard

PDBs written using OpenEyeToolkitWrapper don't contain atom names from OFFMol

Open j-wags opened this issue 4 years ago • 2 comments

Describe the bug

from openforcefield.topology import Molecule
from openforcefield.utils.toolkits import RDKitToolkitWrapper, OpenEyeToolkitWrapper
mol = Molecule.from_smiles('CCO')
mol.generate_unique_atom_names()
mol.to_file('test_oetkw.pdb', file_format='pdb', toolkit_registry=OpenEyeToolkitWrapper())

yields a pdb file without unique atom names

ATOM      1  C   UNL     1       0.000   0.000   0.000  0.00  0.00 
ATOM      2  C   UNL     1       0.000   0.000   0.000  0.00  0.00 
ATOM      3  O   UNL     1       0.000   0.000   0.000  0.00  0.00 
ATOM      4  H   UNL     1       0.000   0.000   0.000  0.00  0.00 
ATOM      5  H   UNL     1       0.000   0.000   0.000  0.00  0.00 
ATOM      6  H   UNL     1       0.000   0.000   0.000  0.00  0.00 
ATOM      7  H   UNL     1       0.000   0.000   0.000  0.00  0.00 
ATOM      8  H   UNL     1       0.000   0.000   0.000  0.00  0.00 
ATOM      9  H   UNL     1       0.000   0.000   0.000  0.00  0.00 
CONECT    1    2    4    5    6
CONECT    2    3    7    8
CONECT    3    9
END

If we use RDKitToolkitWrapper,

mol.to_file('test_rdtkwk.pdb', file_format='pdb', toolkit_registry=RDKitToolkitWrapper())

we DO get a file with unique atom names

HETATM    1  C1  UNL     1       0.000   0.000   0.000  1.00  0.00           C  
HETATM    2  C2  UNL     1       0.000   0.000   0.000  1.00  0.00           C  
HETATM    3  O1  UNL     1       0.000   0.000   0.000  1.00  0.00           O  
HETATM    4  H1  UNL     1       0.000   0.000   0.000  1.00  0.00           H  
HETATM    5  H2  UNL     1       0.000   0.000   0.000  1.00  0.00           H  
HETATM    6  H3  UNL     1       0.000   0.000   0.000  1.00  0.00           H  
HETATM    7  H4  UNL     1       0.000   0.000   0.000  1.00  0.00           H  
HETATM    8  H5  UNL     1       0.000   0.000   0.000  1.00  0.00           H  
HETATM    9  H6  UNL     1       0.000   0.000   0.000  1.00  0.00           H  
CONECT    1    2    4    5    6
CONECT    2    3    7    8
CONECT    3    9
END

Computing environment (please complete the following information):

  • Mac OSX
openeye-toolkits          2020.1.0                 py37_0    openeye
openforcefield            0.8.0+129.g3c1663f1.dirty           dev_0    <develop>

Additional context I've emailed OpenEye about this, will update the issue when I hear back.

j-wags avatar Feb 19 '21 22:02 j-wags

The other issue is that the PDB file should also use HETATM records and not ATOM records.

For either of these PDB files to be compliant with the PDB spec, further records are required:

"For each het group that appears in the entry, the wwPDB checks that the corresponding HET, HETNAM, HETSYN, FORMUL, HETATM, and CONECT records appear, if applicable."

I believe this is one of the reasons we elected not to support PDB initially.

jchodera avatar Feb 19 '21 23:02 jchodera

Response from OE (This should unblock the PR above)

We are still working to get to the bottom of some of the behaviors that we are seeing but we at least have found a way to get the modified atom names reflected in the output file. So you are using OEWritePDBFile, which is the low-level writer as compared to OEWriteMolecule. OEWriteMolecule, as you may be aware of, detects that the output is PDB and calls OEWritePDBFile along with some additional perception routines to ensure a "properly" output molecule. When you using the low-level writer, we basically put the onus on the user to determine if the molecule is "suitable" for output and whether or not those additional perception routines need to be called. We noticed the formatting of the PDB file was different when using the low-level vs the high-level readers. The low-level reader writes methane as a protein, which is why the atom block starts with ATOM. Under these circumstances (and I'm still trying to get to the bottom of this), it appears to be blowing away the modified atom names.

COMPND methane ATOM 1 C UNL 1 0.000 0.000 0.000 0.00 0.00 C ATOM 2 H UNL 1 0.000 0.000 0.000 0.00 0.00 H ATOM 3 H UNL 1 0.000 0.000 0.000 0.00 0.00 H ATOM 4 H UNL 1 0.000 0.000 0.000 0.00 0.00 H ATOM 5 H UNL 1 0.000 0.000 0.000 0.00 0.00 H END

If you add in a call to

OEPerceiveResidues()

PRIOR to modifying the atom names, then methane is written out as a small molecule, which is why the atom block begins with HETATM. Under these circumstances, the modified atom names are properly reflected in the output file.

COMPND methane HETATM 1 X0 CH4 1 0.000 0.000 0.000 1.00 20.00 C HETATM 2 X1 CH4 1 0.000 0.000 0.000 1.00 20.00 H HETATM 3 X2 CH4 1 0.000 0.000 0.000 1.00 20.00 H HETATM 4 X3 CH4 1 0.000 0.000 0.000 1.00 20.00 H HETATM 5 X4 CH4 1 0.000 0.000 0.000 1.00 20.00 H TER 6 CH4 1 CONECT 1 2 3 4 5 CONECT 2 1 CONECT 3 1 CONECT 4 1 CONECT 5 1 END

When using the high-level writer instead of the low-level writer (without a call to OEPerceiveResidues, methane is detected as a small molecule but curiously, only the heavy atom carbon has the modified atom name reflected in the output file.

COMPND methane HETATM 1 X0 CH4 1 0.000 0.000 0.000 1.00 20.00 C HETATM 2 H 1 CH4 1 0.000 0.000 0.000 1.00 20.00 H HETATM 3 H 2 CH4 1 0.000 0.000 0.000 1.00 20.00 H HETATM 4 H 3 CH4 1 0.000 0.000 0.000 1.00 20.00 H HETATM 5 H 4 CH4 1 0.000 0.000 0.000 1.00 20.00 H TER 6 CH4 1 CONECT 1 2 3 4 5 CONECT 2 1 CONECT 3 1 CONECT 4 1 CONECT 5 1 END

If we put the OEPerceiveResidues call back in and use OEWriteMolecule, then we get an output file with the fully modified atom names. Now here is where it gets weirder. If we use another small molecule like aspirin, we see the same behavior when using the low-level writer without using OEPerceiveResidues. The molecule is written as a protein and none of the atom names are modified. However, if we use the high-level writer, then all of the atom names are modified, not just the heavy atom carbon. We believe this has something to do with methane being detected as a valid residue (i.e could be a protein) whereas aspirin is obviously detected as a small molecule. Again though, still working to get to the bottom of this.

[H]ere is the bottom line. For any small molecules, if you want to use the low-level writer, OEWritePDBFile(), you will need to call OEPerceiveResidues() in order to detect it as a small molecule instead of a protein. This way, the modified atom names will be reflected in the output file. If you choose to use the high-level writer, OEWriteMolecule(), you may or may not have to call OEPerceiveResidues, depending on the molecule (probably for other amino acids, cofactors, excipients, etc).

https://docs.eyesopen.com/toolkits/python/oechemtk/OEChemFunctions/OEPerceiveResidues.html

j-wags avatar Mar 16 '21 21:03 j-wags