MolVS
TautomerCanonicalizer gives unexpected/forbidden form of phosphoric acid
I'm converting all the molecules in my database to canonical tautomers and noticed that things like NADH looked weird. You can see it most plainly for phosphoric acid: I didn't expect the hydrogen on the phosphorus. Is this the correct/expected behavior?
from rdkit import Chem
from rdkit.Chem import Draw
from molvs.tautomer import TautomerCanonicalizer
original_smiles = 'OP(=O)(O)O'
original_mol = Chem.MolFromSmiles(original_smiles)
tautomerized_mol = TautomerCanonicalizer().canonicalize(original_mol)
Draw.MolsToGridImage([original_mol, tautomerized_mol],
                     molsPerRow=3, subImgSize=(200, 200),
                     legends=['original', 'tautomer'])
NADH looks like this:
original_smiles = 'NC(=O)C1=CN([C@@H]2O[C@H](COP(=O)(O)OP(=O)(O)OC[C@H]3O[C@@H](N4C=NC5=C4N=CN=C5N)[C@H](O)[C@@H]3O)[C@@H](O)[C@H]2O)C=CC1'
original_mol = Chem.MolFromSmiles(original_smiles)
tautomerized_mol = TautomerCanonicalizer().canonicalize(original_mol)
Draw.MolsToGridImage([original_mol, tautomerized_mol],
                     molsPerRow=1, subImgSize=(600, 300),
                     legends=['original', 'tautomer'])
I think this is caused by the phosphonic acid rules: https://github.com/mcs07/MolVS/blob/master/molvs/tautomer.py#L130
It can probably be fixed by making the SMARTS pattern more strict to match only the intended target: https://en.wikipedia.org/wiki/Phosphorous_acid
You are correct: removing that rule stops that moiety from being modified. When you say "more strict", do you mean specifying an explicit number of bonds on the phosphorus in the SMARTS pattern?
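One way to test the "explicit number of bonds" idea before touching MolVS is to compare a loose and a stricter SMARTS pattern directly in RDKit. This is only a sketch: the `loose` pattern is an assumption about what the linked rule roughly looks like, and the `strict` pattern (requiring a three-coordinate phosphorus via `X3`) is a hypothetical variant, not something taken from the MolVS source.

```python
from rdkit import Chem

# Assumed shape of the existing rule's pattern (not checked against MolVS).
loose = Chem.MolFromSmarts('[OH]-[PH0]')
# Hypothetical stricter variant: phosphorus with exactly three connections,
# as in the P(OH)3 form of phosphorous acid.
strict = Chem.MolFromSmarts('[OH]-[PX3;H0]')

phosphoric = Chem.MolFromSmiles('OP(=O)(O)O')   # should be left alone
phosphorous = Chem.MolFromSmiles('OP(O)O')      # the intended target

loose_hits_phosphoric = phosphoric.HasSubstructMatch(loose)
strict_hits_phosphoric = phosphoric.HasSubstructMatch(strict)
strict_hits_phosphorous = phosphorous.HasSubstructMatch(strict)

print('loose matches phosphoric acid: ', loose_hits_phosphoric)
print('strict matches phosphoric acid:', strict_hits_phosphoric)
print('strict matches phosphorous acid:', strict_hits_phosphorous)
```

If the stricter pattern skips phosphoric acid but still hits the P(OH)3 form, it would also leave the phosphate groups in NADH untouched, since those phosphorus atoms have four connections.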
Why does RDKit allow 7 bonds on the phosphorus? RDKit is a vast package, but looking at its definition of phosphorus, it has a maximum of 5 bonds.
If I run SanitizeMol, the hydrogen stays put. When I paste the structure into ChemDraw, it's not valid.
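To make the "7 bonds" observation concrete, the valence on the phosphorus can be counted by hand. The sketch below assumes the unexpected tautomer corresponds to the SMILES `O=[PH](=O)(O)O` (a hypothetical reconstruction of the drawn structure, not output copied from MolVS). It parses with `sanitize=False` so the count does not depend on RDKit's own valence checks, and it prints whatever valence list the installed RDKit reports for phosphorus, which may differ between versions.

```python
from rdkit import Chem

# Hypothetical SMILES for the unexpected tautomer: one H moved onto P.
# sanitize=False so parsing does not depend on RDKit's valence checks.
odd = Chem.MolFromSmiles('O=[PH](=O)(O)O', sanitize=False)
odd.UpdatePropertyCache(strict=False)  # compute H counts without raising

p_atom = next(a for a in odd.GetAtoms() if a.GetSymbol() == 'P')

# Valence counted as: sum of bond orders to heavy atoms + attached hydrogens.
bond_order_sum = sum(b.GetBondTypeAsDouble() for b in p_atom.GetBonds())
p_valence = int(bond_order_sum) + p_atom.GetTotalNumHs()
print('valence on P:', p_valence)

# Valences the installed RDKit lists for phosphorus (version dependent).
allowed = list(Chem.GetPeriodicTable().GetValenceList(15))
print('RDKit valence list for P:', allowed)
```

If the printed list contains the computed valence, that would explain why SanitizeMol accepts the structure even though ChemDraw rejects it.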