rdkit
rdkit copied to clipboard
Canonicalize shifts double bond out of conjugated system
Describe the bug The TautomerEnumerator.Canonicalize is creating a tautomer that is moving a double bond from out of a conjugated system further into a ring. This tautomer is not corroborated by OEChem or Moka tautomer tools.
To Reproduce
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize
# v alpha-beta to carbonyl
smi = "O=CC1CCC=C(C(=O)c2ccccc2)C1"
mol = Chem.MolFromSmiles(smi)
print(Chem.MolToSmiles(mol))
assert Chem.MolToSmiles(mol) == smi
enumerator = rdMolStandardize.TautomerEnumerator()
canonical = enumerator.Canonicalize(mol)
print(Chem.MolToSmiles(canonical))
# v beta-gamma to carbonyl
canonical_smi = "O=CC1CC=CC(C(=O)c2ccccc2)C1"
# assert Chem.MolToSmiles(canonical) == canonical_smi
import openeye.oechem as oe
import openeye.oequacpac as oequacpac
# OE does not tautomerize this way
mol = oe.OEGraphMol()
oe.OESmilesToMol(mol, smi)
print(oe.OEMolToSmiles(mol)) # c1ccc(cc1)C(=O)C2=CCCC(C2)C=O
oequacpac.OEGetReasonableProtomer(mol)
print(oe.OEMolToSmiles(mol)) # c1ccc(cc1)C(=O)C2=CCCC(C2)C=O
# ^ doesn't shift
mol = oe.OEGraphMol()
oe.OESmilesToMol(mol, canonical_smi)
print(oe.OEMolToSmiles(mol)) # c1ccc(cc1)C(=O)C2CC(CC=C2)C=O
oequacpac.OEGetReasonableProtomer(mol)
print(oe.OEMolToSmiles(mol)) # c1ccc(cc1)C(=O)C2CC(CC=C2)C=O
# ^ doesn't shift
Expected behavior I expect the input molecule to stay the same in this Canonicalize call.
Screenshots
Configuration (please complete the following information):
- RDKit version: 2022.9.5; 2023.9.1
- OS: Ubuntu 20.04
- Python version (if relevant): py38, py311
- Are you using conda? no
- If you are using conda, which channel did you install the rdkit from? n/a
- If you are not using conda: how did you install the RDKit?
pip install rdkit
Additional context Possibly related to https://github.com/rdkit/rdkit/issues/5937, but this case is worse because we're going from SMILES, not 2D mols. And we're not creating an aromatic ring here.
This apparently is caused by the 1,5 (this)keto/enol f matching as seen below:
Smarts pattern: 1,5 (thio)keto/enol f [CX4,NX3;!H0]-[C]=[C][CH0]=[O,S,Se,Te;X1]
Shouldn't the actual tautomer here be:
I don't think that makes sense though, because here is the reference:
PT_02_00 |
[O,S,Se,Te;X1:1]=[Cz1H0:2][C:5]=[C:6][CX4z0,NX3:3][#1:4]>>[#1:4][O,S,Se,Te;X2:1][Cz1:2]=[C:5][C:6]=[Cz0,N:3]
Reference: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8459712/
If it was that tautomerization, I would expect to see
Yes, ditto.
Yes, we're in agreement :)
This is only a bug if the RDKit's tautomerization rules are being applied incorrectly or, possibly, if there's a bad rule (though there's a high bar on that one since we're using a published set of rules).
The canonical tautomer is intended to be canonical, not "nice", or low energy under any particular set of conditions.
I'm not very familiar with the codebase, but here are the rules:
std::make_tuple(
std::string("1,5 (thio)keto/enol f"),
std::string("[CX4z0,NX3;!H0]-[C]=[C][Cz1H0]=[O,S,Se,Te;X1]"),
std::string(""), std::string("")),
std::make_tuple(std::string("1,5 (thio)keto/enol r"),
std::string("[O,S,Se,Te;X2!H0]-[Cz1H0]=[C]-[C]=[Cz0,N]"),
std::string(""), std::string("")),
I don't see the r
SMARTS match in the rdkit output. There is no [O]-[C]
single bond in the current output. What is the code path that chooses how the "1,5 (thio)keto/enol f"
is transformed?
Is it possible that the 1,5 rule is combining with the 1,3 rule, making it go like the following?
alpha-beta unsat keto --[1,5 rule]--> beta-gamma unsat enol --[1,3 rule]--> beta-gamma unsat keto