rdkit icon indicating copy to clipboard operation
rdkit copied to clipboard

Canonicalize shifts double bond out of conjugated system

Open pechersky opened this issue 1 year ago • 9 comments

Describe the bug The TautomerEnumerator.Canonicalize is creating a tautomer that is moving a double bond from out of a conjugated system further into a ring. This tautomer is not corroborated by OEChem or Moka tautomer tools.

To Reproduce

from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

#              v alpha-beta to carbonyl
smi = "O=CC1CCC=C(C(=O)c2ccccc2)C1"

mol = Chem.MolFromSmiles(smi)
print(Chem.MolToSmiles(mol))
assert Chem.MolToSmiles(mol) == smi

enumerator = rdMolStandardize.TautomerEnumerator()

canonical = enumerator.Canonicalize(mol)
print(Chem.MolToSmiles(canonical))
#                       v beta-gamma to carbonyl
canonical_smi = "O=CC1CC=CC(C(=O)c2ccccc2)C1"
# assert Chem.MolToSmiles(canonical) == canonical_smi

import openeye.oechem as oe
import openeye.oequacpac as oequacpac

# OE does not tautomerize this way

mol = oe.OEGraphMol()
oe.OESmilesToMol(mol, smi)
print(oe.OEMolToSmiles(mol))  # c1ccc(cc1)C(=O)C2=CCCC(C2)C=O
oequacpac.OEGetReasonableProtomer(mol)
print(oe.OEMolToSmiles(mol))  # c1ccc(cc1)C(=O)C2=CCCC(C2)C=O
#                                               ^ doesn't shift

mol = oe.OEGraphMol()
oe.OESmilesToMol(mol, canonical_smi)
print(oe.OEMolToSmiles(mol))  # c1ccc(cc1)C(=O)C2CC(CC=C2)C=O
oequacpac.OEGetReasonableProtomer(mol)
print(oe.OEMolToSmiles(mol))  # c1ccc(cc1)C(=O)C2CC(CC=C2)C=O
#                                                     ^ doesn't shift

Expected behavior I expect the input molecule to stay the same in this Canonicalize call.

Screenshots image

Configuration (please complete the following information):

  • RDKit version: 2022.9.5; 2023.9.1
  • OS: Ubuntu 20.04
  • Python version (if relevant): py38, py311
  • Are you using conda? no
  • If you are using conda, which channel did you install the rdkit from? n/a
  • If you are not using conda: how did you install the RDKit? pip install rdkit

Additional context Possibly related to https://github.com/rdkit/rdkit/issues/5937, but this case is worse because we're going from SMILES, not 2D mols. And we're not creating an aromatic ring here.

pechersky avatar Nov 16 '23 05:11 pechersky

This apparently is caused by the 1,5 (this)keto/enol f matching as seen below:

image

bp-kelley avatar Nov 16 '23 15:11 bp-kelley

Smarts pattern: 1,5 (thio)keto/enol f [CX4,NX3;!H0]-[C]=[C][CH0]=[O,S,Se,Te;X1]

bp-kelley avatar Nov 16 '23 15:11 bp-kelley

Shouldn't the actual tautomer here be:

image

bp-kelley avatar Nov 16 '23 16:11 bp-kelley

I don't think that makes sense though, because here is the reference: image

PT_02_00 |   [O,S,Se,Te;X1:1]=[Cz1H0:2][C:5]=[C:6][CX4z0,NX3:3][#1:4]>>[#1:4][O,S,Se,Te;X2:1][Cz1:2]=[C:5][C:6]=[Cz0,N:3]

Reference: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8459712/

If it was that tautomerization, I would expect to see image

pechersky avatar Nov 16 '23 16:11 pechersky

Yes, ditto.

pechersky avatar Nov 16 '23 16:11 pechersky

Yes, we're in agreement :)

bp-kelley avatar Nov 16 '23 16:11 bp-kelley

This is only a bug if the RDKit's tautomerization rules are being applied incorrectly or, possibly, if there's a bad rule (though there's a high bar on that one since we're using a published set of rules).

The canonical tautomer is intended to be canonical, not "nice", or low energy under any particular set of conditions.

greglandrum avatar Nov 16 '23 18:11 greglandrum

I'm not very familiar with the codebase, but here are the rules:

        std::make_tuple(
            std::string("1,5 (thio)keto/enol f"),
            std::string("[CX4z0,NX3;!H0]-[C]=[C][Cz1H0]=[O,S,Se,Te;X1]"),
            std::string(""), std::string("")),
        std::make_tuple(std::string("1,5 (thio)keto/enol r"),
                        std::string("[O,S,Se,Te;X2!H0]-[Cz1H0]=[C]-[C]=[Cz0,N]"),
                        std::string(""), std::string("")),

I don't see the r SMARTS match in the rdkit output. There is no [O]-[C] single bond in the current output. What is the code path that chooses how the "1,5 (thio)keto/enol f" is transformed?

pechersky avatar Nov 16 '23 18:11 pechersky

Is it possible that the 1,5 rule is combining with the 1,3 rule, making it go like the following?

alpha-beta unsat keto --[1,5 rule]--> beta-gamma unsat enol --[1,3 rule]--> beta-gamma unsat keto

pechersky avatar Nov 16 '23 18:11 pechersky