rd_filters

Mismatching pattern due to RDKit aromaticity model

Open. DrrDom opened this issue 2 years ago • 1 comment

I started playing with different filters and found that some of them rejected many compounds, so I began investigating individual cases. One example is the Filter82_pyridinium rule ([c,n]1[c,n][c,n][c,n][c,n]n(C)1) from the Inpharmatica set. RDKit aromatizes some compounds, like the one in the example below, even with the AROMATICITY_SIMPLE model. As a result, the SMARTS pattern matches, which I consider a false positive. Was this pattern expected to remove all such compounds, or should it apply only to compounds with a charged nitrogen ([c,n]1[c,n][c,n][c,n][c,n][n+](C)1)? Is there another workaround? Or is this more of an RDKit aromaticity model issue?

from rdkit import Chem

smi = 'COC1=C2N(C)C(=O)C3=C(OC(C)(C)C=C3)C2=CC=C1'
m = Chem.MolFromSmiles(smi, sanitize=False)
Chem.SanitizeMol(m, Chem.SANITIZE_ALL ^ Chem.SANITIZE_SETAROMATICITY)
Chem.SetAromaticity(m, Chem.AROMATICITY_SIMPLE)
print(Chem.MolToSmiles(m))

sma = '[c,n]1[c,n][c,n][c,n][c,n][n](C)1'   # Filter82_pyridinium
pat = Chem.MolFromSmarts(sma)

print(m.GetSubstructMatch(pat))

Output:

COc1cccc2c3c(c(=O)n(C)c12)C=CC(C)(C)O3
(3, 16, 9, 8, 6, 4, 5)

DrrDom, Oct 07 '21 06:10

The patterns were taken directly from ChEMBL with a few tweaks to make them work with the RDKit. One day, when I get some time, I'll do some curation. I'd be happy to accept PRs from others who can improve the pattern.

PatWalters, Oct 16 '21 12:10
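
The charged-nitrogen variant proposed in the question can be checked with a short sketch like the one below. It reuses the molecule and sanitization steps from the report and compares the original pattern against the [n+] variant. The N-methylpyridinium reference compound and the variable names are assumptions added for illustration; this is not a curated replacement for the rule in rd_filters.

```python
from rdkit import Chem

# Molecule from the report, aromatized with the simple model as in the original snippet.
smi = 'COC1=C2N(C)C(=O)C3=C(OC(C)(C)C=C3)C2=CC=C1'
mol = Chem.MolFromSmiles(smi, sanitize=False)
Chem.SanitizeMol(mol, Chem.SANITIZE_ALL ^ Chem.SANITIZE_SETAROMATICITY)
Chem.SetAromaticity(mol, Chem.AROMATICITY_SIMPLE)

# Reference compound (an assumption for illustration): N-methylpyridinium.
pyridinium = Chem.MolFromSmiles('C[n+]1ccccc1')

# Original rule and the variant that requires a positively charged ring nitrogen.
neutral_pat = Chem.MolFromSmarts('[c,n]1[c,n][c,n][c,n][c,n]n(C)1')
charged_pat = Chem.MolFromSmarts('[c,n]1[c,n][c,n][c,n][c,n][n+](C)1')

results = {
    'reported molecule, original pattern': mol.HasSubstructMatch(neutral_pat),
    'reported molecule, [n+] pattern': mol.HasSubstructMatch(charged_pat),
    'N-methylpyridinium, original pattern': pyridinium.HasSubstructMatch(neutral_pat),
    'N-methylpyridinium, [n+] pattern': pyridinium.HasSubstructMatch(charged_pat),
}

for label, hit in results.items():
    print(f'{label}: {hit}')
```

Because the reported molecule carries no formal charge, the [n+] variant cannot match it, while a genuine pyridinium cation is still caught. Whether that narrower behaviour reflects the intent of the original Inpharmatica rule is a separate curation question.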