MolecularGraph.jl

Bug in SMARTS queries

Open eahenle opened this issue 1 year ago • 4 comments

Given an input SMILES string and a SMARTS query from the MACCS fingerprinting scheme (we are trying to implement this fingerprinting for the package), we found the following issue:

using MolecularGraph
mol = smilestomol("CCOP(=S)(OCC)Oc1cc(C)nc(C(C)C)n1")
query = smartstomol("[#6]=[#6](~[!#6;!#1])~[!#6;!#1]")
hassubstructmatch(mol, query) # returns true, but should return false!

Looking at the substructure match in Pluto, we see this:

begin
	matched1 = Set(Iterators.flatten(keys(m) for m in substructmatches(mol, query)))
	subg1 = MolecularGraph.nodesubgraph(mol, matched1)
	svg1 = MolecularGraph.drawsvg(mol, 300, 300, highlight=subg1)
	HTML(svg1)
end

[image: the molecule drawn with the matched atoms highlighted]

This shows, I think, two problems:

  • The aromatic bonds are being assigned to single and double bonds and matched as such (even though they are also correctly identified as aromatic in other queries on the same structure)
  • The branching is not being handled correctly (the SMARTS query specifies that carbon must be bonded to two non-carbon atoms, but there is only one; it must be double-counting the nitrogen or mis-identifying the methyl carbon)

This is one example, but for this single structure, many MACCS keys return false positives.

eahenle avatar Sep 16 '22 23:09 eahenle

Thank you for the catch. The SMARTS implementation may not yet be compatible with some advanced queries. I'm working on SMARTS in the dev branch and will check the current state later.

mojaie avatar Sep 20 '22 00:09 mojaie

MACCS fingerprinting scheme (we are trying to implement this fingerprinting for the package)

I'm very happy to hear that!

mojaie avatar Sep 20 '22 00:09 mojaie

Here is the complete list of MACCS rules that return false positives for the molecule shown above.

Each rule is a tuple that gives the SMARTS query and the count of matches that must be exceeded to turn the bit "on".

Tuple{String, Int64}[
("[#6]=[#6](~[!#6;!#1])~[!#6;!#1]", 0),
("[!#6;!#1]~[CH2]~[!#6;!#1]", 0),
("[!#6;!#1;!H0]~*~[!#6;!#1;!H0]", 0),
("[!#1;!#6;!#7;!#8;!#9;!#14;!#15;!#16;!#17;!#35;!#53]", 0),
("[#6]=[#6]~[#7]", 0),
("[!#6;!#1;!H0]~*~*~*~[!#6;!#1;!H0]", 0),
("[!#6;!#1;!H0]~*~*~[!#6;!#1;!H0]", 0),
("[!#6;!#1;!H0]~[!#6;!#1;!H0]", 0),
("[!#6;!#1]~[!#6;!#1;!H0]", 0),
("[!#6;!#1]~[#7]~[!#6;!#1]", 0),
("[#6]=[#6](~*)~*", 0),
("[#6]=[#7]", 0),
("*~[CH2]~[!#6;!#1;!H0]", 0),
("[C;H2,H3][!#6;!#1][C;H2,H3]", 0),
("[\$([!#6;!#1;!H0]~*~*~[CH2]~*),\$([!#6;!#1;!H0;R]1@[R]@[R]@[CH2;R]1),\$([!#6;!#1;!H0]~[R]1@[R]@[CH2;R]1)]", 0),
("[\$([!#6;!#1;!H0]~*~*~*~[CH2]~*),\$([!#6;!#1;!H0;R]1@[R]@[R]@[R]@[CH2;R]1),\$([!#6;!#1;!H0]~[R]1@[R]@[R]@[CH2;R]1),\$([!#6;!#1;!H0]~*~[R]1@[R]@[CH2;R]1)]", 0),
("[!#6;!#1]~[CH3]", 0),
("[!#6;!#1]~[#7]", 0),
("[#6]=[#6]", 0),
("[!#6;!#1;!H0]~*~[CH2]~*", 0),
("[#7]=*", 0),
("[!#6;!#1;!H0]", 1),
("*1~*~*~*~*~*~1", 1),
("[#6]-[#7]", 0)
]
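
For reference, the thresholding rule described above (a bit is "on" only when the match count exceeds the threshold) could be sketched roughly as follows. The names `bit_is_on`, `evaluate_rules`, and `count_matches` are hypothetical and not part of MolecularGraph, and the counts in `fake_counts` are stand-in values for illustration only, not results for the molecule above.

```julia
# Hypothetical helper: a bit is "on" when the number of matches
# strictly exceeds the threshold stored in the rule.
bit_is_on(nmatches::Integer, threshold::Integer) = nmatches > threshold

# Evaluate every (SMARTS, threshold) rule with a caller-supplied counter.
# `count_matches` is an assumed callback mapping a SMARTS string to a match count.
# With MolecularGraph it could plausibly wrap `substructmatches(mol, smartstomol(q))`,
# but how matches should be counted (e.g. symmetry-equivalent mappings) is left open here.
function evaluate_rules(count_matches, rules::Vector{Tuple{String, Int}})
    return [bit_is_on(count_matches(smarts), threshold) for (smarts, threshold) in rules]
end

# Stand-in counts for illustration only (not real results for any molecule).
fake_counts = Dict(
    "[#6]=[#6]" => 0,
    "[!#6;!#1;!H0]" => 1,
    "*1~*~*~*~*~*~1" => 2,
)

rules = Tuple{String, Int}[
    ("[#6]=[#6]", 0),
    ("[!#6;!#1;!H0]", 1),
    ("*1~*~*~*~*~*~1", 1),
]

bits = evaluate_rules(q -> fake_counts[q], rules)
```

Keeping the counter as a callback separates the thresholding logic from the substructure matcher, so the same rule list can be re-checked against different package versions.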

eahenle avatar Sep 21 '22 17:09 eahenle

@eahenle The queries you listed seem to return false in the new version (I checked with v0.14.2).

mojaie avatar May 17 '23 08:05 mojaie