Indigo
Bingo NoSQL incorrectly reports a similarity score of "1"
Using Python epam.indigo 1.4.0b0.
I've built a Bingo NoSQL DB which returns a similarity score of "1" (the max) when comparing the following two different molecules (the first InChI is given as the search input; both have been entered into the DB).
InChI=1S/C17H25N3O2/c18-9-14-2-1-3-20(14)15(21)10-19-16-5-12-4-13(6-16)8-17(22,7-12)11-16/h12-14,19,22H,1-8,10-11H2/t12?,13?,14-,16?,17?/m0/s1
Matching molecule:
InChI=1S/C18H24N4O2/c19-8-14-1-2-15(9-20)22(14)16(23)10-21-17-4-12-3-13(5-17)7-18(24,6-12)11-17/h12-15,21,24H,1-7,10-11H2/t12?,13?,14-,15+,17?,18?
Bingo responds with a similarity score of 1.0, which is obviously not correct (assuming "1" means an exact match). I would expect that extra triple-bonded nitrogen to have some downward impact on the score.
Here is the code which constructs the DB:
db = bingo.Bingo.createDatabaseFile(indigo, dbfile, 'molecule', '')
mol = indigo.loadMolecule(inchi)
try:
    mol.standardize()
except Exception as e2:
    pass
db.insert(mol, index_key)
Here is the code which does the similarity search:
simhits = []
indigo = get_indigo()
bb = bingo.Bingo.loadDatabaseFile(indigo, db_path)
try:
    m = indigo.loadMolecule(inchi)
    matcher = bb.searchSim(m, tanimoto_min, tanimoto_max, 'tanimoto')
    while matcher.next():
        simhits.append((matcher.getCurrentId(), matcher.getCurrentSimilarityValue()))
    matcher.close()
    bb.close()
except IndigoException as e:
    # Log the query InChI that failed
    logger.error(f"Can't calculate similarities on molecule '{inchi}' ({e})")
return simhits
Dear @twall, thank you for the bug report. I have reproduced the problem and plan to investigate it soon.
As a workaround, you can now use a non-default fingerprint type by setting an option before creating the database:
indigo.setOption("similarity-type", sim_type)
bingo = Bingo.createDatabaseFile(indigo, dbPath, 'molecule', '')
Where sim_type is one of:
- "sim" (default, looks buggy now)
- "ecfp2", "ecfp4", "ecfp6", or "ecfp8": ECFP fingerprints
- "chem"
Unfortunately you have to rebuild the database to update the fingerprints.
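For reference, a rebuild along these lines might look like the sketch below. It only uses calls that already appear in this thread (`setOption`, `createDatabaseFile`, `loadMolecule`, `insert`, `close`). The function name `rebuild_database`, the `records` input shape, and passing the Bingo class in as `bingo_cls` are assumptions made for illustration, not part of the Indigo API. It also assumes `db_path` points to a new, empty location.

```python
def rebuild_database(indigo, bingo_cls, db_path, records, sim_type="ecfp4"):
    """Create a fresh Bingo molecule database whose fingerprints use sim_type.

    indigo    -- an Indigo session object
    bingo_cls -- the Bingo class (injected so this sketch has no hard import)
    records   -- iterable of (index_key, inchi) pairs (hypothetical input shape)
    Returns the number of molecules inserted.
    """
    # Set the fingerprint type before the database is created,
    # as in the workaround above.
    indigo.setOption("similarity-type", sim_type)
    db = bingo_cls.createDatabaseFile(indigo, db_path, "molecule", "")
    inserted = 0
    try:
        for index_key, inchi in records:
            mol = indigo.loadMolecule(inchi)
            db.insert(mol, index_key)
            inserted += 1
    finally:
        # Close the database even if one of the inserts fails.
        db.close()
    return inserted
```

Called with a real Indigo session and the Bingo class, this would write a new database file whose fingerprints match the chosen `sim_type`; queries against it then need the same option set on the session.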
@mkviatkovskii thank you, is there documentation available for the sim_type options?
@mkviatkovskii I've regenerated the bingo db using the "chem" similarity type, and still get false "1.0" matches.
In this case, the molecules involved are:
Aspirin: InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)
o-Formylphenoxyacetic acid: InChI=1S/C9H8O4/c10-5-7-3-1-2-4-8(7)13-6-9(11)12/h1-5H,6H2,(H,11,12)
O-Acetyl-p-hydroxybenzoic acid: InChI=1S/C9H8O4/c1-6(10)13-8-4-2-7(3-5-8)9(11)12/h2-5H,1H3,(H,11,12)
While the last example differs in the position of the substituents on the aromatic ring, I still wouldn't consider that an exact match. If such differences are ignored by design, the Bingo documentation should make that clear, either in the description of the similarity methods or in the fingerprint descriptions.
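For anyone checking hits like these by hand, one cheap way to see that they are not exact matches is to compare the InChI layers directly. The sketch below is plain Python and not part of Indigo or Bingo. `inchi_layers` and `differing_layers` are hypothetical helper names, and it assumes standard InChI strings like the three above.

```python
def inchi_layers(inchi):
    """Split an InChI into its version, formula, and prefixed layers (c, h, ...)."""
    parts = inchi.split("/")
    layers = {"version": parts[0], "formula": parts[1]}
    for part in parts[2:]:
        # Each remaining layer starts with a one-letter prefix such as 'c' or 'h'.
        layers[part[0]] = part[1:]
    return layers


def differing_layers(inchi_a, inchi_b):
    """Return the sorted names of the layers in which two InChIs differ."""
    a, b = inchi_layers(inchi_a), inchi_layers(inchi_b)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


ASPIRIN = "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"
FORMYL = "InChI=1S/C9H8O4/c10-5-7-3-1-2-4-8(7)13-6-9(11)12/h1-5H,6H2,(H,11,12)"
PARA = "InChI=1S/C9H8O4/c1-6(10)13-8-4-2-7(3-5-8)9(11)12/h2-5H,1H3,(H,11,12)"

print(inchi_layers(ASPIRIN)["formula"])   # C9H8O4, shared by all three
print(differing_layers(ASPIRIN, PARA))    # ['c']: same formula, different connectivity
print(differing_layers(ASPIRIN, FORMYL))  # ['c', 'h']
```

A hit scoring 1.0 whose `differing_layers` result is non-empty could then be filtered out after the similarity search.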
A similarity score of 1.0 does not necessarily mean an exact match; it only means that all hashes of all sub-chains collided. So this is a case where we should consider improving the fingerprint calculation, but it is not a bug.
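To illustrate how two different molecules can score exactly 1.0, here is a toy sketch. It is not Indigo's fingerprint algorithm: `toy_fingerprint` is a hypothetical hash that maps fragment strings onto a fixed number of bits, and the fragment lists are made up for illustration. The Tanimoto coefficient is computed as the number of shared bits divided by the number of bits set in either fingerprint.

```python
def toy_fingerprint(fragments, n_bits):
    """Hypothetical fingerprint: hash each fragment string to one of n_bits positions."""
    return {sum(map(ord, frag)) % n_bits for frag in fragments}


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    union = fp_a | fp_b
    if not union:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(fp_a & fp_b) / len(union)


# Made-up fragment lists: mol_b has one extra nitrile-like fragment.
mol_a = ["CC", "CN", "C=O"]
mol_b = ["CC", "CN", "C=O", "CC#N"]

small = tanimoto(toy_fingerprint(mol_a, 8), toy_fingerprint(mol_b, 8))
large = tanimoto(toy_fingerprint(mol_a, 1024), toy_fingerprint(mol_b, 1024))
print(small)  # 1.0: the extra fragment hashes onto a bit that is already set
print(large)  # 0.75: with more bits the extra fragment gets its own bit
```

With only 8 bits the extra fragment collides with an existing one, so both fingerprints are identical and the score is 1.0 even though the fragment lists differ.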