Bingo NoSQL incorrectly reports a similarity score of "1"

Open twall opened this issue 4 years ago • 4 comments

Using python epam.indigo 1.4.0b0

I've built a bingo NoSQL DB which is returning a similarity score of "1" (the max) when comparing the following different molecules (the first inchi is given as search input; both have been entered into the DB).

InChI=1S/C17H25N3O2/c18-9-14-2-1-3-20(14)15(21)10-19-16-5-12-4-13(6-16)8-17(22,7-12)11-16/h12-14,19,22H,1-8,10-11H2/t12?,13?,14-,16?,17?/m0/s1

Vildagliptin

Matching molecule:

InChI=1S/C18H24N4O2/c19-8-14-1-2-15(9-20)22(14)16(23)10-21-17-4-12-3-13(5-17)7-18(24,6-12)11-17/h12-15,21,24H,1-7,10-11H2/t12?,13?,14-,15+,17?,18?

CHEMBL207912

Bingo responds with a similarity score of 1.0, which is obviously not correct (assuming "1" means an exact match). I would expect that extra triple-bonded nitrogen to have some downward impact on the score.

Here is the code which constructs the DB:

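from indigo import bingo

# Create an empty molecule database at dbfile
db = bingo.Bingo.createDatabaseFile(indigo, dbfile, 'molecule', '')

mol = indigo.loadMolecule(inchi)
try:
    mol.standardize()
except Exception:
    # Standardization is best-effort; insert the molecule as loaded if it fails
    pass
db.insert(mol, index_key)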

Here is the code which does the similarity search:

from indigo import IndigoException, bingo

simhits = []
indigo = get_indigo()
bb = bingo.Bingo.loadDatabaseFile(indigo, db_path)
try:
    m = indigo.loadMolecule(inchi)
    matcher = bb.searchSim(m, tanimoto_min, tanimoto_max, 'tanimoto')
    # Collect (id, similarity) for every hit inside the Tanimoto range
    while matcher.next():
        simhits.append((matcher.getCurrentId(), matcher.getCurrentSimilarityValue()))
    matcher.close()
except IndigoException as e:
    logger.error(f"Can't calculate similarities on molecule '{inchi}' ({e})")
finally:
    # Close the database even if the search fails
    bb.close()
return simhits

twall, Oct 28 '20 21:10

Dear @twall, thank you for the bug report. I have reproduced the problem and plan to investigate it soon.

As a workaround, you can now use non-default fingerprint types by setting this option before creating the database:

indigo.setOption("similarity-type", sim_type)
bingo = Bingo.createDatabaseFile(indigo, dbPath, 'molecule', '')

Where sim_type is one of:

  • "sim" (default, looks buggy now)
  • "ecfp2", "ecfp4", "ecfp6", or "ecfp8" - ECFP fingerprints
  • "chem"

Unfortunately, you have to rebuild the database for the new fingerprints to take effect.

mkviatkovskii, Nov 03 '20 20:11

@mkviatkovskii thank you, is there documentation available for the sim_type options?

twall, Nov 04 '20 15:11

@mkviatkovskii I've regenerated the bingo db using the "chem" similarity type, and still get false "1.0" matches.

In this case:

Aspirin: InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)

o-Formylphenoxyacetic acid: InChI=1S/C9H8O4/c10-5-7-3-1-2-4-8(7)13-6-9(11)12/h1-5H,6H2,(H,11,12)

O-Acetyl-p-hydroxybenzoic acid: InChI=1S/C9H8O4/c1-6(10)13-8-4-2-7(3-5-8)9(11)12/h2-5H,1H3,(H,11,12)

While the last example differs only in the position of the substituents on the aromatic ring, I still wouldn't consider that an exact match. If such differences are ignored by design, the Bingo documentation should make that clear, either in the description of the similarity methods or in the fingerprint descriptions.

twall, Dec 10 '20 18:12

A similarity score of 1.0 does not necessarily mean an exact match; it only means that all hashes of all sub-chains collided. So this is a case where we should consider improving the fingerprint calculation, but it is not a bug.

mkviatkovskii, Aug 01 '22 07:08
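
The last comment's point, that a Tanimoto score of 1.0 only says the two fingerprints coincide, can be illustrated with a toy path fingerprint in plain Python. This is a sketch under stated assumptions and not Indigo's actual fingerprint algorithm: the functions `path_features` and `tanimoto`, the single-letter atom labels, and the `max_atoms` cutoff are all made up for illustration. It builds an ortho- and a para-disubstituted six-membered ring and compares the sets of short atom paths ("sub-chains") found in each.

```python
def path_features(atoms, bonds, max_atoms=3):
    """Return the label strings of all simple paths with up to max_atoms atoms.

    Toy stand-in for a path-based fingerprint (NOT Indigo's implementation).
    atoms: list of one-letter labels; bonds: list of (i, j) index pairs.
    """
    neighbours = {i: set() for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)

    features = set()

    def walk(path):
        label = "".join(atoms[i] for i in path)
        # Store one canonical spelling so path direction does not matter
        features.add(min(label, label[::-1]))
        if len(path) < max_atoms:
            for nxt in neighbours[path[-1]]:
                if nxt not in path:
                    walk(path + [nxt])

    for start in range(len(atoms)):
        walk([start])
    return features


def tanimoto(a, b):
    """Tanimoto coefficient of two feature sets: |A & B| / |A | B|."""
    return len(a & b) / len(a | b)


# Six ring carbons (indices 0-5) plus two hypothetical substituents X and Y
ATOMS = ["C"] * 6 + ["X", "Y"]
RING = [(i, (i + 1) % 6) for i in range(6)]
ortho = (ATOMS, RING + [(0, 6), (1, 7)])  # X on C0, Y on the adjacent C1
para = (ATOMS, RING + [(0, 6), (3, 7)])   # X on C0, Y on the opposite C3

short = tanimoto(path_features(*ortho, max_atoms=3), path_features(*para, max_atoms=3))
longer = tanimoto(path_features(*ortho, max_atoms=4), path_features(*para, max_atoms=4))
print(short, longer)
```

With paths of at most three atoms, no path is long enough to contain both substituents, so the two isomers produce identical feature sets and score 1.0 despite being different structures. Raising the cutoff to four atoms exposes the X-C-C-Y path that only the ortho isomer has, and the score drops below 1.0. A real fingerprint additionally folds features into a fixed number of bits, which can add further collisions.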