QCFractal "Loose" molecule index

The current molecule hash is effectively an identity check on the following fields:

Exact matches on "symbols", "multiplicity", "real", "fragments", "fragment_charges", and "fragment_multiplicities".
1.e-6 match on "mass"
1.e-4 match on "charge"
1.e-8 match on "geometry"
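
As a rough illustration of how such a hash could be assembled, the sketch below rounds the float fields to the listed number of decimals and hashes a JSON dump of the result. This is not the actual QCFractal implementation: the function names, the use of SHA-1, and the dictionary layout of `mol` are all assumptions made for the example. Rounding also only approximates a tolerance match, since two values that straddle a rounding boundary will still hash differently.

```python
import hashlib
import json

# Decimal places taken from the tolerances listed above (assumed mapping).
ROUNDED_FIELDS = {"mass": 6, "charge": 4, "geometry": 8}
EXACT_FIELDS = (
    "symbols", "multiplicity", "real",
    "fragments", "fragment_charges", "fragment_multiplicities",
)


def _round(value, ndigits):
    """Round a float or a (nested) list of floats to ndigits decimals."""
    if isinstance(value, (list, tuple)):
        return [_round(v, ndigits) for v in value]
    # Adding 0.0 turns -0.0 into 0.0 so both serialize identically.
    return round(float(value), ndigits) + 0.0


def molecule_hash(mol):
    """Hypothetical identity hash over a plain-dict molecule."""
    payload = {key: mol[key] for key in EXACT_FIELDS}
    for key, ndigits in ROUNDED_FIELDS.items():
        payload[key] = _round(mol[key], ndigits)
    blob = json.dumps(payload, sort_keys=True)
    return hashlib.sha1(blob.encode("utf-8")).hexdigest()
```

Under these assumptions, two molecules whose geometries differ by 1e-10 produce the same hash, while a 1e-3 shift produces a different one.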

This hash should effectively be a unique index that allows a quick search for identical molecules. For more approximate searches that may return many molecules, several new molecule hashes have been suggested:

atom_count = "".join([sym + str(symbols.count(sym)) for sym in sorted(set(symbols))]), e.g., C6H6 for benzene. Very simple, and it possibly sieves queries down dramatically. (The symbols are sorted because set iteration order is not deterministic.)
A hash similar to the current molecule_hash, but with a canonical symbol order, orientation, and center of mass, and a loose geometry match (~1e-2). This hash would allow quick identification of similar molecules.
SMILES - How do we obtain a "canonical" ordering that is deterministic?
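
A minimal sketch of the atom_count idea from the first suggestion, using `collections.Counter` so each symbol is counted in a single pass. The alphabetical ordering is an assumption chosen only to make the string deterministic. It is not Hill order, and a count of 1 is written out explicitly.

```python
from collections import Counter


def atom_count(symbols):
    """Return a deterministic composition string such as 'C6H6'."""
    counts = Counter(symbols)
    # Sort the symbols so the result does not depend on input order.
    return "".join(f"{sym}{counts[sym]}" for sym in sorted(counts))
```

Because the string depends only on composition, every isomer and conformer of a formula shares one value, which is what makes it useful as a coarse sieve before a finer comparison.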

Note that a hash search for a molecule is O(log(N)) while a direct comparison is O(N). As the database project expects on the order of 1e8 to 1e9 molecules, a direct comparison against every entry is impractical at that scale.
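
To make the scaling argument concrete, here is a small sketch of an O(log(N)) lookup over a sorted list of hashes using the standard `bisect` module. A real database would use its own index structure. The `build_index` and `lookup` names are hypothetical and serve only to illustrate the idea.

```python
import bisect


def build_index(hashes):
    """Sort (hash, row_id) pairs once so later lookups can bisect."""
    return sorted((h, i) for i, h in enumerate(hashes))


def lookup(index, target):
    """Return row ids whose hash equals target, found by binary search."""
    # (target, -1) sorts before every real entry that has this hash.
    pos = bisect.bisect_left(index, (target, -1))
    rows = []
    while pos < len(index) and index[pos][0] == target:
        rows.append(index[pos][1])
        pos += 1
    return rows
```

For N = 1e9 entries, the binary search needs roughly 30 comparisons, whereas a direct comparison would touch every entry.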

New indices should be discussed and added on an as-needed basis.

Aug 17 '18 18:08 dgasmith

Came across chemfp this weekend. Unfortunately, the free version doesn't have py3 support. The author is the fellow whose blog posts taught me how to do the hover-over-graph-to-get-molecule-img-popup for BFDb. Seemed potentially relevant.

Aug 21 '18 21:08 loriab

Thanks, it looks interesting. Running a function on a query and every molecule in the database would be too expensive. We may be able to get by with a sieve + something like this, however.

On that note, we will probably bake something like cmiles in.

Aug 21 '18 21:08 dgasmith

Sure. The cmiles definitely looks more tightly focused. At least the chemfp is recorded, and I've closed a browser tab. :-)

Aug 21 '18 22:08 loriab