"Loose" molecule index
The current molecule hash is effectively an identity check over the following fields:
- Exact matches on "symbols", "multiplicity", "real", "fragments", "fragment_charges", and "fragment_multiplicities".
- 1.e-6 match on "mass"
- 1.e-4 match on "charge"
- 1.e-8 match on "geometry"
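As a rough illustration of how a hash with per-field tolerances like the ones listed above could be built, here is a minimal sketch. It is not the project's actual implementation: the dictionary layout, the helper `rounded`, the use of JSON plus SHA-1, and the approach of approximating each tolerance by rounding to a fixed number of decimals are all assumptions made for the example.

```python
import hashlib
import json

# Field names come from the list above; the dict layout is an assumption.
EXACT_FIELDS = ("symbols", "multiplicity", "real", "fragments",
                "fragment_charges", "fragment_multiplicities")
# Tolerances approximated as "round to this many decimals" (assumption).
ROUNDED_FIELDS = {"mass": 6, "charge": 4, "geometry": 8}


def rounded(value, decimals):
    """Recursively round numbers inside nested lists (hypothetical helper)."""
    if isinstance(value, (list, tuple)):
        return [rounded(v, decimals) for v in value]
    # Adding 0.0 turns -0.0 into 0.0 so both serialize identically.
    return round(float(value), decimals) + 0.0


def molecule_hash(mol, rounded_fields=ROUNDED_FIELDS):
    """Hash a molecule dict after rounding its floating-point fields."""
    payload = {key: mol[key] for key in EXACT_FIELDS}
    for key, decimals in rounded_fields.items():
        payload[key] = rounded(mol[key], decimals)
    # sort_keys gives a deterministic serialization to feed the hash.
    text = json.dumps(payload, sort_keys=True)
    return hashlib.sha1(text.encode("utf-8")).hexdigest()
```

One caveat with this design: rounding is only an approximation of a tolerance match, since two values that differ by less than the tolerance can still fall on opposite sides of a rounding boundary and hash differently.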
This hash should be effectively a unique index allowing for a quick search of identical molecules. For more approximate searches that may return many molecules there have been several suggestions of new molecule hashes:
- `atom_count = "".join(sym + str(symbols.count(sym)) for sym in sorted(set(symbols)))`, e.g., `C6H6` for benzene. Very simple and possibly sieves queries down dramatically.
- Similar to the current `molecule_hash`, but with a canonical symbol order, orientation, center of mass, and a loose geometry match (~1e-2). This hash would allow quick identification of similar molecules.
- SMILES - How do we obtain a "canonical" ordering that is deterministic?
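The `atom_count` suggestion can be sketched as a small function. This is only an illustration: the function name and the alphabetical ordering of symbols are assumptions (a Hill-style ordering would be an equally valid choice), and the count is always written out, so water comes back as `H2O1` rather than `H2O`.

```python
from collections import Counter


def atom_count(symbols):
    """Build a composition string such as 'C6H6' from a list of symbols.

    Symbols are sorted alphabetically (an assumption) so that the result
    does not depend on the atom order in the input.
    """
    counts = Counter(symbols)
    return "".join(f"{sym}{counts[sym]}" for sym in sorted(counts))


benzene = ["C"] * 6 + ["H"] * 6
print(atom_count(benzene))
```

Because the string depends only on composition, it could serve as a cheap first-pass sieve: only molecules sharing the query's `atom_count` would need a more expensive comparison.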
Note that a hash search of a molecule is O(log(N)) while a direct comparison is O(N). As the database project expects on the order of 1e8 to 1e9 molecules, this difference makes direct comparison against every molecule impractical.
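To make the scaling argument concrete, the sketch below contrasts a lookup in a sorted list of hashes with a linear scan. It is a toy stand-in for a database index, not how the database itself works; the function names are hypothetical.

```python
import bisect


def build_index(hashes):
    """Sort the hashes once so later lookups can use binary search."""
    return sorted(hashes)


def indexed_contains(index, target):
    """O(log N) membership test on a sorted list via binary search."""
    i = bisect.bisect_left(index, target)
    return i < len(index) and index[i] == target


def scan_contains(hashes, target):
    """O(N) membership test that checks every entry in turn."""
    return any(h == target for h in hashes)
```

For N = 1e9, binary search needs about 30 comparisons (log2(1e9) is roughly 29.9), whereas a scan may touch all 1e9 entries.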
New indices should be discussed and added on an as-needed basis.
Came across chemfp this weekend. Unfortunately, the free version doesn't have py3 support. The author is the fellow whose blog posts taught me how to do the hover-over-graph-to-get-molecule-img-popup for BFDb. Seemed potentially relevant.
Thanks, it looks interesting. Running a function on a query and every molecule in the database would be too expensive. We may be able to get by with a sieve + something like this, however.
On that note, we will probably bake something like cmiles in.
Sure. The cmiles definitely looks more tightly focused. At least the chemfp is recorded, and I've closed a browser tab. :-)