Database I/O and conformer/fingerprint storage

Open aparente-nurix opened this issue 6 years ago • 3 comments

I'd like to use e3fp fingerprints on a very large database of molecules (~millions, possibly billions).

I was wondering if you had any benchmarks on speed and conformer/fingerprint storage sizes. What's the largest dataset you've applied this to?

Thanks!

aparente-nurix, Aug 15 '19 01:08

The most comprehensive benchmarks we've run with E3FP are in Table S3 and Figure S10 of the supplement of the paper. I've included them below:

[Screenshots: Table S3 and Figure S10 from the paper's supplement]

The code that ran these benchmarks is here: https://github.com/keiserlab/e3fp-paper/tree/master/project/benchmark.

As you can see, we haven't rigorously benchmarked on more than 308,315 molecules (ChEMBL20). The runtime should scale linearly with database size: when we scaled from 10,000 to 308,315 molecules, E3FP still took on average ~83s and ~0.7s per molecule for conformer generation and fingerprinting, respectively. Fingerprinting runtime scales sub-linearly with the number of heavy atoms, but conformer generation scales super-linearly with it. So if your database contains very large, flexible molecules (e.g. peptides), these will tend to take a long time in conformer generation and could tie up all of your processors.
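To get a feel for what those per-molecule averages imply at your scale, here is a rough back-of-the-envelope sketch. It assumes the ~83s and ~0.7s figures are single-core times, that scaling stays linear well beyond ChEMBL20, and that work parallelizes perfectly across cores. None of that has been benchmarked, and the function name is made up for illustration.

```python
# Per-molecule averages quoted above (ChEMBL20 benchmark), in seconds.
# Assumption: these are single-core times.
CONFGEN_SECONDS_PER_MOL = 83.0
FPRINT_SECONDS_PER_MOL = 0.7

SECONDS_PER_DAY = 86400.0


def estimate_wall_time_days(num_molecules, num_cores):
    """Rough wall-clock estimate in days (hypothetical helper).

    Assumes linear scaling with database size and perfect
    parallelism across cores, so treat the result as a lower bound.
    """
    if num_molecules < 0:
        raise ValueError("num_molecules must be non-negative")
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    cpu_seconds = num_molecules * (CONFGEN_SECONDS_PER_MOL + FPRINT_SECONDS_PER_MOL)
    return cpu_seconds / num_cores / SECONDS_PER_DAY


if __name__ == "__main__":
    for n in (10**6, 10**9):
        days = estimate_wall_time_days(n, num_cores=1000)
        print(f"{n:>13,} molecules on 1000 cores: ~{days:,.1f} days")
```

Under these assumptions conformer generation dominates the total, so a database skewed toward large, flexible molecules would take longer than this estimate suggests.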

sethaxen, Aug 16 '19 18:08

Regarding storage sizes, I haven't run any benchmarks in this area. E3FP's default storage approach is described here. Since it's just a light wrapper of a scipy.sparse.csr_matrix, its performance will be limited by that format. On the databases we've used, we are able to hold the whole database in memory until fingerprinting is completed, at which point we write it to a file. I suspect a database with fingerprints of billions of molecules will exceed the memory of most machines, so a different storage option will probably be necessary, perhaps something like HDF5. I'm happy to take suggestions and pull requests in this area.
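For a rough sense of when in-memory storage stops being feasible, the sketch below estimates the footprint of a CSR matrix from its three arrays: `data` and `indices` (one entry per stored nonzero) and `indptr` (one entry per row, plus one). The 100 "on" bits per fingerprint and the byte widths are purely illustrative assumptions, not measured E3FP values, and the helper name is hypothetical. Note that one row per conformer, rather than per molecule, would multiply the row count accordingly.

```python
def csr_bytes(num_rows, nnz_per_row, data_bytes=1, index_bytes=4, indptr_bytes=4):
    """Approximate memory of a CSR matrix in bytes (hypothetical helper).

    CSR keeps three arrays: data (nnz entries), indices (nnz entries)
    and indptr (num_rows + 1 entries). Byte widths are assumptions;
    very large matrices need 8-byte indices.
    """
    nnz = num_rows * nnz_per_row
    return nnz * (data_bytes + index_bytes) + (num_rows + 1) * indptr_bytes


if __name__ == "__main__":
    # Illustrative assumption: ~100 stored bits per fingerprint row.
    million = csr_bytes(10**6, 100)
    # At a billion rows the nonzero count exceeds the 32-bit range,
    # so assume 8-byte indices and row pointers.
    billion = csr_bytes(10**9, 100, index_bytes=8, indptr_bytes=8)
    print(f"1e6 rows: ~{million / 1e9:.2f} GB")
    print(f"1e9 rows: ~{billion / 1e9:.0f} GB")
```

Under these assumptions a million rows fit comfortably in memory, while a billion rows would need on the order of a terabyte, which is where a chunked on-disk format becomes attractive.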

sethaxen, Aug 16 '19 19:08

great points. a couple thoughts:

  • for conformer generation, if speed is a concern, you might consider commercial packages like omega; e3fp doesn't fundamentally rely on our particular choice of confgen tool.
  • for more flexible storage formats, perhaps n5 or zarr

mjke, Aug 16 '19 19:08