sourmash icon indicating copy to clipboard operation
sourmash copied to clipboard

thoughts on opportunities to speed up `CounterGather` and `sourmash gather`

Open ctb opened this issue 1 year ago • 0 comments

In https://github.com/sourmash-bio/sourmash/issues/1771, I found a nice benchmark that helped showcase some ~simple performance problems with gather and prefetch, since fixed in https://github.com/sourmash-bio/sourmash/pull/2123 and https://github.com/sourmash-bio/sourmash/pull/2132.

For the moment, the dominant performance challenge in gather is MinHash.remove_many(...), discussed in https://github.com/sourmash-bio/sourmash/issues/1617. I don't have the time or skillset to tackle that at the moment (🦀) but it's pretty well defined. That should be our first target for future speedups. But this issue is about the rest...

Somewhat ironically (or perhaps just "typically, for programmers"...) I had already invested in wholesale refactoring of CounterGather over in https://github.com/sourmash-bio/sourmash/pull/2116, which didn't really address any of the performance issues. I had started working on https://github.com/sourmash-bio/sourmash/pull/2113 before performance benchmarking revealed that I shouldn't bother. 🤷 😆

Still, I think some of the ideas in https://github.com/sourmash-bio/sourmash/pull/2113 were pretty good; I just don't think it's worth implementing them yet :). This issue is meant to document them for posterity.

Things to try:

  • screen out identical matches - this should now be done, actually, because CounterGather now uses md5sum-based keys to store matches. But, worth verifying.
  • try out an in-memory/on-disk SqliteIndex-based CounterGather class. A simple implementation on top of LCA_Database is in already present in the CounterGather_LCA class in tests/test_index_protocol.py. This might dramatically decrease memory without costing much time.
  • apply the threshold_bp threshold to the entire CounterGather class up front. This is partly implemented in https://github.com/sourmash-bio/sourmash/pull/2113 and might make a big difference for some queries;
  • try actually removing hashes from the stored MinHash objects in CounterGather, rather than just recalculating the intersection as is currently done. Might save a lot of memory, at the cost of storing redundant MinHash objects.
  • consider adaptive thresholding (mentioned in https://github.com/sourmash-bio/sourmash/issues/1594 and https://github.com/sourmash-bio/sourmash/issues/1578) for prefetch - briefly, the idea would be to support a higher scaled in the prefetch phase of things, while loading the original (non-downsampled) sketches into CounterGather. Could be a really dramatic speedup. Note that setting scaled=threshold_bp/5 might be a good heuristic to try here.

Note that SRR10988543 continues to be a white whale...

ctb avatar Jul 24 '22 11:07 ctb