sourmash
sourmash copied to clipboard
thoughts on opportunities to speed up `CounterGather` and `sourmash gather`
In https://github.com/sourmash-bio/sourmash/issues/1771, I found a nice benchmark that helped showcase some ~simple performance problems with gather
and prefetch
, since fixed in https://github.com/sourmash-bio/sourmash/pull/2123 and https://github.com/sourmash-bio/sourmash/pull/2132.
For the moment, the dominant performance challenge in gather
is MinHash.remove_many(...)
, discussed in https://github.com/sourmash-bio/sourmash/issues/1617. I don't have the time or skillset to tackle that at the moment (🦀) but it's pretty well defined. That should be our first target for future speedups. But this issue is about the rest...
Somewhat ironically (or perhaps just "typically, for programmers"...) I had already invested in wholesale refactoring of CounterGather
over in https://github.com/sourmash-bio/sourmash/pull/2116, which didn't really address any of the performance issues. I had started working on
https://github.com/sourmash-bio/sourmash/pull/2113 before performance benchmarking revealed that I shouldn't bother. 🤷 😆
Still, I think some of the ideas in https://github.com/sourmash-bio/sourmash/pull/2113 were pretty good; I just don't think it's worth implementing them yet :). This issue is meant to document them for posterity.
Things to try:
- screen out identical matches - this should now be done, actually, because
CounterGather
now uses md5sum-based keys to store matches. But, worth verifying. - try out an in-memory/on-disk
SqliteIndex
-basedCounterGather
class. A simple implementation on top ofLCA_Database
is in already present in theCounterGather_LCA
class intests/test_index_protocol.py
. This might dramatically decrease memory without costing much time. - apply the threshold_bp threshold to the entire
CounterGather
class up front. This is partly implemented in https://github.com/sourmash-bio/sourmash/pull/2113 and might make a big difference for some queries; - try actually removing hashes from the stored
MinHash
objects inCounterGather
, rather than just recalculating the intersection as is currently done. Might save a lot of memory, at the cost of storing redundantMinHash
objects. - consider adaptive thresholding (mentioned in https://github.com/sourmash-bio/sourmash/issues/1594 and https://github.com/sourmash-bio/sourmash/issues/1578) for prefetch - briefly, the idea would be to support a higher scaled in the prefetch phase of things, while loading the original (non-downsampled) sketches into
CounterGather
. Could be a really dramatic speedup. Note that settingscaled=threshold_bp/5
might be a good heuristic to try here.
Note that SRR10988543
continues to be a white whale...