sourmash
gather with protein databases is very slow
@bluegenes and I have been talking about how slow protein gather is when used with genomes x gtdb. So I did some benchmarking, albeit with a metagenome. Read on!
scaled=200
sudo time py-spy record -o protein-scaled=200.svg -- sourmash gather SRR8859675-protein.sig.gz ../gtdb-rs207.protein-reps.k10.zip
reports:
1217.80 real 1094.29 user 107.63 sys
Here, most of the time is spent in the intersection_and_union_size Rust code:
![Screen Shot 2022-07-29 at 11 19 44 AM](https://user-images.githubusercontent.com/51016/181791460-fa8cfaa6-07b3-4680-b060-7dc5c0ab63dc.png)
scaled=1000
sudo time py-spy record -o protein.svg -- sourmash gather SRR8859675-protein.sig.gz --scaled=1000 ../gtdb-rs207.protein-reps.k10.zip
reports:
362.38 real 342.36 user 14.61 sys
flamegraph:
![Screen Shot 2022-07-29 at 11 20 24 AM](https://user-images.githubusercontent.com/51016/181791812-b92f510c-9255-4180-8da6-0751638f8e87.png)
The loading time increases in relative contribution to total time, presumably because it stays constant while the time spent in intersection_and_union_size decreases so much.
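To put the two `time` reports side by side, here is a small sketch that computes the ratios from the numbers above. Going from scaled=200 to scaled=1000 keeps roughly one fifth as many hashes, so the ratios give a rough sense of how much of the runtime scales with sketch size. The dictionary names are just for illustration.

```python
# Timings copied from the two `time` reports above (seconds).
scaled_200 = {"real": 1217.80, "user": 1094.29, "sys": 107.63}
scaled_1000 = {"real": 362.38, "user": 342.36, "sys": 14.61}

# Ratio of scaled=200 time to scaled=1000 time for each category.
speedup = {key: scaled_200[key] / scaled_1000[key] for key in scaled_200}

for key, ratio in speedup.items():
    print(f"{key}: {ratio:.2f}x faster at scaled=1000")
```

Wall-clock time drops by a bit more than 3x for a 5x reduction in hashes, which fits with a roughly constant loading cost taking up a larger share of the total.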
I suspect what is happening is that there are many more matches with significant overlap when using protein k-mers, so a lot more time is spent calculating the precise size of the intersection/union.
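For anyone unfamiliar with what that calculation involves, here is a minimal Python sketch of computing intersection and union sizes for two sorted lists of hashes. This is NOT the actual sourmash Rust implementation; the function name `intersection_and_union_size_sketch` and the toy hash values are made up for illustration. The point is just that the cost grows with the sizes of both sketches, and it is paid once per candidate match.

```python
def intersection_and_union_size_sketch(a, b):
    """Return (intersection, union) sizes for two sorted lists of unique hashes.

    Illustrative only: a linear merge-style walk over both lists.
    """
    i = j = common = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Hash present in both sketches.
            common += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    # Inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|
    union = len(a) + len(b) - common
    return common, union


# Toy example with made-up hash values.
query = [2, 5, 9, 14, 21]
match = [5, 9, 10, 21, 30, 42]
result = intersection_and_union_size_sketch(query, match)
print(result)
```

If each comparison costs time proportional to the sketch sizes, then many more overlapping candidates means proportionally more time in this step.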
Note that with the same (default) --threshold-bp of 50kb, sourmash gather with DNA finds only 35 matches in GTDB genomic reps, while protein finds 2038:
36 197 12434 genome-prefetch.csv
2039 6115 755511 protein-prefetch.csv
so it seems very likely that protein search performance problems are related to finding MANY more matches.
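The line counts above include one header line per file, which is why 36 and 2039 lines correspond to 35 and 2038 matches. A small sketch for counting matches directly, assuming each CSV has a single header row followed by one row per match (`count_matches` is a hypothetical helper, not part of sourmash):

```python
import csv
import sys


def count_matches(csv_path):
    """Count data rows in a CSV file, skipping the single header row."""
    with open(csv_path, newline="") as fp:
        reader = csv.reader(fp)
        next(reader, None)  # skip the header row, if any
        return sum(1 for _ in reader)


if __name__ == "__main__":
    # Pass one or more CSV paths on the command line.
    for path in sys.argv[1:]:
        print(path, count_matches(path))
```

It can be run with the two prefetch CSV files as arguments, e.g. genome-prefetch.csv and protein-prefetch.csv.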