sourmash gather with protein databases is very slow

Open ctb opened this issue 1 year ago • 1 comments

@bluegenes and I have been talking about how slow protein gather is when used with genomes x gtdb. So I did some benchmarking, albeit with a metagenome. Read on!

scaled=200

sudo time py-spy record -o protein-scaled=200.svg -- sourmash gather SRR8859675-protein.sig.gz ../gtdb-rs207.protein-reps.k10.zip

reports:

1217.80 real      1094.29 user       107.63 sys

Here, most of the time is spent in the Rust intersection_and_union_size code (flamegraph: protein-scaled=200.svg).

scaled=1000

sudo time py-spy record -o protein.svg -- sourmash gather SRR8859675-protein.sig.gz --scaled=1000 ../gtdb-rs207.protein-reps.k10.zip

reports:

362.38 real       342.36 user        14.61 sys

flamegraph: protein.svg

Loading makes up a larger share of the total runtime here, presumably because its absolute cost stays constant while the time spent in intersection_and_union_size drops so much.
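As a rough check on that reasoning, the two timings above can be compared directly. The following is only a sketch: the numbers are copied from the two runs, the `speedup` helper is hypothetical, and the "5x" figure assumes that going from scaled=200 to scaled=1000 keeps roughly one fifth of the hashes in each sketch.

```python
# Timings (seconds) copied from the two `time` reports above.
runs = {
    200: {"real": 1217.80, "user": 1094.29, "sys": 107.63},
    1000: {"real": 362.38, "user": 342.36, "sys": 14.61},
}


def speedup(slow_scaled, fast_scaled, key="real"):
    """Ratio of the slower run's time to the faster run's time."""
    return runs[slow_scaled][key] / runs[fast_scaled][key]


real_speedup = speedup(200, 1000)

# Assumption: sketch size shrinks roughly in proportion to the scaled value.
sketch_reduction = 1000 / 200

print(f"wall-clock speedup: {real_speedup:.2f}x")
print(f"approximate sketch size reduction: {sketch_reduction:.0f}x")
```

The wall-clock speedup comes out smaller than the reduction in sketch size, which is consistent with (but does not prove) a fixed cost such as loading that does not shrink with scaled.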

I suspect that with protein k-mers, many more database sketches overlap significantly with the query, so a lot more time is spent calculating the exact size of the intersection/union.
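To make that suspicion concrete, here is a minimal pure-Python sketch of an intersection/union size calculation over sets of hashes. It is not sourmash's Rust implementation: the first function only borrows the name seen in the flamegraph for readability, and `matches_above_threshold`, `min_common`, and the toy data are hypothetical.

```python
def intersection_and_union_size(query_hashes, match_hashes):
    """Return (|A & B|, |A | B|) for two collections of hash values."""
    a, b = set(query_hashes), set(match_hashes)
    common = len(a & b)
    union = len(a) + len(b) - common
    return common, union


def matches_above_threshold(query_hashes, database, min_common):
    """Return names of database sketches sharing at least min_common hashes."""
    hits = []
    for name, hashes in database.items():
        common, _ = intersection_and_union_size(query_hashes, hashes)
        if common >= min_common:
            hits.append(name)
    return hits


# Toy data (hypothetical): small integer "hashes" instead of real sketches.
query = {1, 2, 3, 4, 5, 6}
database = {
    "close_relative": {1, 2, 3, 4, 9},
    "distant_relative": {5, 6, 10, 11},
    "unrelated": {20, 21, 22},
}

hits = matches_above_threshold(query, database, min_common=2)
print(hits)
```

In this toy model every sketch that clears the threshold costs one exact set comparison, so the total work grows with the number of candidates, which is the effect suspected above for protein k-mers.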

Jul 29 '22 15:07 ctb

Note that with the same (default) --threshold-bp of 50kb, sourmash gather with DNA finds only 35 matches in the GTDB genomic reps, while protein finds 2038 (the line counts below each include one CSV header line):

    36    197  12434 genome-prefetch.csv
  2039   6115 755511 protein-prefetch.csv

so it seems very likely that protein search performance problems are related to finding MANY more matches.
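For reference, the match counts can also be derived from the CSV files themselves rather than from line counts. This is a small sketch that assumes each prefetch CSV has a single header row followed by one row per match; `count_matches` is a hypothetical helper, not part of sourmash.

```python
import csv


def count_matches(csv_path):
    """Count data rows in a CSV file, excluding the header row."""
    with open(csv_path, newline="") as fp:
        return sum(1 for _ in csv.DictReader(fp))


# Counts reported above: line counts minus one header line per file.
dna_matches = 36 - 1
protein_matches = 2039 - 1

ratio = protein_matches / dna_matches
print(f"protein finds about {ratio:.0f}x as many matches as DNA")
```

Calling `count_matches("protein-prefetch.csv")` on the file from this comment should agree with the number above if the one-row-per-match assumption holds.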

Jul 29 '22 16:07 ctb
