
Running `sourmash gather` in parallel/on many queries

Open krastegar opened this issue 3 years ago • 5 comments

Hi,

I have been struggling to find an option for running the `sourmash gather` command with multiple threads. Is this not available?

krastegar avatar May 30 '22 00:05 krastegar

hi @krastegar, the final part of the gather algorithm itself is not directly parallelizable, or at least not easily so. But there are things you can do. Read on...

As of #1370, the default implementation of gather consists of two parts: first, a sweep across the provided databases that finds all genomes that have an overlap with the query, and then an implementation of a greedy minimum set cover algorithm that chooses a (much) smaller set of non-redundant matches. The second part can't easily be parallelized.
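To make the two phases concrete, here is a toy sketch in Python. It is not sourmash's actual implementation: plain integers stand in for k-mer hashes, and the genome names and the `sweep` / `greedy_set_cover` helpers are made up for illustration.

```python
def sweep(query, database):
    """Phase 1: keep only the genomes that share at least one hash with the query."""
    return {name: hashes for name, hashes in database.items() if query & hashes}


def greedy_set_cover(query, candidates):
    """Phase 2: repeatedly pick the candidate that covers the most remaining hashes."""
    remaining = set(query)
    chosen = []
    while remaining:
        # Sort the names so that ties are broken deterministically.
        best = max(
            sorted(candidates),
            key=lambda name: len(remaining & candidates[name]),
            default=None,
        )
        if best is None or not remaining & candidates[best]:
            break  # nothing left in the candidates overlaps the query
        chosen.append((best, len(remaining & candidates[best])))
        # The next pick depends on what is subtracted here.
        remaining -= candidates[best]
    return chosen


# Toy data (hypothetical): integers stand in for k-mer hashes.
query = {1, 2, 3, 4, 5, 6, 7, 8}
database = {
    "genomeA": {1, 2, 3, 4, 5},
    "genomeB": {4, 5, 6},
    "genomeC": {1, 2, 3},    # fully redundant with genomeA
    "genomeD": {100, 101},   # no overlap with the query
}

candidates = sweep(query, database)
result = greedy_set_cover(query, candidates)
print(result)  # [('genomeA', 5), ('genomeB', 1)]
```

Each pass through the loop subtracts the chosen match from the query before the next pick is made, so the picks have to happen one after another. The sweep has no such dependency between genomes.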

However, the first part can be done in parallel using the prefetch command. The idea is that you can do the sweep across different subsets or shards of the search databases in parallel, and then combine the results and feed them back into gather.
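As a rough illustration of the sharding idea, here is a toy Python sketch that sweeps several shards independently and merges the matches. Everything in it is an assumption made for illustration (integer "hashes", in-memory dictionaries as shards, the `sweep_shard` and `parallel_sweep` names). In practice each shard would more likely be handled by its own `sourmash prefetch` process.

```python
from concurrent.futures import ThreadPoolExecutor


def sweep_shard(query, shard):
    """Return the genomes in one shard that overlap the query."""
    return {name: hashes for name, hashes in shard.items() if query & hashes}


def parallel_sweep(query, shards, max_workers=4):
    """Sweep every shard independently, then merge the per-shard matches."""
    merged = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for matches in pool.map(lambda shard: sweep_shard(query, shard), shards):
            merged.update(matches)
    return merged


# Toy data (hypothetical): three shards of one database.
query = {1, 2, 3, 4, 5, 6}
shards = [
    {"genomeA": {1, 2, 3}, "genomeX": {50, 51}},
    {"genomeB": {4, 5}, "genomeY": {60}},
    {"genomeC": {6, 70}},
]

combined = parallel_sweep(query, shards)
print(sorted(combined))  # ['genomeA', 'genomeB', 'genomeC']
```

The merged matches play the role of the much smaller database that gets handed to the sequential gather step.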

This is not yet implemented natively in sourmash (see #1752 for details on this effort), but you can hack it together at the command line or use a workflow system to do it - see https://github.com/sourmash-bio/sourmash/issues/1664 if you're interested in a snakemake workflow. There are better/faster ways of doing this now, but I haven't updated #1664 with those; let me know if you're interested.

The only real problem with this solution is that it's not that much faster if you have a lot of matches - that is, if the prefetch sweep returns more than 30% of the database, as can happen with human microbiomes, I wouldn't expect a big speed increase.

ctb avatar May 31 '22 13:05 ctb

Thanks for getting back to me @ctb,

Sorry for not being specific with my question. I have a relatively large number of k-mer signatures and I was tasked with running gather on every single one. I was able to speed up the process by using GNU parallel:

`ls *.sig | parallel -j33 --verbose --max-args=1 'sourmash gather {} gtdb-rs207.genomic.k31.lca.json.gz -o {}.csv'`

Since this would otherwise have been a sequential process, parallel worked great (and it's a nice one-liner). Hopefully this can help anyone else who runs into the issue.
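For anyone without GNU parallel installed, roughly the same thing can be sketched in Python. This is only a sketch under assumptions: the `gather_command` and `run_all` helpers are hypothetical names, and the database filename and job count are simply copied from the command above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Database filename taken from the GNU parallel command above.
DATABASE = "gtdb-rs207.genomic.k31.lca.json.gz"


def gather_command(sig_path, database=DATABASE):
    """Build the same `sourmash gather` invocation that the one-liner runs."""
    return ["sourmash", "gather", str(sig_path), database, "-o", f"{sig_path}.csv"]


def run_all(sig_dir=".", jobs=33):
    """Run one independent gather per *.sig file, at most `jobs` at a time."""
    sigs = sorted(Path(sig_dir).glob("*.sig"))
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # Threads only wait on the child processes; the real work happens in sourmash.
        results = pool.map(lambda sig: subprocess.run(gather_command(sig)), sigs)
        return [r.returncode for r in results]


# Show the command for a single (hypothetical) signature file.
print(gather_command("sample1.sig"))
# run_all()  # uncomment to actually launch the jobs (requires sourmash on PATH)
```

Each query is independent of the others, which is why running one process per signature scales so simply.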

krastegar avatar Jun 01 '22 01:06 krastegar

excellent, glad to hear it!

ctb avatar Jun 01 '22 12:06 ctb

https://github.com/sourmash-bio/pyo3_branchwater is a plugin with a fast (multithreaded) implementation of multigather.

ctb avatar Sep 04 '23 13:09 ctb

as of sourmash_plugin_branchwater v0.9.5, `sourmash scripts fastmultigather` is a feature-complete multithreaded multi-query gather, and `sourmash scripts fastgather` is a feature-complete multithreaded single-query gather 🎉

I'll close this once I update the FAQ and documentation here.

ctb avatar Jun 29 '24 19:06 ctb