genomad

[feature request] query database clustering

Open • valentynbez opened this issue 9 months ago • 1 comment

Since MMseqs2 is already part of the pipeline, it would be nice to have an option to cluster the query database before aligning it against the marker database. Phage genes are highly redundant, so even deduplication at 100% identity might shorten the computation time. Afterwards, the results can be mapped back from the cluster representatives to the original sequences for downstream classification. This would not be useful for small datasets, but for larger ones it could reduce computation time and RAM usage significantly.
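To illustrate the idea, here is a minimal Python sketch of the 100% deduplication case and of mapping hits back from representatives to all original sequences. It is not geNomad's code and does not call MMseqs2; the function names (`deduplicate`, `map_back`), the toy sequences, and the placeholder "search" step are all hypothetical, and exact string matching stands in for real clustering.

```python
from collections import defaultdict


def deduplicate(sequences):
    """Group identical sequences; return (representatives, members).

    The first ID seen for each distinct sequence becomes its representative.
    """
    rep_for_seq = {}             # sequence string -> representative ID
    members = defaultdict(list)  # representative ID -> IDs sharing its sequence
    representatives = {}         # representative ID -> sequence string
    for seq_id, seq in sequences.items():
        rep_id = rep_for_seq.setdefault(seq, seq_id)
        if rep_id == seq_id:
            representatives[seq_id] = seq
        members[rep_id].append(seq_id)
    return representatives, dict(members)


def map_back(rep_hits, members):
    """Copy each representative's hit to every member of its cluster."""
    return {
        member_id: hit
        for rep_id, hit in rep_hits.items()
        for member_id in members[rep_id]
    }


# Toy query set (hypothetical): gene_a and gene_c are identical.
queries = {
    "gene_a": "MKTAYIAKQR",
    "gene_b": "MSDNELLKRA",
    "gene_c": "MKTAYIAKQR",
}
representatives, members = deduplicate(queries)

# Placeholder for the marker search, run on the representatives only.
rep_hits = {rep_id: f"marker_for_{rep_id}" for rep_id in representatives}

# Every original sequence inherits the hit of its representative.
all_hits = map_back(rep_hits, members)
```

In this toy case only two of the three sequences would be searched, and `gene_c` inherits the hit found for `gene_a`. With clustering below 100% identity the same mapping applies, but members may differ from their representative, which is where annotation reliability comes into question.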

valentynbez • Apr 28 '24 09:04

Great idea!

This would require a lot of benchmarks to evaluate:

  • Whether this should be turned on by default (maybe just for really big datasets) or whether the performance improvement is only noticeable for data where we expect to see a lot of redundancy.
  • The similarity/coverage cutoffs, which would affect both the execution time and the annotation reliability.

Adding this would also involve rewriting a good amount of code, so it's not something that I can implement quickly. But I really like the idea and will evaluate it for future releases.

apcamargo • May 11 '24 03:05