genomad
[feature request] query database clustering
As MMseqs2 is already part of the pipeline, it would be nice to have an option to cluster the query database before aligning it against the marker database.
Phage genes are often redundant, so even deduplication at 100% identity might shorten the computation time. Afterwards, the results can be mapped back from the cluster representatives to the original sequences for downstream classification.
It is not a useful feature for small datasets, but for larger ones, it can reduce computational time and RAM usage significantly.
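To make the idea concrete, here is a minimal sketch of the 100% deduplication and map-back steps in plain Python. It is not geNomad code and does not call MMseqs2: the function names (`deduplicate`, `expand_hits`), the sequence IDs, and the `marker_A` hit are all hypothetical, and exact string matching stands in for a real clustering step.

```python
from collections import defaultdict


def deduplicate(proteins):
    """Collapse identical sequences.

    Returns (representatives, members), where `representatives` maps each
    representative ID to its sequence and `members` maps each representative
    ID to every original ID that shares that exact sequence.
    """
    first_id_for_seq = {}
    members = defaultdict(list)
    for seq_id, seq in proteins.items():
        # The first ID seen for a sequence becomes its representative.
        rep_id = first_id_for_seq.setdefault(seq, seq_id)
        members[rep_id].append(seq_id)
    representatives = {rep_id: proteins[rep_id] for rep_id in members}
    return representatives, dict(members)


def expand_hits(rep_hits, members):
    """Copy each representative's hit to every sequence it stands for."""
    return {
        seq_id: hit
        for rep_id, hit in rep_hits.items()
        for seq_id in members[rep_id]
    }


# Hypothetical input: two identical proteins and one unique protein.
proteins = {
    "contig1_1": "MKTAYIAK",
    "contig2_3": "MKTAYIAK",
    "contig3_2": "MSLLNQ",
}
representatives, members = deduplicate(proteins)

# Pretend only the representatives were searched against the marker database.
rep_hits = {"contig1_1": "marker_A"}
all_hits = expand_hits(rep_hits, members)
```

Only the representatives would need to be aligned, and `expand_hits` restores one result per original sequence so that downstream classification sees the same input it does today.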
Great idea!
This would require a lot of benchmarks to evaluate:
- Whether this should be turned on by default (maybe just for really big datasets) or if the performance improvement is only noticeable for data where we expect to see a lot of redundancy.
- The similarity/coverage cutoffs, which would affect both the execution time and the annotation reliability.
Adding this would also involve rewriting a good amount of code, so it's not something that I can implement quickly. But I really like the idea and will evaluate it for future releases.