MMseqs2
Easy-linclust runs for more than 24 hours on genome nucleotide sequences
Expected Behavior
Cluster nucleotide genome sequences (e.g., wgs records, contigs, scaffolds, complete genomes) in a few hours using easy-linclust
Current Behavior
Running easy-linclust with the following command took more than 24 hours, over 20 of which were spent at the rescorediagonal step:
mmseqs easy-linclust input.fna cluster tmp --min-seq-id 0.9 -c 0.9 --alignment-mode 3 --db-load-mode 2 --split-memory-limit 2000G
input.fna is 642 GB and contains approximately 2.5 million sequences.
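Taking those figures at face value and assuming roughly one byte per base, the average sequence is about 642e9 / 2.5e6 ≈ 257 kb long. For anyone who wants exact numbers for a comparable input, here is a minimal sketch that streams a FASTA file and reports the record count, total bases, and mean length. `fasta_stats` is a hypothetical helper written for illustration, not part of MMseqs2, and it assumes plain uncompressed FASTA.

```python
import sys
from typing import Iterable, Tuple


def fasta_stats(lines: Iterable[str]) -> Tuple[int, int, float]:
    """Return (record count, total residues, mean residues per record).

    Hypothetical helper for illustration; assumes plain FASTA text where
    header lines start with '>' and all other non-empty lines are sequence.
    """
    n_records = 0
    n_residues = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            n_records += 1  # a new record begins at each header
        else:
            n_residues += len(line)  # sequence may wrap over many lines
    mean_len = n_residues / n_records if n_records else 0.0
    return n_records, n_residues, mean_len


if __name__ == "__main__" and len(sys.argv) > 1:
    # Stream the file line by line so memory use stays constant.
    with open(sys.argv[1]) as handle:
        count, total, mean = fasta_stats(handle)
    print(f"records={count} residues={total} mean_length={mean:.1f}")
```

The function takes any iterable of lines, so it works on an open file handle as well as on an in-memory list. On a file of this size a pure-Python pass will itself be slow, so treat it as a way to characterize a subset rather than the full input.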
Context
I want to cluster nucleotide genome sequences from NCBI to reduce the burden on similarity search for highly similar / redundant sequences.
Your Environment
- MMseqs Version: Commit f5f780acd64482cd59b46eae0a107f763cd17b4d (statically-compiled AVX2)
- Machine: 128 CPUs, 4 TB RAM, 2 x 1.9 TB NVMe SSD
- Operating system and version: Amazon Linux 2
Hi, how many hours does it take to finish?
@LittletreeZou Unfortunately, without a progress bar I couldn't tell how much longer it needed to finish. I killed it after it had run for ~30 hours.