MMseqs2 Easy-linclust runs for more than 24 hours on genome nucleotide sequences

Open kevfly16 opened this issue 1 year ago • 2 comments

I expected easy-linclust to cluster nucleotide genome sequences (e.g., WGS records, contigs, scaffolds, complete genomes) in a few hours.

The following easy-linclust command ran for more than 24 hours and spent over 20 of those hours in the rescorediagonal step:

mmseqs easy-linclust input.fna cluster tmp --min-seq-id 0.9 -c 0.9 --alignment-mode 3 --db-load-mode 2 --split-memory-limit 2000G

input.fna is 642 GB and contains approximately 2.5 million sequences.
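
For scale, 642 GB across roughly 2.5 million sequences works out to about 250 kb per sequence on average, assuming nearly all of the file is sequence data. The sketch below is one way such numbers could be checked on a FASTA file; it is plain Python, not part of MMseqs2, and the helper name `fasta_stats` is made up for illustration.

```python
import io

def fasta_stats(handle):
    """Return (number of records, total residues) for a FASTA text stream."""
    n_seqs = 0
    n_bases = 0
    for line in handle:
        if line.startswith(">"):
            # Header line: starts a new record.
            n_seqs += 1
        else:
            # Sequence line: count residues, ignoring the trailing newline.
            n_bases += len(line.strip())
    return n_seqs, n_bases

# Tiny in-memory example; for a real file, pass open("input.fna") instead.
example = io.StringIO(">a\nACGT\nAC\n>b\nGGG\n")
n_seqs, n_bases = fasta_stats(example)
print(n_seqs, n_bases, n_bases / n_seqs)  # prints: 2 9 4.5
```

The function streams line by line, so memory use stays flat regardless of file size, although a single pass over a file this large would still take a while.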

I want to cluster nucleotide genome sequences from NCBI to reduce the burden on similarity search for highly similar / redundant sequences.

MMseqs Version: Commit f5f780acd64482cd59b46eae0a107f763cd17b4d (statically-compiled AVX2)
Machine: 128 CPUs, 4 TB RAM, 2 x 1.9 TB NVMe SSD
Operating system and version: Amazon Linux 2

Jun 30 '23 14:06 kevfly16

Hi, how many hours does it take to finish?

Dec 14 '23 11:12 LittletreeZou

@LittletreeZou Unfortunately, without a progress bar I couldn't tell how much longer it needed to finish. I killed it after it had run for ~30 hours.

Dec 14 '23 15:12 kevfly16
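
Since the thread turns on not knowing how long each step takes, one generic option is to timestamp the tool's output as it arrives. The following is a minimal sketch, assuming the tool writes its log to stdout/stderr; `run_with_timestamps` is a hypothetical helper, not an MMseqs2 feature, and it does not estimate remaining time.

```python
import subprocess
import sys
import time

def run_with_timestamps(cmd, out=sys.stdout):
    """Run cmd, prefixing each output line with elapsed seconds.

    Returns (exit code, list of stamped lines).
    """
    start = time.monotonic()
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same stream
        text=True,
    )
    stamped_lines = []
    for line in proc.stdout:
        elapsed = time.monotonic() - start
        stamped = f"[{elapsed:9.1f}s] {line.rstrip()}"
        print(stamped, file=out, flush=True)
        stamped_lines.append(stamped)
    return proc.wait(), stamped_lines

if __name__ == "__main__":
    # Harmless demo command; in practice the argument list would be the
    # mmseqs invocation from this issue, split into separate arguments.
    code, lines = run_with_timestamps([sys.executable, "-c", "print('hello')"])
    print("exit code:", code)
```

Redirecting the stamped output to a file would leave a record of when each step began, which makes it easier to report how long a step such as rescorediagonal has been running.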