MMseqs2
Clustering billions of sequences gets stuck on kmermatcher
Expected Behavior
I am clustering billions of protein sequences. I already built the database. I was expecting linclust to run fairly quickly, but it seems to get stuck on the initial kmermatcher step.
Current Behavior
Linclust has been stuck on the kmermatcher step for several days. I'm running it on a 64-core machine with 409 GB of memory. I see 4 cores running while it loads the index into memory, but then it drops to 1 core and stays there for several days. Any advice on what may be going on?
Steps to Reproduce (for bugs)
I ran these commands:
mmseqs createdb INPUT/unique_proteins.faa OUTPUT/stringent/tmp/seqdb --dbtype 1 --shuffle 1 --createdb-mode 1 --write-lookup 0 --id-offset 0 --compressed 0 -v 3
mmseqs linclust OUTPUT/stringent/tmp/seqdb OUTPUT/stringent/tmp/clu OUTPUT/stringent/tmp/clu_tmp --threads ${THREADS} -e 0.001 --min-seq-id 0.9 -c 0.9 --cov-mode 1 --spaced-kmer-mode 0 --remove-tmp-files 1
And it got stuck on the first kmermatcher step.
MMseqs Output (for bugs)
Just this:
kmermatcher OUTPUT/stringent/tmp/seqdb OUTPUT/stringent/tmp/clu_tmp/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size nucl:5,aa:13 --min-seq-id 0.9 --kmer-per-seq 21 --spaced-kmer-mode 0 --kmer-per-seq-scale nucl:0.200,aa:0.000 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 1 -k 0 -c 0.9 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 64 --compressed 0 -v 3

MMseqs Version: 13.45111
Substitution matrix nucl:nucleotide.out,aa:blosum62.out
Alphabet size nucl:5,aa:13
Seq. id. threshold 0.9
k-mers per sequence 21
Spaced k-mers 0
Spaced k-mer pattern
Scale k-mers per sequence nucl:0.200,aa:0.000
Adjust k-mer length false
Mask residues 0
Mask lower case residues 0
Coverage mode 1
k-mer length 0
Coverage threshold 0.9
Max sequence length 65535
Shift hash 67
Split memory limit 0
Include only extendable false
Skip repeating k-mers false
Threads 64
Compressed 0
Verbosity 3
Your Environment
I installed mmseqs using mamba (conda).
The maximum size for a single clustering run cannot exceed (2^32 - 1), which is roughly 4 billion sequences. To cluster 16 billion you need some kind of stepwise clustering, splitting the sequences into batches.
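To make the batch arithmetic concrete, here is a minimal Python sketch, not anything taken from MMseqs2 itself. It assumes the (2^32 - 1) limit quoted above applies to the number of sequences per run. The helper names `min_batches` and `split_fasta` and the output naming scheme are hypothetical, chosen only for illustration. The batch count it returns is a lower bound; memory constraints may well call for more, smaller batches.

```python
MAX_ENTRIES = 2**32 - 1  # per-run limit as stated in the reply above


def min_batches(total_sequences, max_per_batch=MAX_ENTRIES):
    """Smallest number of batches such that no batch exceeds max_per_batch."""
    if total_sequences <= 0:
        return 0
    # Integer ceiling division, avoids floating-point rounding on large counts.
    return -(-total_sequences // max_per_batch)


def split_fasta(path, n_batches, out_prefix):
    """Distribute FASTA records round-robin over n_batches output files.

    Output files are named <out_prefix>.<i>.faa (a hypothetical naming scheme).
    Returns the list of output file names.
    """
    outs = [open(f"{out_prefix}.{i}.faa", "w") for i in range(n_batches)]
    try:
        record = -1  # index of the current record, -1 before the first header
        with open(path) as fh:
            for line in fh:
                if line.startswith(">"):
                    record += 1
                if record < 0:
                    continue  # skip anything before the first header
                outs[record % n_batches].write(line)
    finally:
        for out in outs:
            out.close()
    return [out.name for out in outs]


if __name__ == "__main__":
    # 16 billion sequences need at least 4 batches under the stated limit.
    print(min_batches(16_000_000_000))
```

Each resulting file could then go through the same createdb and linclust commands shown earlier in this thread. Round-robin assignment is used here only because it keeps batch sizes balanced without needing to know the total count in advance.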
Any thoughts on how you would combine clusters across batches?