Clustering billions of sequences gets stuck on kmermatcher

Open durrantmm opened this issue 3 years ago • 2 comments

Expected Behavior

I am clustering billions of protein sequences. I already built the database. I was expecting linclust to run fairly quickly, but it seems to get stuck on the initial kmermatcher step.

Current Behavior

Linclust has been stuck on the kmermatcher step for several days. I'm running it on a 64-core machine with 409 GB of memory. I see 4 cores running while it loads the index into memory, but then it drops to 1 core and stays there for several days. Any advice on what may be going on?

Steps to Reproduce (for bugs)

I ran these commands:

mmseqs createdb INPUT/unique_proteins.faa OUTPUT/stringent/tmp/seqdb --dbtype 1 --shuffle 1 --createdb-mode 1 --write-lookup 0 --id-offset 0 --compressed 0 -v 3

mmseqs linclust OUTPUT/stringent/tmp/seqdb OUTPUT/stringent/tmp/clu OUTPUT/stringent/tmp/clu_tmp --threads ${THREADS} -e 0.001 --min-seq-id 0.9 -c 0.9 --cov-mode 1 --spaced-kmer-mode 0 --remove-tmp-files 1

And it got stuck on the first kmermatcher step.

MMseqs Output (for bugs)

Just this:

kmermatcher OUTPUT/stringent/tmp/seqdb OUTPUT/stringent/tmp/clu_tmp/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size nucl:5,aa:13 --min-seq-id 0.9 --kmer-per-seq 21 --spaced-kmer-mode 0 --kmer-per-seq-scale nucl:0.200,aa:0.000 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 1 -k 0 -c 0.9 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 64 --compressed 0 -v 3

MMseqs Version:                 13.45111
Substitution matrix             nucl:nucleotide.out,aa:blosum62.out
Alphabet size                   nucl:5,aa:13
Seq. id. threshold              0.9
k-mers per sequence             21
Spaced k-mers                   0
Spaced k-mer pattern
Scale k-mers per sequence       nucl:0.200,aa:0.000
Adjust k-mer length             false
Mask residues                   0
Mask lower case residues        0
Coverage mode                   1
k-mer length                    0
Coverage threshold              0.9
Max sequence length             65535
Shift hash                      67
Split memory limit              0
Include only extendable         false
Skip repeating k-mers           false
Threads                         64
Compressed                      0
Verbosity                       3

Your Environment

I installed mmseqs using mamba (conda).

durrantmm, Aug 03 '21 21:08

A single clustering run cannot contain more than 2^32 - 1 sequences, which is roughly 4 billion. To cluster 16 billion sequences you need some kind of stepwise clustering: split them into batches and cluster each batch separately.
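
As a rough illustration of the batching idea, here is a minimal bash sketch that computes how many batches a given sequence count needs under the 2^32 - 1 limit and splits a FASTA file into consecutive batches. The function names `batches_needed` and `split_fasta` and the `<prefix>_<n>.faa` naming scheme are hypothetical, not part of MMseqs2, and the sketch assumes the input begins with a `>` header line. It only covers the splitting; it says nothing about how to combine clusters across batches.

```shell
#!/usr/bin/env bash
# Per-run limit quoted above: 2^32 - 1 sequences (needs 64-bit shell arithmetic).
MAX_SEQS=$(( (1 << 32) - 1 ))

# Print the minimum number of batches for a total sequence count
# (ceiling division by MAX_SEQS).
batches_needed() {
    local total=$1
    echo $(( (total + MAX_SEQS - 1) / MAX_SEQS ))
}

# Split a FASTA file into consecutive batches of at most per_batch records.
# Batches are written to <prefix>_0.faa, <prefix>_1.faa, ...
# (hypothetical naming scheme).
split_fasta() {
    local input=$1 per_batch=$2 prefix=$3
    awk -v n="$per_batch" -v p="$prefix" '
        /^>/ {
            # Start a new output file every n records.
            if (count % n == 0) {
                if (out != "") close(out)
                out = sprintf("%s_%d.faa", p, count / n)
            }
            count++
        }
        out != "" { print > out }
    ' "$input"
}
```

For example, `batches_needed 16000000000` prints `4`, so 16 billion sequences need at least four batches. Each resulting file could then go through the same `createdb` and `linclust` commands shown in the report. In practice it is probably wise to pick a `per_batch` value comfortably below the limit rather than exactly at it.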

martin-steinegger, Aug 04 '21 16:08

Any thoughts on how you would combine clusters across batches?

durrantmm, Aug 04 '21 18:08