MMseqs2
mmseqs much slower than the MMseqs2 MSA server
Expected Behavior
The analysis finished in minutes on the MMseqs2 MSA server using ColabFold.
Current Behavior
Local mmseqs always stalls for hours without generating any output.
Steps to Reproduce (for bugs)
Please make sure to execute the reproduction steps with newly recreated and empty tmp folders.
I am using `colab_search`, which calls mmseqs like `search search_results/qdb db/uniref30_2103_db search_results/res search_results/tmp --num-iterations 3 --db-load-mode 2 -a -s 8 -e 0.1 --max-seqs 10000 --split 8`. The query contains 4 amino acid sequences, each 493 amino acids long.
NOTE: when I removed `--split 8`, I also observed that mmseqs halts at a certain stage.
MMseqs Output (for bugs)
search search_results/qdb db/uniref30_2103_db search_results/res search_results/tmp --num-iterations 3 --db-load-mode 2 -a -s 8 -e 0.1 --max-seqs 10000 --split 8

MMseqs Version: b768f48f0bd73688b6a68132159a97f7b6310f03
Substitution matrix aa:blosum62.out,nucl:nucleotide.out
Add backtrace true
Alignment mode 2
Alignment mode 0
Allow wrapped scoring false
E-value threshold 0.1
Seq. id. threshold 0
Min alignment length 0
Seq. id. mode 0
Alternative alignments 0
Coverage threshold 0
Coverage mode 0
Max sequence length 65535
Compositional bias 1
Max reject 2147483647
Max accept 2147483647
Include identical seq. id. false
Preload mode 2
Pseudo count a substitution:1.100,context:1.400
Pseudo count b substitution:4.100,context:5.800
Score bias 0
Realign hits false
Realign score bias -0.2
Realign max seqs 2147483647
Correlation score weight 0
Gap open cost aa:11,nucl:5
Gap extension cost aa:1,nucl:2
Zdrop 40
Threads 72
Compressed 0
Verbosity 3
Seed substitution matrix aa:VTML80.out,nucl:nucleotide.out
Sensitivity 8
k-mer length 0
k-score seq:2147483647,prof:2147483647
Alphabet size aa:21,nucl:5
Max results per query 10000
Split database 8
Split mode 2
Split memory limit 0
Diagonal scoring true
Exact k-mer matching 0
Mask residues 1
Mask residues probability 0.9
Mask lower case residues 0
Minimum diagonal score 15
Spaced k-mers 1
Spaced k-mer pattern
Local temporary path
Rescore mode 0
Remove hits by seq. id. and coverage false
Sort results 0
Mask profile 1
Profile E-value threshold 0.1
Global sequence weighting false
Allow deletions false
Filter MSA 1
Use filter only at N seqs 0
Maximum seq. id. threshold 0.9
Minimum seq. id. 0.0
Minimum score per column -20
Minimum coverage 0
Select N most diverse seqs 1000
Pseudo count mode 0
Gap pseudo count 10
Min codons in orf 30
Max codons in length 32734
Max orf gaps 2147483647
Contig start mode 2
Contig end mode 2
Orf start mode 1
Forward frames 1,2,3
Reverse frames 1,2,3
Translation table 1
Translate orf 0
Use all table starts false
Offset of numeric ids 0
Create lookup 0
Add orf stop false
Overlap between sequences 0
Sequence split mode 1
Header split mode 0
Chain overlapping alignments 0
Merge query 1
Search type 0
Search iterations 3
Start sensitivity 4
Search steps 1
Exhaustive search mode false
Filter results during exhaustive search 0
Strand selection 1
LCA search mode false
Disk space limit 0
MPI runner
Force restart with latest tmp false
Remove temporary files false

prefilter search_results/qdb db/uniref30_2103_db.idx search_results/tmp/12005814431969335264/pref_0 --sub-mat aa:blosum62.out,nucl:nucleotide.out --seed-sub-mat aa:VTML80.out,nucl:nucleotide.out -s 8 -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 10000 --split 8 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 2 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 72 --compressed 0 -v 3

Index version: 16
Generated by: b768f48f0bd73688b6a68132159a97f7b6310f03
ScoreMatrix: VTML80.out
Query database size: 190 type: Aminoacid
Estimated memory consumption: 148G
Target database size: 29291635 type: Aminoacid
Process prefiltering step 1 of 1
k-mer similarity threshold: 96
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 190
Target db start 1 to 29291635
^CTraceback (most recent call last): ] 37.57% 72 eta 0s
I had to stop it as mmseqs took hours without progress.
Context
I am quite puzzled what I should do to figure this out.
The machine is located on our cluster, so there is enough disk space and memory.
I checked the process status, and it is always in the `D` state with 100-200% CPU usage (based on `htop` output).
Not sure how I can speed things up at this stage.
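For anyone hitting the same symptom: on Linux, a `D` state means uninterruptible sleep, which usually indicates the process is blocked on I/O rather than computing. Below is a minimal shell sketch for confirming that outside of `htop`. The helper name `proc_state` is made up for illustration, and it only assumes the standard `/proc/<pid>/status` layout.

```shell
# Hypothetical helper (not part of MMseqs2): print the one-letter process
# state from a /proc/<pid>/status-style file. "D" = uninterruptible sleep.
proc_state() {
    awk '/^State:/ { print $2; exit }' "$1"
}

# Example usage (assumes a running process named mmseqs):
#   pid=$(pgrep -n mmseqs)
#   proc_state "/proc/$pid/status"
```

If the state stays at `D` across repeated checks while CPU usage remains far below the thread count, the job is most likely waiting on the file system.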
Your Environment
Include as many relevant details about the environment you experienced the bug in.
- Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters): b768f48f0bd73688b6a68132159a97f7b6310f03
- Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.): self-compiled
- For self-compiled and Homebrew: Compiler and Cmake versions used and their invocation: gcc 6.1
- Server specifications (especially CPU support for AVX2/SSE and amount of system memory): Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz, supports AVX2/SSE, 503 G total memory (`free -g`)
- Operating system and version: Red Hat Enterprise Linux Server release 7.6 (Maipo)
The issue is probably related to the file system. I will close for now.
I changed `--db-load-mode` from 2 to 3, and the performance improved a lot.
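For reference, the rerun amounts to the original command with only the load mode changed. The sketch below wraps it in a shell function; `run_search` and the `MMSEQS` override variable are illustrative names, not part of MMseqs2 or ColabFold, and the paths are the ones from the report above.

```shell
# Illustrative wrapper: rerun the search from the report with a chosen
# --db-load-mode (defaults to 3). MMSEQS can point at another binary.
run_search() {
    load_mode="${1:-3}"
    "${MMSEQS:-mmseqs}" search \
        search_results/qdb db/uniref30_2103_db \
        search_results/res search_results/tmp \
        --num-iterations 3 --db-load-mode "$load_mode" \
        -a -s 8 -e 0.1 --max-seqs 10000 --split 8
}

# Example: time run_search 3
```

As the issue template asks, start each run with a freshly emptied tmp folder so that timings for mode 2 and mode 3 are comparable.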
Where can I find the documentation for the option `--db-load-mode`? I just want to understand this better.
Here you can read more about MMseqs2: https://github.com/soedinglab/MMseqs2/wiki
I read the wiki and the User Guide.
Although there are examples that use `--db-load-mode 2`, none mentions or explains `--db-load-mode 3`.
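One place where the modes do appear is the tool's own help text. From memory, the help entry for `--db-load-mode` reads roughly `0: auto, 1: fread, 2: mmap, 3: mmap+touch`, but please verify this against your own build. A small sketch for pulling that line out follows; `option_help` is a made-up helper, and the usage line assumes `mmseqs search -h` prints the option list.

```shell
# Hypothetical helper: filter help text on stdin down to the lines that
# mention one option. -F matches literally; "--" stops grep from parsing
# the pattern itself as a flag.
option_help() {
    grep -F -- "$1"
}

# Example usage (assumes `mmseqs search -h` prints the option list):
#   mmseqs search -h | option_help '--db-load-mode'
```

If that reading is right, mode 3 maps the database like mode 2 but also touches the pages up front, which would explain the difference on a slow or networked file system.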
I think I ran into the same problem as you, and my HPC node is similar to yours: it kept running for almost 17 hours with no progress. When you set `--db-load-mode 3` and reran it, how long did it take before you saw any output?
Any answer would be helpful! Thanks!