
Creating index for ColabFoldDB failed on cluster.

Open IlyesAbdelhamid opened this issue 2 years ago • 4 comments

Hello,

I've been encountering an issue when creating the index for the ColabFoldDB. It looks like a memory consumption issue. Could you help me with this matter, please? Thank you in advance for your help.

Sincerely, Ilyes

Expected Behavior

An index file of the colabfold_envdb_202108_db is computed for a fast read-in.

Current Behavior

Error: indexdb died slurmstepd: error: Detected 1 oom-kill event(s) in StepId=27501792.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

Steps to Reproduce (for bugs)

I am using the following commands to build the databases, as indicated here: https://colabfold.mmseqs.com/. UniRef30 was successful but the ColabFoldDB was not.

wget https://raw.githubusercontent.com/sokrypton/ColabFold/main/setup_databases.sh
chmod +x setup_databases.sh
./setup_databases.sh database/

MMseqs Output (for bugs)

  • ARIA_NUM_CONN=8
  • WORKDIR=database/
  • cd database/ ++ pwd
  • export PATH=/lustre/ssd/ws/iabdelha-IA-AF-SSD-workspace/alphafold/alphafold_output/Output_test_running_time/database/mmseqs/bin/:/lustre/ssd/ws/iabdelha-IA-AF-SSD-workspace/alphafold/data/colabfold_batch/bin:/usr/lib64/qt-3.3/bin:/sw/taurus/tools/slurmtools/default/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
  • PATH=/lustre/ssd/ws/iabdelha-IA-AF-SSD-workspace/alphafold/alphafold_output/Output_test_running_time/database/mmseqs/bin/:/lustre/ssd/ws/iabdelha-IA-AF-SSD-workspace/alphafold/data/colabfold_batch/bin:/usr/lib64/qt-3.3/bin:/sw/taurus/tools/slurmtools/default/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
  • STRATEGY=
  • hasCommand aria2c
  • command -v aria2c
  • hasCommand curl
  • command -v curl
  • STRATEGY=' CURL'
  • hasCommand wget
  • command -v wget
  • STRATEGY=' CURL WGET'
  • '[' ' CURL WGET' = '' ']'
  • '[' '!' -f COLABDB_READY ']'
  • mmseqs createindex colabfold_envdb_202108_db tmp2 --remove-tmp-files 1 --split 1

createindex colabfold_envdb_202108_db tmp2 --remove-tmp-files 1 --split 1

MMseqs Version:               3b9cf88179737563acfdb83b516c0b5219cc531e
Seed substitution matrix      aa:VTML80.out,nucl:nucleotide.out
k-mer length                  0
Alphabet size                 aa:21,nucl:5
Compositional bias            1
Compositional bias            1
Max sequence length           65535
Max results per query         300
Mask residues                 1
Mask residues probability     0.9
Mask lower case residues      0
Spaced k-mers                 1
Spaced k-mer pattern
Sensitivity                   7.5
k-score                       seq:0,prof:0
Check compatible              0
Search type                   0
Split database                1
Split memory limit            0
Verbosity                     3
Threads                       56
Min codons in orf             30
Max codons in length          32734
Max orf gaps                  2147483647
Contig start mode             2
Contig end mode               2
Orf start mode                1
Forward frames                1,2,3
Reverse frames                1,2,3
Translation table             1
Translate orf                 0
Use all table starts          false
Offset of numeric ids         0
Create lookup                 0
Compressed                    0
Add orf stop                  false
Overlap between sequences     0
Sequence split mode           1
Header split mode             0
Strand selection              1
Remove temporary files        true


indexdb colabfold_envdb_202108_db colabfold_envdb_202108_db --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -k 0 --alph-size aa:21,nucl:5 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-seq-len 65535 --max-seqs 300 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --spaced-kmer-mode 1 -s 7.5 --k-score seq:0,prof:0 --check-compatible 0 --search-type 0 --split 1 --split-memory-limit 0 -v 3 --threads 56

Estimated memory consumption: 780G
Write VERSION (0)
Write META (1)
Write SCOREMATRIX3MER (4)
Write SCOREMATRIX2MER (3)
Write SCOREMATRIXNAME (2)
Write SPACEDPATTERN (23)
Write GENERATOR (22)
Write DBR1INDEX (5)
Write DBR1DATA (6)
Write DBR2INDEX (7)
Write DBR2DATA (8)
Write HDR1INDEX (18)
Write HDR1DATA (19)
Write ALNINDEX (24)
Write ALNDATA (25)
Index table: counting k-mers
[=================================================================
tmp2/7152678087979496025/createindex.sh: line 56: 37309 Killed                  "$MMSEQS" $INDEXER "$INPUT" "$INPUT" ${INDEX_PAR}
Error: indexdb died
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=27501792.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

Your Environment

I am running the script on a cluster. You will find below the batch script parameters:

#!/bin/bash
#SBATCH --job-name Install_ColabFold_DB
##SBATCH --account=def-someuser
#SBATCH --time 24:00:00        ### (HH:MM:SS) the job will expire after this time, the maximum is 168:00:00
#SBATCH -N 1                   ### number of nodes (1 node -> several CPUs)
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 24
#SBATCH --mem-per-cpu 10000
#SBATCH -A p_linkpredic
##SBATCH -e %j.err             ### redirects stderr to this file
##SBATCH -o %j.out             ### redirects standard output stdout to this file
#SBATCH -p haswell             ### types of nodes on taurus: west, dandy, smp, gpu
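For reference: with Slurm's default unit of MB for --mem-per-cpu, this allocation comes to 24 × 10000 MB ≈ 240 GB per node, well below the ~780G that the indexdb step above estimates it needs, which is consistent with the oom-kill.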

IlyesAbdelhamid avatar Aug 01 '22 09:08 IlyesAbdelhamid

You need a machine with 1 TB of RAM to create a precomputed index for the ColabFoldDB.

Are you actually planning to run a lot of small queries (like the ColabFold server)? Or are you just planning to run colabfold_search/colabfold_batch with a bunch of proteins at the same time?

If it's the second, I recommend not creating an index at all. A search without a precomputed index builds the index on the fly and has much lower resource requirements.

Precomputing the index only makes sense for something like our API server, where we repeatedly serve many small queries and want to pay the indexing cost only once.
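To make the two usage patterns concrete, here is a minimal sketch, assuming the database/ directory created by setup_databases.sh above and placeholder names (queries.fasta, msas/) for the query file and output directory:

# Batch use: no precomputed index. colabfold_search builds the index it needs
# on the fly during the search, so the ~780G estimated above is not needed up front.
colabfold_search queries.fasta database/ msas/

# Server-style use: pay the indexing cost once, then reuse the .idx files for
# many small queries (run from inside database/; this is the step that was
# oom-killed on the cluster).
mmseqs createindex colabfold_envdb_202108_db tmp2 --remove-tmp-files 1 --split 1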

milot-mirdita avatar Aug 01 '22 09:08 milot-mirdita

Thank you for the prompt reply! OK, I see. The idea is to run colabfold_search/colabfold_batch with a bunch of proteins at the same time. I've been using the API server, but some of my jobs ran into rate limits. To avoid this issue, I decided to build the databases and search against them locally.

Sincerely, Ilyes

IlyesAbdelhamid avatar Aug 01 '22 09:08 IlyesAbdelhamid

Then I would recommend deleting the already created precomputed index (rm *.idx*) and just using colabfold_search without the precomputed index.

milot-mirdita avatar Aug 01 '22 09:08 milot-mirdita

I wanted to compare the running time of the MSA search against the databases locally with that of the API server. Thus, I provided colabfold_search with a FASTA file containing two protein sequences. It has been running for over two hours now with the option --db-load-mode 3, while the Colab server managed a time of 45 min. Is there any way to make the local MSA search as fast as the remote server?
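The exact invocation is not part of the report; a representative command of the kind being timed here, with placeholder file and directory names, would be:

time colabfold_search --db-load-mode 3 two_proteins.fasta database/ msas/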

Sincerely, Ilyes

IlyesAbdelhamid avatar Aug 01 '22 14:08 IlyesAbdelhamid