ColabFold
setup_databases fails with "Error: indexdb died" when running mmseqs createindex step
Thanks for the awesome work on this repository! I have been trying to set up all databases locally as explained in the README, on an empty 2 TB drive in my Ubuntu machine. I compiled MMseqs2
from source just a couple of days ago, with no errors or hiccups. Everything seems to run smoothly until line
https://github.com/sokrypton/ColabFold/blob/15bf1c06802432296c7fab1559692a6a16e24bd7/setup_databases.sh#L27
is processed. This step fails partway through, as shown in this output:
+ mmseqs createindex colabfold_envdb_202108_db tmp2 --remove-tmp-files 1
createindex colabfold_envdb_202108_db tmp2 --remove-tmp-files 1
MMseqs Version: edb8223d1ea07385ffe63d4f103af0eb12b2058e
Seed substitution matrix aa:VTML80.out,nucl:nucleotide.out
k-mer length 0
Alphabet size aa:21,nucl:5
Compositional bias 1
Max sequence length 65535
Max results per query 300
Mask residues 1
Mask lower case residues 0
Spaced k-mers 1
Spaced k-mer pattern
Sensitivity 7.5
k-score seq:0,prof:0
Check compatible 0
Search type 0
Split database 0
Split memory limit 0
Verbosity 3
Threads 16
Min codons in orf 30
Max codons in length 32734
Max orf gaps 2147483647
Contig start mode 2
Contig end mode 2
Orf start mode 1
Forward frames 1,2,3
Reverse frames 1,2,3
Translation table 1
Translate orf 0
Use all table starts false
Offset of numeric ids 0
Create lookup 0
Compressed 0
Add orf stop false
Overlap between sequences 0
Sequence split mode 1
Header split mode 0
Strand selection 1
Remove temporary files true
createindex colabfold_envdb_202108_db tmp2 --remove-tmp-files 1
MMseqs Version: edb8223d1ea07385ffe63d4f103af0eb12b2058e
Seed substitution matrix aa:VTML80.out,nucl:nucleotide.out
k-mer length 0
Alphabet size aa:21,nucl:5
Compositional bias 1
Max sequence length 65535
Max results per query 300
Mask residues 1
Mask lower case residues 0
Spaced k-mers 1
Spaced k-mer pattern
Sensitivity 7.5
k-score seq:0,prof:0
Check compatible 0
Search type 0
Split database 0
Split memory limit 0
Verbosity 3
Threads 16
Min codons in orf 30
Max codons in length 32734
Max orf gaps 2147483647
Contig start mode 2
Contig end mode 2
Orf start mode 1
Forward frames 1,2,3
Reverse frames 1,2,3
Translation table 1
Translate orf 0
Use all table starts false
Offset of numeric ids 0
Create lookup 0
Compressed 0
Add orf stop false
Overlap between sequences 0
Sequence split mode 1
Header split mode 0
Strand selection 1
Remove temporary files true
indexdb colabfold_envdb_202108_db colabfold_envdb_202108_db --seed-sub-mat aa:VTML80.out,nucl:nucleotide.out -k 0 --alph-size aa:21,nucl:5 --comp-bias-corr 1 --max-seq-len 65535 --max-seqs 300 --mask 1 --mask-lower-case 0 --spaced-kmer-mode 1 -s 7.5 --k-score seq:0,prof:0 --check-compatible 0 --search-type 0 --split 0 --split-memory-limit 0 -v 3 --threads 16
Target split mode. Searching through 34 splits
Estimated memory consumption: 31G
Write VERSION (0)
Write META (1)
Write SCOREMATRIX3MER (4)
Write SCOREMATRIX2MER (3)
Write SCOREMATRIXNAME (2)
Write SPACEDPATTERN (23)
Write GENERATOR (22)
Write DBR1INDEX (5)
Write DBR1DATA (6)
Write DBR2INDEX (7)
Killed
Error: indexdb died
Perhaps this issue would be better raised in the MMseqs2 repo; I'm not sure.
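This happens because the process runs out of memory: the "Killed" line means indexdb was terminated, most likely by the kernel's out-of-memory killer. It worked for me when I ran the process on a machine with more memory (>160 GB).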
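A quick way to check for this kind of shortfall before starting the index build is to compare the memory Linux reports as available with the amount you expect to need. The following is only a minimal sketch, assuming a Linux host that exposes /proc/meminfo. The helper names are hypothetical, and the 160 GiB threshold is simply the machine size mentioned in this thread, not a documented MMseqs2 requirement. Note that the log above estimated 31G and the run was still killed, so treat any threshold as a rough guess.

```python
import re
from pathlib import Path


def parse_mem_available_kib(meminfo_text: str) -> int:
    """Return the MemAvailable value (in KiB) from /proc/meminfo-style text."""
    match = re.search(r"^MemAvailable:\s+(\d+)\s*kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemAvailable not found in meminfo text")
    return int(match.group(1))


def has_enough_memory(meminfo_text: str, required_gib: float) -> bool:
    """Check whether the available memory covers an assumed requirement in GiB."""
    available_gib = parse_mem_available_kib(meminfo_text) / (1024 ** 2)
    return available_gib >= required_gib


if __name__ == "__main__":
    meminfo_path = Path("/proc/meminfo")  # Linux only
    if meminfo_path.exists():
        text = meminfo_path.read_text()
        # 160 is an assumption taken from this thread, not an official figure.
        if has_enough_memory(text, required_gib=160):
            print("Available memory meets the assumed threshold.")
        else:
            print("Available memory is below the assumed threshold; indexdb may be killed.")
```

If the run is killed anyway, the kernel log (for example via `dmesg`) usually records whether the out-of-memory killer was responsible.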
@LiorZ thanks for your reply! Wow, ok! Here was me thinking I would make it happen with 64GB. Is there any plan to mitigate this?
If not, I would suggest making a short note in the README. I'm happy to help if needed.
Online searches: Our ColabFold server has ~760 GB of RAM and keeps the full database and index in memory.
Batch searches: A batch search requires less memory, but it is still approximately 1 byte per residue, so I would assume you need at least 90 GB. We still need to figure out what the lower bound is for this database.
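The "1 byte per residue" rule of thumb can be turned into a small calculation. This is a sketch under stated assumptions: the function names are hypothetical, and the residue count in the example is not a measured database size but simply the number that the 90 GB figure implies under that rule.

```python
def estimate_batch_search_memory_gb(num_residues: int, bytes_per_residue: float = 1.0) -> float:
    """Estimate RAM in GB (10^9 bytes) from a residue count and a bytes-per-residue rule."""
    return num_residues * bytes_per_residue / 1e9


def fits_in_ram(num_residues: int, ram_gb: float, bytes_per_residue: float = 1.0) -> bool:
    """Check whether the estimated requirement fits in the given amount of RAM."""
    return estimate_batch_search_memory_gb(num_residues, bytes_per_residue) <= ram_gb


if __name__ == "__main__":
    # Illustrative only: 90 billion residues is what "at least 90 GB" implies
    # at 1 byte per residue; it is not a count taken from the database itself.
    implied_residues = 90_000_000_000
    print(estimate_batch_search_memory_gb(implied_residues))
    print(fits_in_ram(implied_residues, ram_gb=64))
```

Under this estimate, a 64 GB machine like the one described earlier in the thread would fall short for batch searches as well.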
Thanks @martin-steinegger for your clarification! I don't know much about these things but I lazily wonder if memory mapping could help limit RAM usage. For now, is there any ColabFold
functionality I can use locally? Given that things worked up to and including
https://github.com/sokrypton/ColabFold/blob/b93680f62a305951cb7aa402903b34f595a6156a/setup_databases.sh#L54