
setup_databases fails with "Error: indexdb died" when running mmseqs createindex step

Open ulupo opened this issue 3 years ago • 4 comments

Thanks for the awesome work on this repository! I have been trying to set up all databases locally as explained in the README, on an empty 2TB drive in my Ubuntu machine. I compiled MMseqs2 from source just a couple of days ago, with no errors or hiccups. Everything seems to run smoothly until line https://github.com/sokrypton/ColabFold/blob/15bf1c06802432296c7fab1559692a6a16e24bd7/setup_databases.sh#L27 is processed. This step fails partway through, as shown in this log output:

+ mmseqs createindex colabfold_envdb_202108_db tmp2 --remove-tmp-files 1
createindex colabfold_envdb_202108_db tmp2 --remove-tmp-files 1 

MMseqs Version:          	edb8223d1ea07385ffe63d4f103af0eb12b2058e
Seed substitution matrix 	aa:VTML80.out,nucl:nucleotide.out
k-mer length             	0
Alphabet size            	aa:21,nucl:5
Compositional bias       	1
Max sequence length      	65535
Max results per query    	300
Mask residues            	1
Mask lower case residues 	0
Spaced k-mers            	1
Spaced k-mer pattern     	
Sensitivity              	7.5
k-score                  	seq:0,prof:0
Check compatible         	0
Search type              	0
Split database           	0
Split memory limit       	0
Verbosity                	3
Threads                  	16
Min codons in orf        	30
Max codons in length     	32734
Max orf gaps             	2147483647
Contig start mode        	2
Contig end mode          	2
Orf start mode           	1
Forward frames           	1,2,3
Reverse frames           	1,2,3
Translation table        	1
Translate orf            	0
Use all table starts     	false
Offset of numeric ids    	0
Create lookup            	0
Compressed               	0
Add orf stop             	false
Overlap between sequences	0
Sequence split mode      	1
Header split mode        	0
Strand selection         	1
Remove temporary files   	true

createindex colabfold_envdb_202108_db tmp2 --remove-tmp-files 1 

MMseqs Version:          	edb8223d1ea07385ffe63d4f103af0eb12b2058e
Seed substitution matrix 	aa:VTML80.out,nucl:nucleotide.out
k-mer length             	0
Alphabet size            	aa:21,nucl:5
Compositional bias       	1
Max sequence length      	65535
Max results per query    	300
Mask residues            	1
Mask lower case residues 	0
Spaced k-mers            	1
Spaced k-mer pattern     	
Sensitivity              	7.5
k-score                  	seq:0,prof:0
Check compatible         	0
Search type              	0
Split database           	0
Split memory limit       	0
Verbosity                	3
Threads                  	16
Min codons in orf        	30
Max codons in length     	32734
Max orf gaps             	2147483647
Contig start mode        	2
Contig end mode          	2
Orf start mode           	1
Forward frames           	1,2,3
Reverse frames           	1,2,3
Translation table        	1
Translate orf            	0
Use all table starts     	false
Offset of numeric ids    	0
Create lookup            	0
Compressed               	0
Add orf stop             	false
Overlap between sequences	0
Sequence split mode      	1
Header split mode        	0
Strand selection         	1
Remove temporary files   	true

indexdb colabfold_envdb_202108_db colabfold_envdb_202108_db --seed-sub-mat aa:VTML80.out,nucl:nucleotide.out -k 0 --alph-size aa:21,nucl:5 --comp-bias-corr 1 --max-seq-len 65535 --max-seqs 300 --mask 1 --mask-lower-case 0 --spaced-kmer-mode 1 -s 7.5 --k-score seq:0,prof:0 --check-compatible 0 --search-type 0 --split 0 --split-memory-limit 0 -v 3 --threads 16 

Target split mode. Searching through 34 splits
Estimated memory consumption: 31G
Write VERSION (0)
Write META (1)
Write SCOREMATRIX3MER (4)
Write SCOREMATRIX2MER (3)
Write SCOREMATRIXNAME (2)
Write SPACEDPATTERN (23)
Write GENERATOR (22)
Write DBR1INDEX (5)
Write DBR1DATA (6)
Write DBR2INDEX (7)
Killed
Error: indexdb died

Perhaps this is an issue better raised in the MMseqs2 repo, I'm not sure.

ulupo avatar Dec 05 '21 10:12 ulupo

It occurs because the process runs out of memory. It worked for me when I ran it on a machine with more memory (>160 GB).
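
One way to catch this before starting the long indexing run is to compare the memory the machine reports as available with the amount you expect to need. Below is a minimal Python sketch, assuming a Linux host where `/proc/meminfo` exposes a `MemAvailable` line. The function names are made up for illustration, and the 160 GB figure is only the number quoted above, not a measured requirement.

```python
import re

def mem_available_gb(meminfo_text):
    """Return MemAvailable from /proc/meminfo-style text in GB (10**9 bytes).

    Returns None if the field is missing. /proc/meminfo reports values
    in units of 1024 bytes, labelled "kB".
    """
    match = re.search(r"^MemAvailable:\s+(\d+)\s+kB", meminfo_text, flags=re.MULTILINE)
    if match is None:
        return None
    return int(match.group(1)) * 1024 / 1e9

def has_enough_memory(meminfo_text, required_gb):
    """True if the reported available memory is at least required_gb."""
    available = mem_available_gb(meminfo_text)
    return available is not None and available >= required_gb

def read_meminfo(path="/proc/meminfo"):
    """Read the kernel's memory summary (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

For example, `has_enough_memory(read_meminfo(), 160)` returns `False` on a 64 GB machine, which would be a hint to try a larger machine before running `mmseqs createindex`.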

LiorZ avatar Dec 07 '21 11:12 LiorZ

@LiorZ thanks for your reply! Wow, ok! Here was me thinking I would make it happen with 64GB. Is there any plan to mitigate this?

If not, I would suggest making a short note in the README. I'm happy to help if needed.

ulupo avatar Dec 07 '21 11:12 ulupo

Online searches: Our ColabFold server has ~760 GB of RAM and keeps the full database and index in memory. Batch searches: A batch search requires less memory, but it is still approximately 1 byte per residue, so I would assume you need at least 90 GB. We still need to figure out what the lower bound is for this database.
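
The rule of thumb above (roughly 1 byte per residue for batch searches) can be turned into a quick back-of-the-envelope calculation. This is a minimal sketch: the function name is hypothetical, and the residue count in the example is an illustrative input chosen to match the 90 GB figure, not a measured size of the database.

```python
def estimate_batch_search_ram_gb(num_residues, bytes_per_residue=1.0):
    """Rough RAM estimate in GB (10**9 bytes) for a batch search.

    Uses the ~1 byte per residue rule of thumb; the true lower bound
    for a given database may differ.
    """
    if num_residues < 0:
        raise ValueError("num_residues must be non-negative")
    return num_residues * bytes_per_residue / 1e9

# Illustrative input only: 90 billion residues at 1 byte each.
print(estimate_batch_search_ram_gb(90_000_000_000))
```

Treat the result as a rough floor rather than a guarantee, since the index itself and other working memory come on top of the sequence data.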

martin-steinegger avatar Dec 08 '21 07:12 martin-steinegger

Thanks @martin-steinegger for your clarification! I don't know much about these things but I lazily wonder if memory mapping could help limit RAM usage. For now, is there any ColabFold functionality I can use locally? Given that things worked up to and including https://github.com/sokrypton/ColabFold/blob/b93680f62a305951cb7aa402903b34f595a6156a/setup_databases.sh#L54

ulupo avatar Dec 08 '21 08:12 ulupo