openfold
Error in precompute_alignments_mmseqs.py
Hi there, I installed OpenFold and followed the commands below to generate MSAs for my own data, but I ran into this error. On my Linux VM (with aws, aria2, and mmseqs installed beforehand), I ran:
1. installation
git clone https://github.com/aqlaboratory/openfold.git
cd openfold
scripts/install_third_party_dependencies.sh
source scripts/activate_conda_env.sh
python3 setup.py install
This worked fine, without any errors.
2. data downloading and preprocessing
bash scripts/download_pdb70.sh data
bash scripts/download_mmseqs_dbs.sh data # downloads .tar files
bash scripts/prep_mmseqs_dbs.sh data # unpacks and preps the databases
Here, everything seemed to work fine except for the last command, which printed: Cannot create temporary directory data/tmp/ However, as far as I could tell, all the files were in the right place.
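Since the only message from the last command concerns the temporary directory, one quick check is whether `data/tmp` can be created and written to by the current user. The sketch below is only an illustration: `check_tmp_dir` is a hypothetical helper, not part of OpenFold or MMseqs2, and it assumes the relative path `data/tmp` resolves against the directory the scripts were run from.

```python
import os
import tempfile


def check_tmp_dir(path):
    """Return (ok, message) describing whether `path` is usable as a temp dir.

    Hypothetical helper for illustration; not part of OpenFold or MMseqs2.
    """
    try:
        # Create the directory (and any missing parents) if it is absent.
        os.makedirs(path, exist_ok=True)
    except OSError as exc:
        return False, f"cannot create {path}: {exc}"
    try:
        # Creating and deleting a file proves the directory is writable.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError as exc:
        return False, f"cannot write to {path}: {exc}"
    return True, f"{path} exists and is writable"


if __name__ == "__main__":
    # Path taken from the message printed by the prep script.
    ok, message = check_tmp_dir("data/tmp")
    print(message)
```

If this reports a failure, the message includes the underlying OS error (for example a permissions problem or a missing parent directory), which is more specific than the one-line message from the prep script.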
3. MSA generation: I have two sequences in the input.fasta file, and I provided the path to the mmseqs binary.
python3 scripts/precompute_alignments_mmseqs.py input.fasta data/mmseqs_dbs/ uniref30_2103_db alignment_dir /datadrive2/openfold/mmseqs/bin/mmseqs --hhsearch_binary_path /usr/bin/hhsearch --env_db colabfold_envdb_202108_db --pdb70 data/pdb70/pdb70
This command runs for a while until I get this error:
**Traceback (most recent call last):
File "scripts/precompute_alignments_mmseqs.py", line 175, in
Converting sequences [ Time for merging to qdb_h: 0h 0m 0s 0ms Time for merging to qdb: 0h 0m 0s 0ms Database type: Aminoacid Time for processing: 0h 0m 0s 547ms Create directory alignment_dir/tmp search alignment_dir/qdb data/mmseqs_dbs//uniref30_2103_db alignment_dir/res alignment_dir/tmp --num-iterations 3 --db-load-mode 0 -a -s 8 -e 0.1 --max-seqs 10000
prefilter alignment_dir/qdb data/mmseqs_dbs//uniref30_2103_db alignment_dir/tmp/10413594507028593022/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 8 -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 10000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 96 --compressed 0 -v 3
Query database size: 2 type: Aminoacid Estimated memory consumption: 188G Target database size: 29291635 type: Aminoacid Index table k-mer threshold: 96 at k-mer size 7 Index table: counting k-mers [=================================================================] 29.29M 16s 290ms Index table: Masked residues: 253951524 Index table: fill [=================================================================] 29.29M 25s 254ms Index statistics Entries: 6241392740 DB size: 45479 MB Avg k-mer size: 4.876088 Top 10 k-mers LAMHETP 13262 FLNSHRT 11141 KSFANHE 8776 AYITSTG 8484 LLGPGKT 8019 LAGAHNN 6286 FGGSSYL 5892 RGRELIE 5346 LNAEAAG 5328 LYLQAAW 5269 Time for index table init: 0h 0m 50s 0ms Process prefiltering step 1 of 1
k-mer similarity threshold: 96 Starting prefiltering scores calculation (step 1 of 1) Query db start 1 to 2 Target db start 1 to 29291635 [=================================================================] 2 0s 9ms
34119.765188 k-mers per position 48674383 DB matches per sequence 0 overflows 10000 sequences passed prefiltering per query sequence 10000 median result list length 0 sequences with 0 size result lists Time for merging to pref_0: 0h 0m 0s 0ms Time for processing: 0h 0m 54s 622ms align alignment_dir/qdb data/mmseqs_dbs//uniref30_2103_db alignment_dir/tmp/10413594507028593022/pref_0 alignment_dir/tmp/10413594507028593022/aln_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 1 --alignment-mode 2 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.1 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 1 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 96 --compressed 0 -v 3
Compute score only Query database size: 2 type: Aminoacid Target database size: 29291635 type: Aminoacid Calculation of alignments [=================================================================] 2 0s 305ms Time for merging to aln_0: 0h 0m 0s 0ms 20000 alignments calculated 9025 sequence pairs passed the thresholds (0.451250 of overall calculated) 4512.500000 hits per query sequence Time for processing: 0h 0m 1s 909ms result2profile alignment_dir/qdb data/mmseqs_dbs//uniref30_2103_db alignment_dir/tmp/10413594507028593022/aln_0 alignment_dir/tmp/10413594507028593022/profile_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -e 0.1 --mask-profile 1 --e-profile 0.1 --comp-bias-corr 1 --comp-bias-corr-scale 1 --wg 0 --allow-deletion 0 --filter-msa 1 --filter-min-enable 0 --max-seq-id 0.9 --qid '0.0' --qsc -20 --cov 0 --diff 1000 --pseudo-cnt-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --db-load-mode 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --gap-pc 10 --threads 96 --compressed 0 -v 3
Query database size: 2 type: Aminoacid Target database size: 29291635 type: Aminoacid [=================================================================] 2 0s 19ms Time for merging to profile_0: 0h 0m 0s 0ms Time for processing: 0h 0m 1s 773ms prefilter alignment_dir/tmp/10413594507028593022/profile_0 data/mmseqs_dbs//uniref30_2103_db alignment_dir/tmp/10413594507028593022/pref_tmp_1 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 8 -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 10000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 96 --compressed 0 -v 3
Query database size: 2 type: Profile Estimated memory consumption: 188G Target database size: 29291635 type: Aminoacid Index table k-mer threshold: 0 at k-mer size 7 Index table: counting k-mers [=================================================================] 29.29M 15s 204ms Index table: Masked residues: 321079962 Index table: fill [=================================================================] 29.29M 23s 445ms Index statistics Entries: 6172485125 DB size: 45084 MB Avg k-mer size: 4.822254 Top 10 k-mers LAMHETP 13247 FLNSHRT 11133 KSFANHE 8765 AYITSTG 8477 LLGPGKT 8019 LAGAHNN 6284 FGGSSYL 5890 LYLQAAW 5272 SSSSSSS 4521 GRFVVEV 4201 Time for index table init: 0h 0m 45s 581ms Process prefiltering step 1 of 1
k-mer similarity threshold: 94 Starting prefiltering scores calculation (step 1 of 1) Query db start 1 to 2 Target db start 1 to 29291635 [=================================================================] 2 0s 3ms
10481.630242 k-mers per position 13094781 DB matches per sequence 0 overflows 10000 sequences passed prefiltering per query sequence 10000 median result list length 0 sequences with 0 size result lists Time for merging to pref_tmp_1: 0h 0m 0s 0ms Time for processing: 0h 0m 47s 236ms subtractdbs alignment_dir/tmp/10413594507028593022/pref_tmp_1 alignment_dir/tmp/10413594507028593022/aln_0 alignment_dir/tmp/10413594507028593022/pref_1 --threads 96 --e-profile 0.1 -e 0.1 --compressed 0 -v 3
subtractdbs alignment_dir/tmp/10413594507028593022/pref_tmp_1 alignment_dir/tmp/10413594507028593022/aln_0 alignment_dir/tmp/10413594507028593022/pref_1 --threads 96 --e-profile 0.1 -e 0.1 --compressed 0 -v 3
Remove alignment_dir/tmp/10413594507028593022/aln_0 ids from alignment_dir/tmp/10413594507028593022/pref_tmp_1 [=================================================================] 2 0s 15ms Time for merging to pref_1: 0h 0m 0s 1ms Time for processing: 0h 0m 0s 90ms rmdb alignment_dir/tmp/10413594507028593022/pref_tmp_1
Time for processing: 0h 0m 0s 0ms align alignment_dir/tmp/10413594507028593022/profile_0 data/mmseqs_dbs//uniref30_2103_db alignment_dir/tmp/10413594507028593022/pref_1 alignment_dir/tmp/10413594507028593022/aln_tmp_1 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 1 --alignment-mode 2 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.1 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 96 --compressed 0 -v 3
Compute score, coverage and sequence identity Query database size: 2 type: Profile Target database size: 29291635 type: Aminoacid Calculation of alignments [=================================================================] 2 0s 378ms Time for merging to aln_tmp_1: 0h 0m 0s 0ms 11762 alignments calculated 2505 sequence pairs passed the thresholds (0.212974 of overall calculated) 1252.500000 hits per query sequence Time for processing: 0h 0m 1s 392ms mergedbs alignment_dir/tmp/10413594507028593022/profile_0 alignment_dir/tmp/10413594507028593022/aln_1 alignment_dir/tmp/10413594507028593022/aln_0 alignment_dir/tmp/10413594507028593022/aln_tmp_1
Merging the results to alignment_dir/tmp/10413594507028593022/aln_1 [=================================================================] 2 0s 0ms Time for merging to aln_1: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 456ms rmdb alignment_dir/tmp/10413594507028593022/aln_0
Time for processing: 0h 0m 0s 0ms rmdb alignment_dir/tmp/10413594507028593022/aln_tmp_1
Time for processing: 0h 0m 0s 0ms result2profile alignment_dir/tmp/10413594507028593022/profile_0 data/mmseqs_dbs//uniref30_2103_db alignment_dir/tmp/10413594507028593022/aln_1 alignment_dir/tmp/10413594507028593022/profile_1 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -e 0.1 --mask-profile 1 --e-profile 0.1 --comp-bias-corr 1 --comp-bias-corr-scale 1 --wg 0 --allow-deletion 0 --filter-msa 1 --filter-min-enable 0 --max-seq-id 0.9 --qid '0.0' --qsc -20 --cov 0 --diff 1000 --pseudo-cnt-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --db-load-mode 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --gap-pc 10 --threads 96 --compressed 0 -v 3
Query database size: 2 type: Profile Target database size: 29291635 type: Aminoacid [=================================================================] 2 0s 31ms Time for merging to profile_1: 0h 0m 0s 0ms Time for processing: 0h 0m 1s 19ms prefilter alignment_dir/tmp/10413594507028593022/profile_1 data/mmseqs_dbs//uniref30_2103_db alignment_dir/tmp/10413594507028593022/pref_tmp_2 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 8 -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 10000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 96 --compressed 0 -v 3
Query database size: 2 type: Profile Estimated memory consumption: 188G Target database size: 29291635 type: Aminoacid Index table k-mer threshold: 0 at k-mer size 7 Index table: counting k-mers [=================================================================] 29.29M 14s 958ms Index table: Masked residues: 321079962 Index table: fill [=================================================================] 29.29M 22s 623ms Index statistics Entries: 6172485125 DB size: 45084 MB Avg k-mer size: 4.822254 Top 10 k-mers LAMHETP 13247 FLNSHRT 11133 KSFANHE 8765 AYITSTG 8477 LLGPGKT 8019 LAGAHNN 6284 FGGSSYL 5890 LYLQAAW 5272 SSSSSSS 4521 GRFVVEV 4201 Time for index table init: 0h 0m 44s 593ms Process prefiltering step 1 of 1
k-mer similarity threshold: 94 Starting prefiltering scores calculation (step 1 of 1) Query db start 1 to 2 Target db start 1 to 29291635 [=================================================================] 2 0s 5ms
9149.571371 k-mers per position 10949830 DB matches per sequence 0 overflows 10000 sequences passed prefiltering per query sequence 10000 median result list length 0 sequences with 0 size result lists Time for merging to pref_tmp_2: 0h 0m 0s 0ms Time for processing: 0h 0m 45s 932ms subtractdbs alignment_dir/tmp/10413594507028593022/pref_tmp_2 alignment_dir/tmp/10413594507028593022/aln_1 alignment_dir/tmp/10413594507028593022/pref_2 --threads 96 --e-profile 0.1 -e 0.1 --compressed 0 -v 3
subtractdbs alignment_dir/tmp/10413594507028593022/pref_tmp_2 alignment_dir/tmp/10413594507028593022/aln_1 alignment_dir/tmp/10413594507028593022/pref_2 --threads 96 --e-profile 0.1 -e 0.1 --compressed 0 -v 3
Remove alignment_dir/tmp/10413594507028593022/aln_1 ids from alignment_dir/tmp/10413594507028593022/pref_tmp_2 [=================================================================] 2 0s 16ms Time for merging to pref_2: 0h 0m 0s 1ms Time for processing: 0h 0m 0s 73ms rmdb alignment_dir/tmp/10413594507028593022/pref_tmp_2
Time for processing: 0h 0m 0s 0ms align alignment_dir/tmp/10413594507028593022/profile_1 data/mmseqs_dbs//uniref30_2103_db alignment_dir/tmp/10413594507028593022/pref_2 alignment_dir/tmp/10413594507028593022/aln_tmp_2 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 1 --alignment-mode 2 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.1 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 96 --compressed 0 -v 3
Compute score, coverage and sequence identity Query database size: 2 type: Profile Target database size: 29291635 type: Aminoacid Calculation of alignments [=================================================================] 2 0s 517ms Time for merging to aln_tmp_2: 0h 0m 0s 0ms 9712 alignments calculated 1671 sequence pairs passed the thresholds (0.172055 of overall calculated) 835.500000 hits per query sequence Time for processing: 0h 0m 1s 372ms mergedbs alignment_dir/tmp/10413594507028593022/profile_1 alignment_dir/res alignment_dir/tmp/10413594507028593022/aln_1 alignment_dir/tmp/10413594507028593022/aln_tmp_2
Merging the results to alignment_dir/res [=================================================================] 2 0s 0ms Time for merging to res: 0h 0m 0s 0ms Time for processing: 0h 0m 0s 1ms rmdb alignment_dir/tmp/10413594507028593022/aln_1
Time for processing: 0h 0m 0s 0ms rmdb alignment_dir/tmp/10413594507028593022/aln_tmp_2
Time for processing: 0h 0m 0s 0ms expandaln alignment_dir/qdb data/mmseqs_dbs//uniref30_2103_db.idx alignment_dir/res data/mmseqs_dbs//uniref30_2103_db.idx alignment_dir/res_exp --db-load-mode 0 --expansion-mode 0 -e inf --expand-filter-clusters 1 --max-seq-id 0.95
stderr: Input data/mmseqs_dbs//uniref30_2103_db.idx does not exist**
I looked in the mmseqs_dbs folder: there is no such file, only one named uniref30_2103_db.index. I tried copying that file and renaming the copy to uniref30_2103_db.idx, but it still didn't work.
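To show exactly which files share the database prefix, and to tell `uniref30_2103_db.index` apart from the `uniref30_2103_db.idx` named in the error, a short listing can help. This is a minimal sketch: `list_db_files` and `missing_files` are hypothetical helpers, not part of OpenFold or MMseqs2, and the directory and prefix are simply the ones from the command in step 3.

```python
from pathlib import Path


def list_db_files(db_dir, prefix):
    """Return sorted names of entries in `db_dir` whose names start with `prefix`."""
    return sorted(p.name for p in Path(db_dir).iterdir() if p.name.startswith(prefix))


def missing_files(db_dir, names):
    """Return the subset of `names` that do not exist inside `db_dir`."""
    return [name for name in names if not (Path(db_dir) / name).exists()]


if __name__ == "__main__":
    db_dir = "data/mmseqs_dbs"  # database directory passed to the script
    if Path(db_dir).is_dir():
        for name in list_db_files(db_dir, "uniref30_2103_db"):
            print(name)
        # File name copied from the stderr line of the failing step.
        print("missing:", missing_files(db_dir, ["uniref30_2103_db.idx"]))
    else:
        print(f"{db_dir} is not a directory")
```

Run from the openfold directory, this prints every file that shares the prefix, which makes the listing easy to attach to a report like this one.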
4. I have CUDA installed, but I am not sure whether this step uses the GPU.
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0
And here is my Linux system info: Linux pretain2 5.15.0-1034-azure #41~20.04.1-Ubuntu SMP Sat Feb 11 17:02:42 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Any help would be appreciated. I ran OpenFold before and it worked; now I have tried to rerun it using the latest version, and I am not sure why I am getting this error.