Clustering performance issue: Foldseek exceeds expected runtime

Open YFeriel opened this issue 1 year ago • 2 comments

Hello Foldseek team, @martin-steinegger @milot-mirdita

I am currently using Foldseek to run a clustering, and I am running into problems with the runtime. Here is the setup and the process I followed:

  1. I downloaded the AlphaFold/UniProt database using the foldseek databases command.
  2. I concatenated this database with my own protein database, which contains approximately 700,000 structures.
  3. I ran the clustering on a compute node with 64 CPUs using the following command: foldseek cluster /data/foldseek/concat_db /data/cluster_results /localscratch/yferiel.38388460.0/tmp_clusters -k 7 --threads 64

Despite running on a node with 64 CPUs, the clustering has been running for over 7 days and has still not completed. The Foldseek article mentions that clustering on 64 CPUs typically takes about 5 days.

Additionally, on the compute cluster I am using, the maximum runtime per job is 7 days.

My questions are:

  1. Is there a way to accelerate the clustering process given my setup?
  2. Does Foldseek support parallelism, or is there a specific configuration I could try to leverage more CPUs effectively?
  3. I attempted adding mpirun -np 64, but the command didn’t work. Does Foldseek support MPI-based parallelism, or is there an alternative method to achieve better performance?

Any advice or suggestions would be greatly appreciated!

Thank you in advance for your help.

Foldseek Output (for bugs)

Create directory /localscratch/yferiel.38388460.0/tmp_clusters cluster /data/foldseek/concat_db /data/cluster_results /localscratch/yferiel.38388460.0/tmp_clusters -k 7 --threads 64

MMseqs Version: 0dd4b7f27459d9e1d1bd8e01f97bcece8ce0dd39 Substitution matrix aa:3di.out,nucl:3di.out Seed substitution matrix aa:3di.out,nucl:3di.out Sensitivity 4 k-mer length 7 Target search mode 0 k-score seq:2147483647,prof:2147483647 Max sequence length 65535 Max results per query 1000 Split database 0 Split mode 2 Split memory limit 0 Coverage threshold 0.8 Coverage mode 0 Compositional bias 0 Compositional bias 1 Diagonal scoring true Exact k-mer matching 0 Mask residues 0 Mask residues probability 0.9 Mask lower case residues 1 Minimum diagonal score 30 Selected taxa Spaced k-mers 1 Preload mode 0 Spaced k-mer pattern Local temporary path Threads 64 Compressed 0 Verbosity 3 TMscore threshold 0 TMscore threshold mode 0 LDDT threshold 0 Sort by structure bit score 0 Alignment type 2 Exact TMscore 0 Add backtrace false Alignment mode 3 Alignment mode 0 E-value threshold 0.01 Seq. id. threshold 0 Min alignment length 0 Seq. id. mode 0 Alternative alignments 0 Max reject 2147483647 Max accept 2147483647 Gap open cost aa:10,nucl:10 Gap extension cost aa:1,nucl:1 TMalign hit order 0 TMalign fast 1 Cluster mode 0 Max connected component depth 1000 Similarity type 2 Weight file name Cluster Weight threshold 0.9 Single step clustering false Cascaded clustering steps 3 Cluster reassign false Remove temporary files false Force restart with latest tmp false MPI runner k-mers per sequence 300 Scale k-mers per sequence aa:0.000,nucl:0.200 Adjust k-mer length false Shift hash 67 Include only extendable false Skip repeating k-mers false Rescore mode 0 Remove hits by seq. id. and coverage false Sort results 0

Set cluster sensitivity to -s 8.000000 Set cluster mode SET COVER Set cluster iterations to 3 kmermatcher /data/foldseek/concat_db_ss /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pref --sub-mat 'aa:3di.out,nucl:3di.out' --alph-size aa:21,nucl:5 --min-seq-id 0 --kmer-per-seq 300 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 1 --cov-mode 0 -k 7 -c 0.8 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 64 --compressed 0 -v 3 --cluster-weight-threshold 0.9

kmermatcher /data/foldseek/concat_db_ss /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pref --sub-mat 'aa:3di.out,nucl:3di.out' --alph-size aa:21,nucl:5 --min-seq-id 0 --kmer-per-seq 300 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 1 --cov-mode 0 -k 7 -c 0.8 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 64 --compressed 0 -v 3 --cluster-weight-threshold 0.9

Database size: 215346985 type: Aminoacid

Not enough memory to process at once need to split [=================================================================] 215.35M 3m 3s 870ms Process file into 4 parts Generate k-mers list for 1 split [=================================================================] 215.35M 2m 23s 573ms Sort kmer 0h 5m 47s 225ms Sort by rep. sequence 0h 0m 7s 440ms Generate k-mers list for 2 split [=================================================================] 215.35M 2m 49s 529ms Sort kmer 0h 5m 37s 749ms Sort by rep. sequence 0h 0m 8s 618ms Generate k-mers list for 3 split [=================================================================] 215.35M 2m 51s 697ms Sort kmer 0h 5m 15s 714ms Sort by rep. sequence 0h 0m 13s 360ms Generate k-mers list for 4 split [=================================================================] 215.35M 2m 29s 413ms Sort kmer 0h 1m 42s 200ms Sort by rep. sequence 0h 0m 11s 510ms Merge splits ... Time for fill: 0h 10m 4s 90ms Time for merging to pref: 0h 0m 0s 0ms Time for processing: 0h 51m 0s 760ms structurerescorediagonal /data/foldseek/concat_db /data/foldseek/concat_db /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pref /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pref_rescore1 --exact-tmscore 0 --tmscore-threshold 0 --tmscore-threshold-mode 0 --lddt-threshold 0 --alignment-type 2 --sub-mat 'aa:3di.out,nucl:3di.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.01 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 0 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 --zdrop 40 --threads 64 --compressed 0 -v 3

[=================================================================] 215.35M 31h 16m 46s 508ms Time for merging to pref_rescore1: 0h 2m 12s 27ms Time for processing: 31h 20m 26s 320ms clust /data/foldseek/concat_db /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pref_rescore1 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pre_clust --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 64 --compressed 0 -v 3 --cluster-weight-threshold 0.9

Clustering mode: Set Cover [=================================================================] 215.35M 3m 51s 449ms Sort entries Find missing connections Found 184261107 new connections. Reconstruct initial order [=================================================================] 215.35M 3m 48s 30ms Add missing connections [=================================================================] 215.35M 32s 481ms

Time for read in: 0h 8m 49s 690ms Total time: 0h 10m 14s 537ms

Size of the sequence database: 215346985 Size of the alignment database: 215346985 Number of clusters: 149526165

Writing results 0h 0m 16s 890ms Time for merging to pre_clust: 0h 0m 0s 1ms Time for processing: 0h 11m 6s 481ms createsubdb /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/order_redundancy /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pref /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pref_filter1 -v 3 --subdb-mode 1

Time for merging to pref_filter1: 0h 0m 0s 0ms Time for processing: 0h 1m 14s 873ms filterdb /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pref_filter1 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pref_filter2 --filter-file /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/order_redundancy --threads 64 --compressed 0 -v 3

Filtering using file(s) [=================================================================] 149.53M 2m 3s 200ms Time for merging to pref_filter2: 0h 1m 4s 417ms Time for processing: 0h 3m 50s 865ms structurealign /data/foldseek/concat_db /data/foldseek/concat_db /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pref_filter2 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/aln.linclust --tmscore-threshold 0 --tmscore-threshold-mode 0 --lddt-threshold 0 --sort-by-structure-bits 0 --alignment-type 2 --exact-tmscore 0 --sub-mat 'aa:3di.out,nucl:3di.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.01 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 0 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 --zdrop 40 --threads 64 --compressed 0 -v 3

[=================================================================] 149.53M 4h 56m 21s 768ms Time for merging to aln.linclust: 0h 1m 26s 176ms Time for processing: 5h 38m 5s 930ms createsubdb /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/order_redundancy /data/foldseek/concat_db /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pre_clustered_seqs -v 3 --subdb-mode 1

Time for merging to pre_clustered_seqs: 0h 0m 0s 0ms Time for processing: 0h 1m 45s 100ms clust /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pre_clustered_seqs /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/aln.linclust /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/clust.linclust --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 64 --compressed 0 -v 3 --cluster-weight-threshold 0.9

Clustering mode: Set Cover [=================================================================] 149.53M 1m 55s 192ms Sort entries Find missing connections Found 173934104 new connections. Reconstruct initial order [=================================================================] 149.53M 1m 51s 311ms Add missing connections [=================================================================] 149.53M 32s 135ms

Time for read in: 0h 4m 47s 830ms Total time: 0h 5m 54s 232ms

Size of the sequence database: 149526165 Size of the alignment database: 149526165 Number of clusters: 111807234

Writing results 0h 0m 12s 545ms Time for merging to clust.linclust: 0h 0m 0s 0ms Time for processing: 0h 6m 30s 662ms mergeclusters /data/foldseek/concat_db /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/clu_redundancy /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pre_clust /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/clust.linclust --threads 64 --compressed 0 -v 3

Clustering step 1 [=================================================================] 149.53M 34s 732ms Clustering step 2 [=================================================================] 111.81M 1m 0s 239ms Write merged clustering [=================================================================] 215.35M 1m 15s 523ms Time for merging to clu_redundancy: 0h 0m 51s 558ms Time for processing: 0h 2m 55s 62ms createsubdb /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/clu_redundancy /data/foldseek/concat_db_ss /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step_redundancy_ss -v 3 --subdb-mode 1

Time for merging to input_step_redundancy_ss: 0h 0m 0s 0ms Time for processing: 0h 1m 8s 838ms createsubdb /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/clu_redundancy /data/foldseek/concat_db_ca /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step_redundancy_ca -v 3 --subdb-mode 1

Time for merging to input_step_redundancy_ca: 0h 0m 0s 0ms Time for processing: 0h 1m 15s 672ms createsubdb /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/clu_redundancy /data/foldseek/concat_db /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step_redundancy -v 3 --subdb-mode 1

Time for merging to input_step_redundancy: 0h 0m 0s 0ms Time for processing: 0h 1m 8s 961ms prefilter /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step_redundancy_ss /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step_redundancy_ss /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pref_step0 --sub-mat 'aa:3di.out,nucl:3di.out' --seed-sub-mat 'aa:3di.out,nucl:3di.out' -s 1 -k 7 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 100 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 0 --comp-bias-corr-scale 1 --diag-score 0 --exact-kmer-matching 0 --mask 0 --mask-prob 0.9 --mask-lower-case 1 --min-ungapped-score 0 --add-self-matches 1 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 64 --compressed 0 -v 3

Query database size: 111807234 type: Aminoacid Target split mode. Searching through 3 splits Estimated memory consumption: 182G Target database size: 111807234 type: Aminoacid Process prefiltering step 1 of 3

Index table k-mer threshold: 185 at k-mer size 7 Index table: counting k-mers [=================================================================] 37.38M 53s 721ms Index table: Masked residues: 0 Index table: fill [=================================================================] 37.38M 6s 375ms Index statistics Entries: 420189516 DB size: 12169 MB Avg k-mer size: 0.328273 Top 10 k-mers GQYYGNY 124540 AAEEEDP 93786 KIIIWDP 90430 LFEEAPS 68917 IWWDDKI 63080 WDDQKTK 60986 LFEEEPS 59563 FEEEAPV 52757 YEEQDSQ 51063 EYYAALV 48027 Time for index table init: 0h 1m 9s 963ms k-mer similarity threshold: 185 Starting prefiltering scores calculation (step 1 of 3) Query db start 1 to 111807234 Target db start 1 to 37383649 [=================================================================] 111.81M 2h 19m 3s 792ms

2.034511 k-mers per position 261116 DB matches per sequence 33622 overflows 24 sequences passed prefiltering per query sequence 1 median result list length 39281587 sequences with 0 size result lists Time for merging to pref_step0_tmp_0: 0h 0m 55s 144ms Time for merging to pref_step0_tmp_0_tmp: 0h 1m 39s 845ms Process prefiltering step 2 of 3

Index table k-mer threshold: 185 at k-mer size 7 Index table: counting k-mers [=================================================================] 37.22M 1m 3s 106ms Index table: Masked residues: 0 Index table: fill [=================================================================] 37.22M 5s 855ms Index statistics Entries: 420927068 DB size: 12174 MB Avg k-mer size: 0.328849 Top 10 k-mers GQYYGNY 124192 AAEEEDP 94425 KIIIWDP 91462 LFEEAPS 69744 IWWDDKI 63734 WDDQKTK 60864 LFEEEPS 59588 FEEEAPV 53786 YEEQDSQ 51530 EYYAALV 47833 Time for index table init: 0h 1m 19s 949ms k-mer similarity threshold: 185 Starting prefiltering scores calculation (step 2 of 3) Query db start 1 to 111807234 Target db start 37383650 to 74608124 [=================================================================] 111.81M 2h 13m 12s 977ms

2.034511 k-mers per position 264176 DB matches per sequence 35256 overflows 24 sequences passed prefiltering per query sequence 1 median result list length 39536708 sequences with 0 size result lists Time for merging to pref_step0_tmp_1: 0h 0m 54s 780ms Time for merging to pref_step0_tmp_1_tmp: 0h 2m 8s 350ms Process prefiltering step 3 of 3

Index table k-mer threshold: 185 at k-mer size 7 Index table: counting k-mers [=================================================================] 37.20M 31s 830ms Index table: Masked residues: 0 Index table: fill [=================================================================] 37.20M 7s 945ms Index statistics Entries: 422442059 DB size: 12182 MB Avg k-mer size: 0.330033 Top 10 k-mers GQYYGNY 125930 AAEEEDP 94585 KIIIWDP 91220 LFEEAPS 69973 IWWDDKI 63895 DDQIKIK 61549 WDDQKTK 61179 LFEEEPS 60583 FEEEAPV 53838 YEEQDSQ 51198 Time for index table init: 0h 0m 51s 326ms k-mer similarity threshold: 185 Starting prefiltering scores calculation (step 3 of 3) Query db start 1 to 111807234 Target db start 74608125 to 111807234 [=================================================================] 111.81M 2h 5m 42s 192ms

2.034511 k-mers per position 264625 DB matches per sequence 35306 overflows 24 sequences passed prefiltering per query sequence 1 median result list length 39606615 sequences with 0 size result lists Time for merging to pref_step0_tmp_2: 0h 0m 58s 263ms Time for merging to pref_step0_tmp_2_tmp: 0h 1m 35s 906ms Merging 3 target splits to pref_step0 Preparing offsets for merging: 0h 0m 45s 262ms [=================================================================] 111.81M 8m 42s 849ms Time for merging to pref_step0: 0h 1m 3s 947ms Time for merging target splits: 0h 10m 45s 55ms Time for merging to pref_step0_tmp: 0h 5m 27s 737ms Time for processing: 7h 22m 37s 479ms structurealign /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step_redundancy /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step_redundancy /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pref_step0 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/aln_step0 --tmscore-threshold 0 --tmscore-threshold-mode 0 --lddt-threshold 0 --sort-by-structure-bits 0 --alignment-type 2 --exact-tmscore 0 --sub-mat 'aa:3di.out,nucl:3di.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.01 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 0 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 --zdrop 40 --threads 64 --compressed 0 -v 3

[=================================================================] 111.81M 22h 20m 10s 590ms Time for merging to aln_step0: 0h 1m 19s 609ms Time for processing: 22h 32m 41s 836ms clust /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step_redundancy /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/aln_step0 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/clu_step0 --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 64 --compressed 0 -v 3 --cluster-weight-threshold 0.9

Clustering mode: Set Cover [=================================================================] 111.81M 28m 42s 739ms Sort entries Find missing connections Found 2135109749 new connections. Reconstruct initial order [=================================================================] 111.81M 56m 48s 981ms Add missing connections [=================================================================] 111.81M 50m 12s 798ms

Time for read in: 2h 44m 46s 139ms Total time: 2h 53m 15s 547ms

Size of the sequence database: 111807234 Size of the alignment database: 111807234 Number of clusters: 68722297

Writing results 0h 0m 7s 955ms Time for merging to clu_step0: 0h 0m 0s 3ms Time for processing: 2h 53m 43s 479ms createsubdb /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/clu_step0 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step_redundancy_ss /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step1_ss -v 3 --subdb-mode 1

Time for merging to input_step1_ss: 0h 0m 0s 0ms Time for processing: 0h 0m 28s 846ms createsubdb /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/clu_step0 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step_redundancy_ca /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step1_ca -v 3 --subdb-mode 1

Time for merging to input_step1_ca: 0h 0m 0s 0ms Time for processing: 0h 0m 29s 516ms createsubdb /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/clu_step0 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step_redundancy /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step1 -v 3 --subdb-mode 1

Time for merging to input_step1: 0h 0m 0s 0ms Time for processing: 0h 0m 28s 718ms prefilter /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step1_ss /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step1_ss /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pref_step1 --sub-mat 'aa:3di.out,nucl:3di.out' --seed-sub-mat 'aa:3di.out,nucl:3di.out' -s 4.5 -k 7 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 200 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 0.15 --diag-score 1 --exact-kmer-matching 0 --mask 0 --mask-prob 0.9 --mask-lower-case 1 --min-ungapped-score 30 --add-self-matches 1 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 64 --compressed 0 -v 3

Query database size: 68722297 type: Aminoacid Target split mode. Searching through 2 splits Estimated memory consumption: 162G Target database size: 68722297 type: Aminoacid Process prefiltering step 1 of 2

Index table k-mer threshold: 146 at k-mer size 7 Index table: counting k-mers [=================================================================] 34.53M 44s 682ms Index table: Masked residues: 0 Index table: fill [=================================================================] 34.53M 11s 255ms Index statistics Entries: 1194650611 DB size: 16601 MB Avg k-mer size: 0.933321 Top 10 k-mers VLLLLLL 2225745 VSLSLSL 1571417 VSSSSSS 1553687 SSSSSSS 1224579 NVSVSSS 1215989 LLLLLLV 971880 NVSSSSS 712330 SVNVSSS 674536 SSSLLLV 652511 SSVSNSV 633430 Time for index table init: 0h 1m 10s 884ms k-mer similarity threshold: 146 Starting prefiltering scores calculation (step 1 of 2) Query db start 1 to 68722297 Target db start 1 to 34532558 [=================================================================] 68.72M 13h 43m 12s 619ms

15.006364 k-mers per position 967821 DB matches per sequence 8753 overflows 82 sequences passed prefiltering per query sequence 140 median result list length 11830310 sequences with 0 size result lists Time for merging to pref_step1_tmp_0: 0h 0m 39s 754ms Time for merging to pref_step1_tmp_0_tmp: 0h 3m 44s 305ms Process prefiltering step 2 of 2

Index table k-mer threshold: 146 at k-mer size 7 Index table: counting k-mers [=================================================================] 34.19M 3m 10s 175ms Index table: Masked residues: 0 Index table: fill [=================================================================] 34.19M 11s 187ms Index statistics Entries: 1203088172 DB size: 16649 MB Avg k-mer size: 0.939913 Top 10 k-mers VLLLLLL 2225385 VSLSLSL 1577075 VSSSSSS 1550918 SSSSSSS 1221518 NVSVSSS 1216445 LLLLLLV 971123 NVSSSSS 712243 SVNVSSS 679810 SSSLLLV 658947 SSVSNSV 633800 Time for index table init: 0h 3m 36s 725ms k-mer similarity threshold: 146 Starting prefiltering scores calculation (step 2 of 2) Query db start 1 to 68722297 Target db start 34532559 to 68722297 [=================================================================] 68.72M 17h 48m 4s 387ms

15.006364 k-mers per position 972847 DB matches per sequence 9403 overflows 82 sequences passed prefiltering per query sequence 140 median result list length 12322936 sequences with 0 size result lists Time for merging to pref_step1_tmp_1: 0h 0m 35s 622ms Time for merging to pref_step1_tmp_1_tmp: 0h 3m 50s 81ms Merging 2 target splits to pref_step1 Preparing offsets for merging: 0h 0m 26s 484ms [================================================================] 68.72M =14m 6s 739ms Time for merging to pref_step1: 0h 0m 37s 358ms Time for merging target splits: 0h 15m 28s 932ms Time for merging to pref_step1_tmp: 0h 7m 8s 47ms Time for processing: 32h 31m 28s 338ms structurealign /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step1 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step1 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pref_step1 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/aln_step1 --tmscore-threshold 0 --tmscore-threshold-mode 0 --lddt-threshold 0 --sort-by-structure-bits 0 --alignment-type 2 --exact-tmscore 0 --sub-mat 'aa:3di.out,nucl:3di.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.01 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 0 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 --zdrop 40 --threads 64 --compressed 0 -v 3

[=================================================================] 68.72M 21h 10m 0s 517ms Time for merging to aln_step1: 0h 0m 50s 864ms Time for processing: 25h 45m 11s 582ms clust /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step1 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/aln_step1 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/clu_step1 --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 64 --compressed 0 -v 3 --cluster-weight-threshold 0.9

Clustering mode: Set Cover [=================================================================] 68.72M 43m 19s 831ms Sort entries Find missing connections Found 2071414253 new connections. Reconstruct initial order [=================================================================] 68.72M 49m 16s 343ms Add missing connections [=================================================================] 68.72M 44m 28s 611ms

Time for read in: 3h 2m 24s 369ms Total time: 3h 10m 50s 583ms

Size of the sequence database: 68722297 Size of the alignment database: 68722297 Number of clusters: 34529705

Writing results 0h 0m 4s 223ms Time for merging to clu_step1: 0h 0m 0s 24ms Time for processing: 3h 11m 8s 448ms createsubdb /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/clu_step1 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step1_ss /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step2_ss -v 3 --subdb-mode 1

Time for merging to input_step2_ss: 0h 0m 0s 0ms Time for processing: 0h 0m 16s 694ms createsubdb /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/clu_step1 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step1_ca /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step2_ca -v 3 --subdb-mode 1

Time for merging to input_step2_ca: 0h 0m 0s 0ms Time for processing: 0h 0m 17s 274ms createsubdb /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/clu_step1 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step1 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step2 -v 3 --subdb-mode 1

Time for merging to input_step2: 0h 0m 0s 0ms Time for processing: 0h 0m 16s 632ms prefilter /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step2_ss /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step2_ss /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pref_step2 --sub-mat 'aa:3di.out,nucl:3di.out' --seed-sub-mat 'aa:3di.out,nucl:3di.out' -s 8 -k 7 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 1000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 0.15 --diag-score 1 --exact-kmer-matching 0 --mask 0 --mask-prob 0.9 --mask-lower-case 1 --min-ungapped-score 30 --add-self-matches 1 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 64 --compressed 0 -v 3

Query database size: 34529705 type: Aminoacid Estimated memory consumption: 150G Target database size: 34529705 type: Aminoacid Index table k-mer threshold: 107 at k-mer size 7 Index table: counting k-mers [=================================================================] 34.53M 10m 17s 308ms Index table: Masked residues: 0 Index table: fill [=================================================================] 34.53M 19s 801ms Index statistics Entries: 2473273389 DB size: 23917 MB Avg k-mer size: 1.932245 Top 10 k-mers DDDDDDD 14776321 DDDDDDP 13093909 DDDDDPP 11500036 DDDDPDD 9107859 DDDPDDD 8270776 DDDDPPP 7786765 DDDPPPP 6854484 DDPPPPP 5727000 VLVLVVV 5555350 SVSVVVV 5227077 Time for index table init: 0h 11m 18s 790ms Hard disk might not have enough free space (343G left).The prefilter result might need up to 1T. Process prefiltering step 1 of 1

k-mer similarity threshold: 107 Starting prefiltering scores calculation (step 1 of 1) Query db start 1 to 34529705 Target db start 1 to 34529705 [=============================
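To see where the runtime went, the "Time for processing" values reported for the longer modules in the log above can be tallied. The following is a minimal sketch: the durations are copied from the log (milliseconds dropped), the stage labels are illustrative names rather than Foldseek output, and the short createsubdb, filterdb and mergeclusters steps are left out.

```shell
#!/usr/bin/env bash
# Durations copied from the "Time for processing" lines in the log above.
# The labels in the first column are illustrative, not Foldseek output.
stage_times() {
cat <<'EOF'
kmermatcher 0h 51m 0s
structurerescorediagonal 31h 20m 26s
clust_pre 0h 11m 6s
structurealign_linclust 5h 38m 5s
clust_linclust 0h 6m 30s
prefilter_step0 7h 22m 37s
structurealign_step0 22h 32m 41s
clust_step0 2h 53m 43s
prefilter_step1 32h 31m 28s
structurealign_step1 25h 45m 11s
clust_step1 3h 11m 8s
EOF
}

# Convert each "Xh Ym Zs" triple to hours and print a running total.
# awk reads the leading number of "31h", "20m", "26s" when coerced with +0.
summarize() {
  awk '{
    sec = ($2 + 0) * 3600 + ($3 + 0) * 60 + ($4 + 0)
    total += sec
    printf "%-26s %6.1f h\n", $1, sec / 3600
  }
  END {
    printf "%-26s %6.1f h (%.1f days)\n", "total", total / 3600, total / 86400
  }'
}

stage_times | summarize
```

The total covers only the modules that had finished when the log was captured; the final prefilter step at -s 8 was still in progress.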

YFeriel, Dec 19 '24 15:12

Yes this kind of clustering takes time. It seems the prefilter did process quite a bit already.

[=============================

If you want to speed up the process you could pre-cluster it first with MMseqs2 and then cluster the representatives using Foldseek.
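One possible way to lay out this two-stage approach is sketched below. It is not a tested pipeline: the sequence-identity and coverage thresholds (0.5 and 0.9) and the output names are placeholders, and it assumes that the Foldseek database at /data/foldseek/concat_db can be read directly by mmseqs as an amino-acid sequence database. The createsubdb and mergeclusters calls mirror the way those modules are invoked in the log above. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print each command instead of executing it unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: sequence pre-clustering, then structural clustering
# of the representatives, then merging both levels into one clustering.
precluster_then_foldseek() {
  local db="$1" out="$2" tmp="$3" threads="${4:-64}"
  local suffix

  run mkdir -p "$tmp/mmseqs" "$tmp/foldseek"

  # 1. Sequence-level pre-clustering (thresholds are placeholders).
  run mmseqs cluster "$db" "${out}_seqclu" "$tmp/mmseqs" \
    --min-seq-id 0.5 -c 0.9 --threads "$threads"

  # 2. Representative-only sub-databases for sequences, 3Di states and
  #    C-alpha coordinates, as the cluster workflow does in the log.
  for suffix in "" "_ss" "_ca"; do
    run foldseek createsubdb "${out}_seqclu" "${db}${suffix}" \
      "${out}_rep${suffix}" --subdb-mode 1
  done

  # 3. Structural clustering of the representatives only.
  run foldseek cluster "${out}_rep" "${out}_repclu" "$tmp/foldseek" \
    --threads "$threads"

  # 4. Map the structural clusters back onto all original entries.
  run foldseek mergeclusters "$db" "${out}_merged" \
    "${out}_seqclu" "${out}_repclu" --threads "$threads"
}

precluster_then_foldseek /data/foldseek/concat_db /data/cluster_results \
  /localscratch/yferiel.38388460.0/tmp_clusters
```

How much time this saves depends on how many entries the sequence-level step collapses; if few structures share detectable sequence similarity, the representative set stays close to the full database.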

martin-steinegger, Dec 29 '24 10:12

Dear @martin-steinegger

Thank you for your suggestion. I understand the rationale behind using MMseqs2 for pre-clustering, but I am facing a particular challenge with my dataset. The proteins I am studying have very low sequence homology, which is precisely why I opted to use Foldseek—to explore structural homology instead of sequence similarity.

Given this, I am wondering: would it still be meaningful to use MMseqs2 for pre-clustering in this context, knowing that the sequence homology is negligible? Would the pre-clustering step provide any advantage when applied to such a dataset? Your insight on this matter would be greatly appreciated.

Best regards

YFeriel, Dec 30 '24 03:12