MMseqs2
High inter-cluster identity
Expected Behavior
When `--min-seq-id` is set to, e.g., 0.5, I would expect the sequence identity between members of different clusters to be bounded above by 50%.
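To make this expectation concrete, here is a minimal sketch of the property I am checking. The function name `cross_cluster_violations`, the toy sequence IDs, and the identity values are all hypothetical and only illustrate the check; they are not taken from my data.

```python
from itertools import combinations


def cross_cluster_violations(cluster_of, identity, threshold):
    """Return (a, b, identity) for pairs in different clusters above threshold.

    cluster_of: dict mapping sequence ID -> cluster ID
    identity:   dict mapping an (ID, ID) pair -> fractional identity
    """
    violations = []
    for a, b in combinations(sorted(cluster_of), 2):
        if cluster_of[a] == cluster_of[b]:
            continue  # same cluster: high identity is expected here
        pid = identity.get((a, b), identity.get((b, a)))
        if pid is not None and pid > threshold:
            violations.append((a, b, pid))
    return violations


# Hypothetical toy input: seqB and seqC sit in different clusters
# but share 60% identity, which breaks a 50% bound.
toy_clusters = {"seqA": "c1", "seqB": "c1", "seqC": "c2"}
toy_identity = {
    ("seqA", "seqB"): 0.8,
    ("seqA", "seqC"): 0.3,
    ("seqB", "seqC"): 0.6,
}
violations = cross_cluster_violations(toy_clusters, toy_identity, 0.5)
print(violations)
```

If the expectation held, this list would be empty for every pair of sequences drawn from different clusters.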
Current Behavior
Running blastp with sequences from one cluster against the sequences from all other clusters yields hits with high percent identity and low E-values.
Here is a table of results for the parameter combinations I have tried (TSV). No matter what I change, inter-cluster identity remains high:
--cluster-mode --cluster-steps --cluster-reassign --cov-mode -c -e -s --min-seq-id --threads --max-seqs --max-iterations --alignment-mode --similarity-type num_aligned mean_percid mean_e max_percid min_e
0.0 3.0 0.0 0.0 0.8 0.001 4.0 0.9 40.0 20.0 1000.0 3.0 2.0 20.0 0.2946001520957981 7.271040887500001e-25 0.4017857142857143 4.56243e-152
0.0 3.0 0.0 0.0 0.8 0.001 4.0 0.4 40.0 20.0 1000.0 3.0 2.0 19.0 0.2699960748418551 6.464939793952533e-09 0.3841336116910229 6.08884e-63
0.0 3.0 1.0 0.0 0.8 0.001 4.0 0.4 40.0 20.0 1000.0 3.0 2.0 19.0 0.2805825389935189 6.463308203666264e-09 0.379746835443038 4.30425e-156
0.0 3.0 1.0 0.0 0.8 0.001 4.0 0.4 40.0 200.0 1000.0 3.0 2.0 19.0 0.2684551720607656 6.4674134954082795e-09 0.365038560411311 1.55836e-137
1.0 3.0 1.0 0.0 0.8 0.001 4.0 0.4 40.0 200.0 1000.0 3.0 2.0 18.0 0.2448913043192062 6.821603006346129e-09 0.365038560411311 5.02351e-59
3.0 3.0 1.0 0.0 0.8 0.001 4.0 0.4 40.0 200.0 1000.0 3.0 2.0 20.0 0.275898345457947 2.438759428532657e-07 0.4268774703557312 4.97171e-133
2.0 3.0 1.0 0.0 0.8 0.001 4.0 0.4 40.0 200.0 1000.0 3.0 2.0 20.0 0.275898345457947 2.438759428532657e-07 0.4268774703557312 4.97171e-133
1.0 3.0 1.0 0.0 0.8 0.001 7.0 0.4 40.0 200.0 1000.0 3.0 2.0 18.0 0.2541259924867629 6.820936331055556e-09 0.3755458515283842 7.644030000000001e-61
1.0 3.0 1.0 0.0 0.8 0.001 7.0 0.4 40.0 200.0 1000.0 3.0 2.0 18.0 0.2541259924867629 6.820936331055556e-09 0.3755458515283842 7.644030000000001e-61
1.0 3.0 1.0 0.0 0.8 0.001 7.0 0.4 40.0 200.0 1000.0 3.0 2.0 18.0 0.25412599248676293 6.820936331055556e-09 0.37554585152838427 7.64403e-61
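For reading the table, a small sketch like the following may help. It takes the header and two of the rows above (copied verbatim) and compares `max_percid` with `--min-seq-id` for each run. The helper names `parse_results` and `exceeds_threshold` are made up for this example, and the comparison uses the script's own identity definition.

```python
HEADER = (
    "--cluster-mode --cluster-steps --cluster-reassign --cov-mode -c -e -s "
    "--min-seq-id --threads --max-seqs --max-iterations --alignment-mode "
    "--similarity-type num_aligned mean_percid mean_e max_percid min_e"
)
# Two rows copied from the table above (second and sixth data rows).
ROWS = [
    "0.0 3.0 0.0 0.0 0.8 0.001 4.0 0.4 40.0 20.0 1000.0 3.0 2.0 19.0 "
    "0.2699960748418551 6.464939793952533e-09 0.3841336116910229 6.08884e-63",
    "3.0 3.0 1.0 0.0 0.8 0.001 4.0 0.4 40.0 200.0 1000.0 3.0 2.0 20.0 "
    "0.275898345457947 2.438759428532657e-07 0.4268774703557312 4.97171e-133",
]


def parse_results(header, rows):
    """Turn whitespace-separated rows into dicts keyed by column name."""
    names = header.split()
    return [dict(zip(names, map(float, row.split()))) for row in rows]


def exceeds_threshold(record):
    """True if the best inter-cluster hit is above the clustering threshold."""
    return record["max_percid"] > record["--min-seq-id"]


records = parse_results(HEADER, ROWS)
flags = [exceeds_threshold(r) for r in records]
for record, flag in zip(records, flags):
    print(record["--cluster-mode"], record["max_percid"], flag)
```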
Steps to Reproduce (for bugs)
My scripts
import os
import shutil
import subprocess
import pandas as pd
from Bio import SeqIO
from Bio.Blast.NCBIXML import parse
# MMseqs clustering function
def run_mmseqs_clustering(db_path, output_dir, params):
    """Cluster db_path with `mmseqs cluster` and export the result as TSV."""
    # Always start from an empty output directory
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)
    os.makedirs(output_dir)
    cluster_out = os.path.join(output_dir, "mmseq_clu")
    tmp_dir = os.path.join(output_dir, "tmp")
    os.makedirs(tmp_dir, exist_ok=True)
    cmd = ["mmseqs", "cluster", db_path, cluster_out, tmp_dir] + params
    subprocess.run(cmd, check=True)
    # Convert to TSV
    tsv_out = os.path.join(output_dir, "mmseq_clu.tsv")
    subprocess.run(["mmseqs", "createtsv", db_path, db_path, cluster_out, tsv_out], check=True)
    return cluster_out, tsv_out
def params_to_cmd_args(params):
    """Flatten a {flag: value} dict into a list of command-line tokens."""
    args = []
    for key, value in params.items():
        args.append(str(key))
        args.append(str(value))
    return args
def local_blastp_query(input_fasta, db, output_path, exclusion_list):
    """Run blastp (XML output) against db, skipping the IDs in exclusion_list."""
    cmd = [
        'blastp', '-db', db, '-query', input_fasta,
        '-evalue', '1.0', '-outfmt', '5', '-out', output_path,
        '-num_threads', '32', '-word_size', '3',
        '-matrix', 'BLOSUM62', '-qcov_hsp_perc', '80', '-negative_seqidlist', exclusion_list
    ]
    subprocess.run(cmd, check=True)
def main():
    DB_PATH = './mms_smallDB'
    OUTPUT_DIR = './mmseqs_output'
    if os.path.exists(OUTPUT_DIR):
        shutil.rmtree(OUTPUT_DIR)
    # Input parameters for MMseqs
    MMSEQS_PARAMS = {
        '--cluster-mode': 1,
        '--cluster-steps': 3,
        '--cluster-reassign': 1,
        '--cov-mode': 0,
        '-c': 0.8,
        '-e': 0.001,
        '-s': 7,
        '--min-seq-id': 0.4,
        '--threads': 40,
        '--max-seqs': 200,
        '--max-iterations': 1000,
        '--alignment-mode': 3,
        '--similarity-type': 2
    }
    params = params_to_cmd_args(MMSEQS_PARAMS)
    # Cluster using MMseqs
    cluster_out, tsv_out = run_mmseqs_clustering(DB_PATH, OUTPUT_DIR, params)
    # Map sequences to their respective clusters
    cluster_mapping = {}
    with open(tsv_out, 'r') as f:
        for line in f:
            cluster, sequence = line.strip().split('\t')
            cluster_mapping[sequence] = cluster
    # Sample one sequence from each of 20 distinct clusters
    sampled_sequences = {}
    for sequence, cluster in cluster_mapping.items():
        if cluster not in sampled_sequences and len(sampled_sequences) < 20:
            sampled_sequences[cluster] = sequence
    # Create exclusion list: every member of a sampled cluster, so that
    # blastp can only report hits to sequences from other clusters
    exclusion_list = os.path.join(OUTPUT_DIR, "exclusion_list.txt")
    with open(exclusion_list, 'w') as f:
        for sequence, cluster in cluster_mapping.items():
            if cluster in sampled_sequences:
                f.write(f"{sequence}\n")
    # Write query sequences to a file;
    # the sequences have to be fetched from the original FASTA input
    index = SeqIO.index('./mmseqs_input.fasta', "fasta")
    query_sequences = os.path.join(OUTPUT_DIR, "blast_query_sequences.fasta")
    with open(query_sequences, 'w') as f:
        for _, sequence in sampled_sequences.items():
            f.write(f">{sequence}\n{index[sequence].seq}\n")
    # Blast the sampled sequences against the MMseqs clustered database
    blast_output = os.path.join(OUTPUT_DIR, "blast_output.xml")
    local_blastp_query(query_sequences, './blast_smallDB', blast_output, exclusion_list)
    # Parse the alignments
    es = []
    perc_identities = []
    with open(blast_output, 'r') as handle:
        for record in parse(handle):
            # get only the best alignment and hsp for each query
            # (there should only be one)
            try:
                alignment = record.alignments[0]
            except IndexError:
                es.append(None)
                perc_identities.append(None)
                continue
            hsp = alignment.hsps[0]
            e = hsp.expect
            # check that both query and subject are covered:
            # calculate the average coverage of the two sequences
            coverage = (hsp.query_end - hsp.query_start + 1 + hsp.sbjct_end - hsp.sbjct_start + 1) / (alignment.length + record.query_length)
            if coverage < 0.9:
                es.append(None)
                perc_identities.append(None)
                # skip this query so it is not counted a second time below
                continue
            # check that the first hsp of the first alignment is indeed the best;
            # separate loop names keep `hsp` pointing at the best hsp
            for other_align in record.alignments:
                for other_hsp in other_align.hsps:
                    if other_hsp.expect < e:
                        raise ValueError('Not the best hsp')
            es.append(e)
            # identity relative to the mean of subject and query lengths
            perc_identities.append(hsp.identities / ((alignment.length + record.query_length)/2))
    num_aligned = len([e for e in es if e is not None])
    # remove missing values
    es = [e for e in es if e is not None]
    perc_identities = [p for p in perc_identities if p is not None]
    mean_percid = sum(perc_identities)/len(perc_identities)
    mean_e = sum(es)/len(es)
    max_percid = max(perc_identities)
    min_e = min(es)
    # copy the parameters so the original dict is not modified
    result_dict = dict(MMSEQS_PARAMS)
    result_dict['num_aligned'] = num_aligned
    result_dict['mean_percid'] = mean_percid
    result_dict['mean_e'] = mean_e
    result_dict['max_percid'] = max_percid
    result_dict['min_e'] = min_e
    # load the current results (tsv) and append the new results
    results_path = os.path.join('./', "results.tsv")
    new_row = pd.DataFrame([result_dict])
    if os.path.exists(results_path):
        results = pd.read_csv(results_path, sep='\t')
        # DataFrame.append was removed in pandas 2.0; concat works in old and new versions
        results = pd.concat([results, new_row], ignore_index=True)
    else:
        results = new_row
    results.to_csv(results_path, sep='\t', index=False)

if __name__ == "__main__":
    main()
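For reference, the two quantities the script derives from each HSP can be written as small standalone functions. This is only a sketch of the arithmetic: the function names and the example numbers are hypothetical and do not come from my BLAST output.

```python
import math


def mean_coverage(q_start, q_end, s_start, s_end, q_len, s_len):
    """Aligned residues on query plus subject, over the summed sequence lengths."""
    aligned = (q_end - q_start + 1) + (s_end - s_start + 1)
    return aligned / (q_len + s_len)


def length_normalised_identity(identities, q_len, s_len):
    """Identical positions divided by the mean of the two full sequence lengths."""
    return identities / ((q_len + s_len) / 2)


# Hypothetical HSP: query residues 1-90 of 100, subject residues 11-100 of 100,
# with 45 identical positions.
cov = mean_coverage(1, 90, 11, 100, 100, 100)
pid = length_normalised_identity(45, 100, 100)
print(cov, pid)
```

Note that this identity is normalised by the mean full sequence length rather than by the number of aligned columns, so it need not match the identity that an alignment tool reports over the alignment itself.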
MMseqs Output (for bugs)
Here is one example output; I have run the script above with a number of different parameter combinations.
cluster ./mms_smallDB ./mmseqs_output/mmseq_clu ./mmseqs_output/tmp --cluster-mode 1 --cluster-steps 3 --cluster-reassign 1 --cov-mode 0 -c 0.8 -e 0.001 -s 7 --min-seq-id 0.4 --threads 40 --max-seqs 200 --max-iterations 1000 --alignment-mode 3 --similarity-type 2
MMseqs Version: 14.7e284
Substitution matrix aa:blosum62.out,nucl:nucleotide.out
Seed substitution matrix aa:VTML80.out,nucl:nucleotide.out
Sensitivity 7
k-mer length 0
k-score seq:2147483647,prof:2147483647
Alphabet size aa:21,nucl:5
Max sequence length 65535
Max results per query 200
Split database 0
Split mode 2
Split memory limit 0
Coverage threshold 0.8
Coverage mode 0
Compositional bias 1
Compositional bias 1
Diagonal scoring true
Exact k-mer matching 0
Mask residues 1
Mask residues probability 0.9
Mask lower case residues 0
Minimum diagonal score 15
Selected taxa
Include identical seq. id. false
Spaced k-mers 1
Preload mode 0
Pseudo count a substitution:1.100,context:1.400
Pseudo count b substitution:4.100,context:5.800
Spaced k-mer pattern
Local temporary path
Threads 40
Compressed 0
Verbosity 3
Add backtrace false
Alignment mode 3
Alignment mode 0
Allow wrapped scoring false
E-value threshold 0.001
Seq. id. threshold 0.4
Min alignment length 0
Seq. id. mode 0
Alternative alignments 0
Max reject 2147483647
Max accept 2147483647
Score bias 0
Realign hits false
Realign score bias -0.2
Realign max seqs 2147483647
Correlation score weight 0
Gap open cost aa:11,nucl:5
Gap extension cost aa:1,nucl:2
Zdrop 40
Rescore mode 0
Remove hits by seq. id. and coverage false
Sort results 0
Cluster mode 1
Max connected component depth 1000
Similarity type 2
Single step clustering false
Cascaded clustering steps 3
Cluster reassign true
Remove temporary files false
Force restart with latest tmp false
MPI runner
k-mers per sequence 21
Scale k-mers per sequence aa:0.000,nucl:0.200
Adjust k-mer length false
Shift hash 67
Include only extendable false
Skip repeating k-mers false
Connected component clustering produces less clusters in a single step clustering.
Please use --single-step-cluster
linclust ./mms_smallDB ./mmseqs_output/tmp/5351426679731834765/clu_redundancy ./mmseqs_output/tmp/5351426679731834765/linclust --cluster-mode 1 --max-iterations 1000 --similarity-type 2 --threads 40 --compressed 0 -v 3 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.001 --min-seq-id 0.4 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --alph-size aa:13,nucl:5 --kmer-per-seq 21 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 0 -k 0 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --rescore-mode 0 --filter-hits 0 --sort-results 0 --remove-tmp-files 0 --force-reuse 0
kmermatcher ./mms_smallDB ./mmseqs_output/tmp/5351426679731834765/linclust/262265298633898384/pref --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --alph-size aa:13,nucl:5 --min-seq-id 0.4 --kmer-per-seq 21 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 0 --cov-mode 0 -k 0 -c 0.8 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 40 --compressed 0 -v 3
kmermatcher ./mms_smallDB ./mmseqs_output/tmp/5351426679731834765/linclust/262265298633898384/pref --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --alph-size aa:13,nucl:5 --min-seq-id 0.4 --kmer-per-seq 21 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 0 --cov-mode 0 -k 0 -c 0.8 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 40 --compressed 0 -v 3
Database size: 100000 type: Aminoacid
Reduced amino acid alphabet: (A S T) (C) (D B N) (E Q Z) (F Y) (G) (H) (I V) (K R) (L J M) (P) (W) (X)
Generate k-mers list for 1 split
[=================================================================] 100.00K 0s 853ms
Sort kmer 0h 0m 0s 789ms
Sort by rep. sequence 0h 0m 0s 951ms
Time for fill: 0h 0m 0s 155ms
Time for merging to pref: 0h 0m 0s 0ms
Time for processing: 0h 0m 3s 166ms
rescorediagonal ./mms_smallDB ./mms_smallDB ./mmseqs_output/tmp/5351426679731834765/linclust/262265298633898384/pref ./mmseqs_output/tmp/5351426679731834765/linclust/262265298633898384/pref_rescore1 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --rescore-mode 0 --wrapped-scoring 0 --filter-hits 0 -e 0.001 -c 0.8 -a 0 --cov-mode 0 --min-seq-id 0.5 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 40 --compressed 0 -v 3
[=================================================================] 100.00K 0s 206ms
Time for merging to pref_rescore1: 0h 0m 0s 757ms
Time for processing: 0h 0m 1s 988ms
clust ./mms_smallDB ./mmseqs_output/tmp/5351426679731834765/linclust/262265298633898384/pref_rescore1 ./mmseqs_output/tmp/5351426679731834765/linclust/262265298633898384/pre_clust --cluster-mode 1 --max-iterations 1000 --similarity-type 2 --threads 40 --compressed 0 -v 3
Clustering mode: Connected Component
[=================================================================] 100.00K 0s 367ms
Sort entries
Find missing connections
Found 245160 new connections.
Reconstruct initial order
[=================================================================] 100.00K 0s 304ms
Add missing connections
[=================================================================] 100.00K 0s 8ms
Time for read in: 0h 0m 1s 971ms
connected component mode
Total time: 0h 0m 3s 258ms
Size of the sequence database: 100000
Size of the alignment database: 100000
Number of clusters: 31321
Writing results 0h 0m 0s 6ms
Time for merging to pre_clust: 0h 0m 0s 0ms
Time for processing: 0h 0m 3s 597ms
createsubdb ./mmseqs_output/tmp/5351426679731834765/linclust/262265298633898384/order_redundancy ./mms_smallDB ./mmseqs_output/tmp/5351426679731834765/linclust/262265298633898384/input_step_redundancy -v 3 --subdb-mode 1
Time for merging to input_step_redundancy: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 14ms
createsubdb ./mmseqs_output/tmp/5351426679731834765/linclust/262265298633898384/order_redundancy ./mmseqs_output/tmp/5351426679731834765/linclust/262265298633898384/pref ./mmseqs_output/tmp/5351426679731834765/linclust/262265298633898384/pref_filter1 -v 3 --subdb-mode 1
Time for merging to pref_filter1: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 422ms
filterdb ./mmseqs_output/tmp/5351426679731834765/linclust/262265298633898384/pref_filter1 ./mmseqs_output/tmp/5351426679731834765/linclust/262265298633898384/pref_filter2 --filter-file ./mmseqs_output/tmp/5351426679731834765/linclust/262265298633898384/order_redundancy --threads 40 --compressed 0 -v 3
Filtering using file(s)
[=================================================================] 31.32K 0s 100ms
Time for merging to pref_filter2: 0h 0m 0s 137ms
Time for processing: 0h 0m 0s 847ms
rescorediagonal ./mmseqs_output/tmp/5351426679731834765/linclust/262265298633898384/input_step_redundancy ./mmseqs_output/tmp/5351426679731834765/linclust/262265298633898384/input_step_redundancy ./mmseqs_output/tmp/5351426679731834765/linclust/262265298633898384/pref_filter2 ./mmseqs_output/tmp/5351426679731834765/linclust/262265298633898384/pref_rescore2 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --rescore-mode 1 --wrapped-scoring 0 --filter-hits 1 -e 0.001 -c 0.8 -a 0 --cov-mode 0 --min-seq-id 0.4 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 40 --compressed 0 -v 3
[=================================================================] 31.32K 0s 42ms
Time for merging to pref_rescore2: 0h 0m 0s 90ms
Time for processing: 0h 0m 0s 772ms
align ./mmseqs_output/tmp/5351426679731834765/linclust/262265298633898384/input_step_redundancy ./mmseqs_output/tmp/5351426679731834765/linclust/262265298633898384/input_step_redundancy ./mmseqs_output/tmp/5351426679731834765/linclust/262265298633898384/pref_rescore2 ./mmseqs_output/tmp/5351426679731834765/linclust/262265298633898384/aln --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.001 --min-seq-id 0.4 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 40 --compressed 0 -v 3
Compute score, coverage and sequence identity
Query database size: 31321 type: Aminoacid
Target database size: 31321 type: Aminoacid
Calculation of alignments
[=================================================================] 31.32K 3s 713ms
Time for merging to aln: 0h 0m 0s 107ms
53166 alignments calculated
45707 sequence pairs passed the thresholds (0.859704 of overall calculated)
1.459308 hits per query sequence
Time for processing: 0h 0m 4s 203ms
clust ./mmseqs_output/tmp/5351426679731834765/linclust/262265298633898384/input_step_redundancy ./mmseqs_output/tmp/5351426679731834765/linclust/262265298633898384/aln ./mmseqs_output/tmp/5351426679731834765/linclust/262265298633898384/clust --cluster-mode 1 --max-iterations 1000 --similarity-type 2 --threads 40 --compressed 0 -v 3
Clustering mode: Connected Component
[=================================================================] 31.32K 0s 215ms
Sort entries
Find missing connections
Found 14386 new connections.
Reconstruct initial order
[=================================================================] 31.32K 0s 218ms
Add missing connections
[=================================================================] 31.32K 0s 1ms
Time for read in: 0h 0m 1s 273ms
connected component mode
Total time: 0h 0m 1s 458ms
Size of the sequence database: 31321
Size of the alignment database: 31321
Number of clusters: 20942
Writing results 0h 0m 0s 96ms
Time for merging to clust: 0h 0m 0s 0ms
Time for processing: 0h 0m 1s 683ms
mergeclusters ./mms_smallDB ./mmseqs_output/tmp/5351426679731834765/clu_redundancy ./mmseqs_output/tmp/5351426679731834765/linclust/262265298633898384/pre_clust ./mmseqs_output/tmp/5351426679731834765/linclust/262265298633898384/clust --threads 40 --compressed 0 -v 3
Clustering step 1
[=================================================================] 31.32K 0s 36ms
Clustering step 2
[=================================================================] 20.94K 0s 74ms
Write merged clustering
[=================================================================] 100.00K 0s 404ms
Time for merging to clu_redundancy: 0h 0m 0s 145ms
Time for processing: 0h 0m 0s 639ms
createsubdb ./mmseqs_output/tmp/5351426679731834765/clu_redundancy ./mms_smallDB ./mmseqs_output/tmp/5351426679731834765/input_step_redundancy -v 3 --subdb-mode 1
Time for merging to input_step_redundancy: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 13ms
prefilter ./mmseqs_output/tmp/5351426679731834765/input_step_redundancy ./mmseqs_output/tmp/5351426679731834765/input_step_redundancy ./mmseqs_output/tmp/5351426679731834765/pref_step0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 1 -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 200 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 0 --comp-bias-corr-scale 1 --diag-score 0 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 0 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 40 --compressed 0 -v 3
Query database size: 20942 type: Aminoacid
Estimated memory consumption: 1G
Target database size: 20942 type: Aminoacid
Index table k-mer threshold: 154 at k-mer size 6
Index table: counting k-mers
[=================================================================] 20.94K 0s 601ms
Index table: Masked residues: 6638
Index table: fill
[=================================================================] 20.94K 0s 645ms
Index statistics
Entries: 1435009
DB size: 496 MB
Avg k-mer size: 0.022422
Top 10 k-mers
GPGGTL 342
LDMPDG 185
LGDYKP 145
DVLDMP 119
PFLEAR 69
PFPEAR 65
FDDTDS 59
ADYTFS 55
LITRGY 55
GPGGTT 44
Time for index table init: 0h 0m 2s 668ms
Process prefiltering step 1 of 1
k-mer similarity threshold: 154
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 20942
Target db start 1 to 20942
[=================================================================] 20.94K 0s 928ms
1.256278 k-mers per position
118 DB matches per sequence
0 overflows
0 queries produce too many hits (truncated result)
8 sequences passed prefiltering per query sequence
3 median result list length
0 sequences with 0 size result lists
Time for merging to pref_step0: 0h 0m 0s 51ms
Time for processing: 0h 0m 6s 669ms
align ./mmseqs_output/tmp/5351426679731834765/input_step_redundancy ./mmseqs_output/tmp/5351426679731834765/input_step_redundancy ./mmseqs_output/tmp/5351426679731834765/pref_step0 ./mmseqs_output/tmp/5351426679731834765/aln_step0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.001 --min-seq-id 0.4 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 0 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 40 --compressed 0 -v 3
Compute score, coverage and sequence identity
Query database size: 20942 type: Aminoacid
Target database size: 20942 type: Aminoacid
Calculation of alignments
[=================================================================] 20.94K 15s 380ms
Time for merging to aln_step0: 0h 0m 0s 75ms
172065 alignments calculated
67554 sequence pairs passed the thresholds (0.392607 of overall calculated)
3.225766 hits per query sequence
Time for processing: 0h 0m 16s 166ms
clust ./mmseqs_output/tmp/5351426679731834765/input_step_redundancy ./mmseqs_output/tmp/5351426679731834765/aln_step0 ./mmseqs_output/tmp/5351426679731834765/clu_step0 --cluster-mode 1 --max-iterations 1000 --similarity-type 2 --threads 40 --compressed 0 -v 3
Clustering mode: Connected Component
[=================================================================] 20.94K 0s 211ms
Sort entries
Find missing connections
Found 98 new connections.
Reconstruct initial order
[=================================================================] 20.94K 0s 218ms
Add missing connections
[=================================================================] 20.94K 0s 1ms
Time for read in: 0h 0m 1s 264ms
connected component mode
Total time: 0h 0m 1s 477ms
Size of the sequence database: 20942
Size of the alignment database: 20942
Number of clusters: 10966
Writing results 0h 0m 0s 66ms
Time for merging to clu_step0: 0h 0m 0s 4ms
Time for processing: 0h 0m 1s 628ms
createsubdb ./mmseqs_output/tmp/5351426679731834765/clu_step0 ./mmseqs_output/tmp/5351426679731834765/input_step_redundancy ./mmseqs_output/tmp/5351426679731834765/input_step1 -v 3 --subdb-mode 1
Time for merging to input_step1: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 7ms
prefilter ./mmseqs_output/tmp/5351426679731834765/input_step1 ./mmseqs_output/tmp/5351426679731834765/input_step1 ./mmseqs_output/tmp/5351426679731834765/pref_step1 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 4 -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 200 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 40 --compressed 0 -v 3
Query database size: 10966 type: Aminoacid
Estimated memory consumption: 1010M
Target database size: 10966 type: Aminoacid
Index table k-mer threshold: 127 at k-mer size 6
Index table: counting k-mers
[=================================================================] 10.97K 0s 560ms
Index table: Masked residues: 4144
Index table: fill
[=================================================================] 10.97K 0s 667ms
Index statistics
Entries: 1798942
DB size: 498 MB
Avg k-mer size: 0.028108
Top 10 k-mers
IGAALA 68
GPGGTL 58
GIVAPG 43
ALTAGI 42
ALGNGK 34
GLGNGK 32
ELPGVN 31
DLLDLP 29
GQQVAR 24
GEQVAR 23
Time for index table init: 0h 0m 2s 664ms
Process prefiltering step 1 of 1
k-mer similarity threshold: 127
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 10966
Target db start 1 to 10966
[=================================================================] 10.97K 3s 91ms
46.510777 k-mers per position
438 DB matches per sequence
0 overflows
0 queries produce too many hits (truncated result)
13 sequences passed prefiltering per query sequence
7 median result list length
0 sequences with 0 size result lists
Time for merging to pref_step1: 0h 0m 0s 41ms
Time for processing: 0h 0m 8s 706ms
align ./mmseqs_output/tmp/5351426679731834765/input_step1 ./mmseqs_output/tmp/5351426679731834765/input_step1 ./mmseqs_output/tmp/5351426679731834765/pref_step1 ./mmseqs_output/tmp/5351426679731834765/aln_step1 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.001 --min-seq-id 0.4 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 40 --compressed 0 -v 3
Compute score, coverage and sequence identity
Query database size: 10966 type: Aminoacid
Target database size: 10966 type: Aminoacid
Calculation of alignments
[=================================================================] 10.97K 9s 362ms
Time for merging to aln_step1: 0h 0m 0s 91ms
128470 alignments calculated
17027 sequence pairs passed the thresholds (0.132537 of overall calculated)
1.552708 hits per query sequence
Time for processing: 0h 0m 9s 872ms
clust ./mmseqs_output/tmp/5351426679731834765/input_step1 ./mmseqs_output/tmp/5351426679731834765/aln_step1 ./mmseqs_output/tmp/5351426679731834765/clu_step1 --cluster-mode 1 --max-iterations 1000 --similarity-type 2 --threads 40 --compressed 0 -v 3
Clustering mode: Connected Component
[=================================================================] 10.97K 0s 3ms
Sort entries
Find missing connections
Found 475 new connections.
Reconstruct initial order
[=================================================================] 10.97K 0s 5ms
Add missing connections
[=================================================================] 10.97K 0s 0ms
Time for read in: 0h 0m 0s 613ms
connected component mode
Total time: 0h 0m 0s 705ms
Size of the sequence database: 10966
Size of the alignment database: 10966
Number of clusters: 8338
Writing results 0h 0m 0s 47ms
Time for merging to clu_step1: 0h 0m 0s 4ms
Time for processing: 0h 0m 0s 815ms
createsubdb ./mmseqs_output/tmp/5351426679731834765/clu_step1 ./mmseqs_output/tmp/5351426679731834765/input_step1 ./mmseqs_output/tmp/5351426679731834765/input_step2 -v 3 --subdb-mode 1
Time for merging to input_step2: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 8ms
prefilter ./mmseqs_output/tmp/5351426679731834765/input_step2 ./mmseqs_output/tmp/5351426679731834765/input_step2 ./mmseqs_output/tmp/5351426679731834765/pref_step2 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 7 -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 200 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 40 --compressed 0 -v 3
Query database size: 8338 type: Aminoacid
Estimated memory consumption: 1003M
Target database size: 8338 type: Aminoacid
Index table k-mer threshold: 100 at k-mer size 6
Index table: counting k-mers
[=================================================================] 8.34K 0s 514ms
Index table: Masked residues: 3074
Index table: fill
[=================================================================] 8.34K 0s 572ms
Index statistics
Entries: 1408015
DB size: 496 MB
Avg k-mer size: 0.022000
Top 10 k-mers
GPGGTL 37
GLGNGK 26
ALGNGK 23
DLLDLP 21
FDDTDS 20
NGGSLK 17
DLLDMP 17
DVLDMP 17
GEQVAR 16
FDDTDT 16
Time for index table init: 0h 0m 2s 591ms
Process prefiltering step 1 of 1
k-mer similarity threshold: 100
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 8338
Target db start 1 to 8338
[=================================================================] 8.34K 26s 907ms
903.365687 k-mers per position
4641 DB matches per sequence
0 overflows
0 queries produce too many hits (truncated result)
88 sequences passed prefiltering per query sequence
76 median result list length
0 sequences with 0 size result lists
Time for merging to pref_step2: 0h 0m 0s 36ms
Time for processing: 0h 0m 32s 520ms
align ./mmseqs_output/tmp/5351426679731834765/input_step2 ./mmseqs_output/tmp/5351426679731834765/input_step2 ./mmseqs_output/tmp/5351426679731834765/pref_step2 ./mmseqs_output/tmp/5351426679731834765/aln_step2 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.001 --min-seq-id 0.4 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 40 --compressed 0 -v 3
Compute score, coverage and sequence identity
Query database size: 8338 type: Aminoacid
Target database size: 8338 type: Aminoacid
Calculation of alignments
[=================================================================] 8.34K 17s 958ms
Time for merging to aln_step2: 0h 0m 0s 88ms
489475 alignments calculated
8622 sequence pairs passed the thresholds (0.017615 of overall calculated)
1.034061 hits per query sequence
Time for processing: 0h 0m 18s 545ms
clust ./mmseqs_output/tmp/5351426679731834765/input_step2 ./mmseqs_output/tmp/5351426679731834765/aln_step2 ./mmseqs_output/tmp/5351426679731834765/clu_step2 --cluster-mode 1 --max-iterations 1000 --similarity-type 2 --threads 40 --compressed 0 -v 3
Clustering mode: Connected Component
[=================================================================] 8.34K 0s 2ms
Sort entries
Find missing connections
Found 28 new connections.
Reconstruct initial order
[=================================================================] 8.34K 0s 2ms
Add missing connections
[=================================================================] 8.34K 0s 0ms
Time for read in: 0h 0m 0s 408ms
connected component mode
Total time: 0h 0m 0s 491ms
Size of the sequence database: 8338
Size of the alignment database: 8338
Number of clusters: 8185
Writing results 0h 0m 0s 23ms
Time for merging to clu_step2: 0h 0m 0s 4ms
Time for processing: 0h 0m 0s 572ms
mergeclusters ./mms_smallDB ./mmseqs_output/tmp/5351426679731834765/clu ./mmseqs_output/tmp/5351426679731834765/clu_redundancy ./mmseqs_output/tmp/5351426679731834765/clu_step0 ./mmseqs_output/tmp/5351426679731834765/clu_step1 ./mmseqs_output/tmp/5351426679731834765/clu_step2
Clustering step 1
[=================================================================] 20.94K 0s 219ms
Clustering step 2
[=================================================================] 10.97K 0s 427ms
Clustering step 3
[=================================================================] 8.34K 0s 657ms
Clustering step 4
[=================================================================] 8.19K 0s 758ms
Write merged clustering
[=================================================================] 100.00K 0s 956ms
Time for merging to clu: 0h 0m 0s 164ms
Time for processing: 0h 0m 1s 268ms
align ./mms_smallDB ./mms_smallDB ./mmseqs_output/tmp/5351426679731834765/clu ./mmseqs_output/tmp/5351426679731834765/aln --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.001 --min-seq-id 0.4 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 40 --compressed 0 -v 3
Compute score, coverage and sequence identity
Query database size: 100000 type: Aminoacid
Target database size: 100000 type: Aminoacid
Calculation of alignments
[=================================================================] 8.19K 8s 160ms
Time for merging to aln: 0h 0m 0s 15ms
99829 alignments calculated
73771 sequence pairs passed the thresholds (0.738974 of overall calculated)
9.012951 hits per query sequence
Time for processing: 0h 0m 8s 437ms
subtractdbs ./mmseqs_output/tmp/5351426679731834765/clu ./mmseqs_output/tmp/5351426679731834765/aln ./mmseqs_output/tmp/5351426679731834765/clu_not_accepted --e-profile 100000000 -e 100000000 --threads 40 --compressed 0 -v 3
subtractdbs ./mmseqs_output/tmp/5351426679731834765/clu ./mmseqs_output/tmp/5351426679731834765/aln ./mmseqs_output/tmp/5351426679731834765/clu_not_accepted --e-profile 100000000 -e 100000000 --threads 40 --compressed 0 -v 3
Remove ./mmseqs_output/tmp/5351426679731834765/aln ids from ./mmseqs_output/tmp/5351426679731834765/clu
[=================================================================] 8.19K 0s 263ms
Time for merging to clu_not_accepted: 0h 0m 0s 69ms
Time for processing: 0h 0m 0s 514ms
swapdb ./mmseqs_output/tmp/5351426679731834765/clu_not_accepted ./mmseqs_output/tmp/5351426679731834765/clu_not_accepted_swap --threads 40 --compressed 0 -v 3
[=================================================================] 8.19K 0s 2ms
Computing offsets.
[=================================================================] 8.19K 0s 1ms
Reading results.
[=================================================================] 8.19K 0s 1ms
Output database: ./mmseqs_output/tmp/5351426679731834765/clu_not_accepted_swap
[=================================================================] 100.00K 0s 116ms
Time for merging to clu_not_accepted_swap: 0h 0m 0s 143ms
Time for processing: 0h 0m 0s 577ms
subtractdbs ./mmseqs_output/tmp/5351426679731834765/clu ./mmseqs_output/tmp/5351426679731834765/clu_not_accepted ./mmseqs_output/tmp/5351426679731834765/clu_accepted --e-profile 100000000 -e 100000000 --threads 40 --compressed 0 -v 3
subtractdbs ./mmseqs_output/tmp/5351426679731834765/clu ./mmseqs_output/tmp/5351426679731834765/clu_not_accepted ./mmseqs_output/tmp/5351426679731834765/clu_accepted --e-profile 100000000 -e 100000000 --threads 40 --compressed 0 -v 3
Remove ./mmseqs_output/tmp/5351426679731834765/clu_not_accepted ids from ./mmseqs_output/tmp/5351426679731834765/clu
[=================================================================] 8.19K 0s 41ms
Time for merging to clu_accepted: 0h 0m 0s 137ms
Time for processing: 0h 0m 0s 277ms
createsubdb ./mmseqs_output/tmp/5351426679731834765/clu_not_accepted_swap ./mms_smallDB ./mmseqs_output/tmp/5351426679731834765/seq_wrong_assigned -v 3
Time for merging to seq_wrong_assigned: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 28ms
createsubdb ./mmseqs_output/tmp/5351426679731834765/clu ./mms_smallDB ./mmseqs_output/tmp/5351426679731834765/seq_seeds -v 3
Time for merging to seq_seeds: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 16ms
prefilter ./mmseqs_output/tmp/5351426679731834765/seq_wrong_assigned ./mmseqs_output/tmp/5351426679731834765/seq_seeds.merged ./mmseqs_output/tmp/5351426679731834765/seq_wrong_assigned_pref --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 7 -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 200 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 40 --compressed 0 -v 3
Query database size: 26229 type: Aminoacid
Estimated memory consumption: 1G
Target database size: 34414 type: Aminoacid
Index table k-mer threshold: 100 at k-mer size 6
Index table: counting k-mers
[=================================================================] 34.41K 1s 394ms
Index table: Masked residues: 8741
Index table: fill
[=================================================================] 34.41K 1s 378ms
Index statistics
Entries: 6295744
DB size: 524 MB
Avg k-mer size: 0.098371
Top 10 k-mers
DVLDMP 2320
PDVMRM 1368
DRQVAY 1181
PFPEAR 738
MPLGAT 728
MPMGAT 703
GQQVAR 620
ADYTFS 597
LTFLYV 568
VLLALS 518
Time for index table init: 0h 0m 4s 142ms
Process prefiltering step 1 of 1
k-mer similarity threshold: 100
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 26229
Target db start 1 to 34414
[=================================================================] 26.23K 1m 58s 7ms
775.834912 k-mers per position
60277 DB matches per sequence
0 overflows
0 queries produce too many hits (truncated result)
193 sequences passed prefiltering per query sequence
200 median result list length
0 sequences with 0 size result lists
Time for merging to seq_wrong_assigned_pref: 0h 0m 0s 56ms
Time for processing: 0h 2m 5s 612ms
swapdb ./mmseqs_output/tmp/5351426679731834765/seq_wrong_assigned_pref ./mmseqs_output/tmp/5351426679731834765/seq_wrong_assigned_pref_swaped --threads 40 --compressed 0 -v 3
[=================================================================] 26.23K 0s 396ms
Computing offsets.
[=================================================================] 26.23K 0s 384ms
Reading results.
[=================================================================] 26.23K 0s 441ms
Output database: ./mmseqs_output/tmp/5351426679731834765/seq_wrong_assigned_pref_swaped
[=================================================================] 100.00K 0s 144ms
Time for merging to seq_wrong_assigned_pref_swaped: 0h 0m 0s 19ms
Time for processing: 0h 0m 2s 119ms
align ./mmseqs_output/tmp/5351426679731834765/seq_seeds.merged ./mmseqs_output/tmp/5351426679731834765/seq_wrong_assigned ./mmseqs_output/tmp/5351426679731834765/seq_wrong_assigned_pref_swaped ./mmseqs_output/tmp/5351426679731834765/seq_wrong_assigned_pref_swaped_aln --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.001 --min-seq-id 0.4 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 40 --compressed 0 -v 3
Compute score, coverage and sequence identity
Query database size: 34414 type: Aminoacid
Target database size: 26229 type: Aminoacid
Calculation of alignments
[=================================================================] 34.29K 6m 32s 543ms
Time for merging to seq_wrong_assigned_pref_swaped_aln: 0h 0m 0s 85ms
4335308 alignments calculated
2294027 sequence pairs passed the thresholds (0.529150 of overall calculated)
66.900757 hits per query sequence
Time for processing: 0h 6m 33s 544ms
filterdb ./mmseqs_output/tmp/5351426679731834765/seq_wrong_assigned_pref_swaped_aln ./mmseqs_output/tmp/5351426679731834765/seq_wrong_assigned_pref_swaped_aln_ocol --trim-to-one-column --threads 40 --compressed 0 -v 3
Filtering using regular expression
[=================================================================] 34.29K 1s 15ms
Time for merging to seq_wrong_assigned_pref_swaped_aln_ocol: 0h 0m 0s 70ms
Time for processing: 0h 0m 1s 765ms
mergedbs ./mmseqs_output/tmp/5351426679731834765/seq_seeds.merged ./mmseqs_output/tmp/5351426679731834765/clu_accepted_plus_wrong ./mmseqs_output/tmp/5351426679731834765/clu_accepted ./mmseqs_output/tmp/5351426679731834765/seq_wrong_assigned_pref_swaped_aln_ocol --merge-stop-empty 0 --compressed 0 -v 3
Merging the results to ./mmseqs_output/tmp/5351426679731834765/clu_accepted_plus_wrong
[=================================================================] 34.41K 0s 26ms
Time for merging to clu_accepted_plus_wrong: 0h 0m 0s 5ms
Time for processing: 0h 0m 0s 53ms
tsv2db ./mmseqs_output/tmp/5351426679731834765/missing.single.seqs ./mmseqs_output/tmp/5351426679731834765/missing.single.seqs.db --output-dbtype 6 --compressed 0 -v 3
Output database type: Clustering
Time for merging to missing.single.seqs.db: 0h 0m 0s 12ms
Time for processing: 0h 0m 0s 34ms
mergedbs ./mms_smallDB ./mmseqs_output/tmp/5351426679731834765/clu_accepted_plus_wrong_plus_single ./mmseqs_output/tmp/5351426679731834765/clu_accepted_plus_wrong ./mmseqs_output/tmp/5351426679731834765/missing.single.seqs.db --merge-stop-empty 0 --compressed 0 -v 3
Merging the results to ./mmseqs_output/tmp/5351426679731834765/clu_accepted_plus_wrong_plus_single
[=================================================================] 100.00K 0s 35ms
Time for merging to clu_accepted_plus_wrong_plus_single: 0h 0m 0s 12ms
Time for processing: 0h 0m 0s 66ms
clust ./mms_smallDB ./mmseqs_output/tmp/5351426679731834765/clu_accepted_plus_wrong_plus_single ./mmseqs_output/mmseq_clu --cluster-mode 1 --max-iterations 1000 --similarity-type 2 --threads 40 --compressed 0 -v 3
Clustering mode: Connected Component
[=================================================================] 100.00K 0s 609ms
Sort entries
Find missing connections
Found 596106 new connections.
Reconstruct initial order
[=================================================================] 100.00K 0s 572ms
Add missing connections
[=================================================================] 100.00K 0s 324ms
Time for read in: 0h 0m 2s 881ms
connected component mode
Total time: 0h 0m 4s 86ms
Size of the sequence database: 100000
Size of the alignment database: 100000
Number of clusters: 8463
Writing results 0h 0m 0s 21ms
Time for merging to mmseq_clu: 0h 0m 0s 0ms
Time for processing: 0h 0m 4s 446ms
createtsv ./mms_smallDB ./mms_smallDB ./mmseqs_output/mmseq_clu ./mmseqs_output/mmseq_clu.tsv
MMseqs Version: 14.7e284
First sequence as representative false
Target column 1
Add full header false
Sequence source 0
Database output false
Threads 40
Compressed 0
Verbosity 3
Time for merging to mmseq_clu.tsv: 0h 0m 0s 64ms
Time for processing: 0h 0m 0s 533ms
Context
Providing context helps us come up with a solution and improve our documentation for the future.
Your Environment
Include as many relevant details about the environment you experienced the bug in.
- Git commit used (The string after "MMseqs Version:"): 14.7e284
- Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.): conda
- For self-compiled and Homebrew: Compiler and Cmake versions used and their invocation:
- Server specifications (especially CPU support for AVX2/SSE and amount of system memory):
- Operating system and version:
Perhaps I am reading your example wrong, but isn't the maximum sequence identity in your tsv table 0.4268774703557312? If so, it would seem that your inter-cluster sequence identity is below 50%.
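For what it's worth, here is a minimal sketch of how that comparison could be checked row by row. It assumes that `max_percid` is a fraction on the same 0-1 scale as `--min-seq-id`, and it uses three rows copied from the table in the issue, cut down to the two columns needed. The function name `rows_over` and its `threshold` parameter are made up for this illustration and are not part of MMseqs2.

```python
import csv
import io

# Three rows copied from the table in the issue, reduced to two columns.
# Assumption: max_percid is a fraction on the same scale as --min-seq-id.
TABLE = (
    "--min-seq-id\tmax_percid\n"
    "0.9\t0.4017857142857143\n"
    "0.4\t0.3841336116910229\n"
    "0.4\t0.4268774703557312\n"
)


def rows_over(tsv_text, threshold=None):
    """Return (limit, max_percid) pairs where max_percid exceeds the limit.

    If threshold is None, each row is compared against its own --min-seq-id;
    otherwise every row is compared against the fixed threshold.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    over = []
    for row in reader:
        limit = float(row["--min-seq-id"]) if threshold is None else threshold
        observed = float(row["max_percid"])
        if observed > limit:
            over.append((limit, observed))
    return over


# Compare every row against the 50% figure mentioned above.
over_half = rows_over(TABLE, threshold=0.5)

# Compare every row against the --min-seq-id that was used for that run.
over_own = rows_over(TABLE)

print("rows above 0.5:", over_half)
print("rows above their own --min-seq-id:", over_own)
```

Running the same check over the full table would show which parameter combinations, if any, give a maximum inter-cluster identity above the threshold that was actually passed to `--min-seq-id`.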