(Re)producing stable representative clusters across linclust runs

Open · hmms117 opened this issue 2 years ago · 1 comment

Hi

Is MMseqs2 deterministic? When running linclust on a large FASTA file of proteins, I would expect to get very similar clusters when rerunning the same command on the same FASTA file (default linclust parameters, plus --min-seq-id 0.95 -c 0.8).

Input: FASTA file with ~500 million nearly identical sequences (size slowly incrementing between runs, order of sequences may change). Also tested with exactly the same sequences where only the order of sequences changed.

Current Behavior

About 10-20% of clusters change from one run to the next.
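
For anyone wanting to put a number on this, here is a minimal sketch of one way to measure how many clusters differ between two runs. It assumes each run has been exported as a two-column TSV of representative and member IDs (one pair per line, tab-separated); the function names `load_clusters` and `changed_fraction` are hypothetical helpers, not part of MMseqs2. Clusters are compared by member set only, so a cluster that merely got a different representative is not counted as changed.

```python
from collections import defaultdict


def load_clusters(lines):
    """Group member IDs by representative from 'rep<TAB>member' lines."""
    groups = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        rep, member = line.split("\t")[:2]
        groups[rep].add(member)
    # Keep only the member sets, so the choice of representative
    # does not affect the comparison.
    return {frozenset(members) for members in groups.values()}


def changed_fraction(lines_a, lines_b):
    """Fraction of clusters in run A with no identical member set in run B."""
    clusters_a = load_clusters(lines_a)
    clusters_b = load_clusters(lines_b)
    if not clusters_a:
        return 0.0
    return len(clusters_a - clusters_b) / len(clusters_a)


if __name__ == "__main__":
    run_a = ["s1\ts1", "s1\ts2", "s3\ts3"]
    run_b = ["s2\ts2", "s2\ts1", "s3\ts3"]  # same members, new representative
    print(changed_fraction(run_a, run_b))
```

In practice the two arguments could be open file handles for the two TSV files. At ~500 million sequences the member sets may not fit in memory, so this is only meant for a subsample.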

Version: latest daily, Ubuntu 20.04, 96-core AMD server

Are there any tricks to produce stable clusters, e.g. changing the k-mers per sequence or sorting the sequences?
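
To make the "sorting the sequences" idea concrete, here is a rough sketch that rewrites a FASTA file in a canonical order (sorted by header, then by sequence) so that every run at least sees the same input order. The helper names `read_fasta` and `canonical_fasta` are hypothetical, and whether a fixed input order is enough to stabilize linclust output is an open question here, not something confirmed.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def canonical_fasta(lines):
    """Return the FASTA text with records sorted by header, then sequence."""
    records = sorted(read_fasta(lines))
    return "".join(f">{header}\n{seq}\n" for header, seq in records)


if __name__ == "__main__":
    shuffled = [">b", "MKV", ">a", "MKT", "LLA"]
    print(canonical_fasta(shuffled), end="")
```

This version sorts in memory, which will not scale to ~500 million sequences; an external sort would be needed at that size.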

Many thanks!

hmms117 · Jan 31 '23 22:01

I have a similar issue. It would be helpful to have reproducibility support for clustering.

SimonKitSangChu · May 05 '23 23:05