(Question) leverage mmseqs for clustering with defined number of clusters?

Open paoslaos opened this issue 1 year ago • 2 comments

Dear developers,

Apologies if this is a naive question. Are there any recommended approaches, mmseqs settings, or output files that would make it possible to cluster the input sequences into a user-defined number of clusters?

Thank you!

paoslaos avatar Jan 17 '24 19:01 paoslaos

We don't implement any clustering like that, as it's usually not very meaningful biologically.

You can compute a sparse all-vs-all search and cluster based on the scores with whatever clustering algorithm you prefer, e.g. one that scikit-learn implements. You might want to increase --max-seqs in this case though, to report more than the top 300 alignments per query.
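As a rough illustration of clustering on search scores with a fixed cluster count, here is a minimal standard-library sketch of single-linkage agglomerative clustering that stops once `k` clusters remain. It is not part of MMseqs2; the function name `cluster_into_k`, the `(seq_a, seq_b, score)` edge format, and the toy scores are assumptions made for this example. A library implementation (for instance in scikit-learn) could be swapped in.

```python
def cluster_into_k(edges, names, k):
    """Single-linkage agglomerative clustering, stopped at k clusters.

    edges: iterable of (seq_a, seq_b, score) taken from the sparse search
    names: every sequence identifier, including sequences without hits
    k:     requested number of clusters
    """
    parent = {name: name for name in names}

    def find(x):
        # Walk up to the cluster representative, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    n_clusters = len(parent)
    # Merge the most similar pairs first and stop once k clusters remain.
    for a, b, _score in sorted(edges, key=lambda e: e[2], reverse=True):
        if n_clusters <= k:
            break
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a
            n_clusters -= 1

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return list(clusters.values())


# Hypothetical toy bit scores, not real MMseqs2 output.
toy_edges = [("s1", "s2", 250.0), ("s2", "s3", 180.0),
             ("s4", "s5", 300.0), ("s3", "s4", 40.0)]
toy_names = ["s1", "s2", "s3", "s4", "s5", "s6"]
clusters = cluster_into_k(toy_edges, toy_names, k=3)
print(clusters)
```

Because the search is sparse, the graph may have more than `k` disconnected groups. In that case this sketch returns more than `k` clusters, since it never merges sequences that share no reported hit.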

milot-mirdita avatar Jan 22 '24 05:01 milot-mirdita

Thanks for your answer. This is an interesting problem for many machine learning applications that need to avoid homology leakage. Here biology is not so important (for me at least). We want to be as fair as possible in this case.

So, if I understand correctly, this will do some prefiltering and then give back sparse similarity values, which can indeed be used for this purpose.
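For the homology-leakage use case, one possible way to turn sparse similarity values into a fixed number of groups is sketched below. It links every pair scoring at or above a cutoff, takes connected components, and greedily packs whole components into folds. This is an illustrative technique, not something the thread or MMseqs2 prescribes; `leakage_free_folds`, `min_score`, and the toy data are hypothetical.

```python
from collections import defaultdict, deque


def leakage_free_folds(edges, names, n_folds, min_score):
    """Assign sequences to n_folds so that no pair scoring >= min_score is split."""
    neighbours = defaultdict(set)
    for a, b, score in edges:
        if score >= min_score and a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Connected components via breadth-first search.
    seen, components = set(), []
    for start in names:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            node = queue.popleft()
            component.append(node)
            for other in neighbours[node]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        components.append(component)

    # Place the largest component first, always into the currently smallest fold.
    folds = [[] for _ in range(n_folds)]
    for component in sorted(components, key=len, reverse=True):
        min(folds, key=len).extend(component)
    return folds


# Hypothetical toy bit scores, not real MMseqs2 output.
toy_edges = [("a", "b", 120.0), ("b", "c", 90.0),
             ("d", "e", 200.0), ("c", "d", 20.0)]
folds = leakage_free_folds(toy_edges, ["a", "b", "c", "d", "e", "f"],
                           n_folds=2, min_score=50.0)
print(folds)
```

The number of folds is always exactly `n_folds`, but their sizes can be uneven (or a fold can stay empty) when a single component contains most of the sequences. Lowering `--max-seqs` or raising `min_score` too far can also hide real relationships, so the cutoff is a judgment call.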

Is this still the recommended way to do this, from the user guide?

fake_pref qdb tdb allvsallpref
mmseqs align qdb tdb allvsallpref allvsallaln
mmseqs convertalis qdb tdb allvsallaln allvsall.m8
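To use the resulting allvsall.m8 for clustering, the hits have to be read into pairwise scores. The sketch below assumes the default tab-separated column order of `mmseqs convertalis` (query, target, fident, alnlen, mismatch, gapopen, qstart, qend, tstart, tend, evalue, bits); if a custom `--format-output` was used, the column list would need adjusting. The helper name `read_m8_scores` and the toy rows are made up for illustration.

```python
import csv
import io

# Assumed column order: the default BLAST-tab layout written by convertalis.
M8_COLUMNS = ["query", "target", "fident", "alnlen", "mismatch", "gapopen",
              "qstart", "qend", "tstart", "tend", "evalue", "bits"]


def read_m8_scores(handle):
    """Return {(id_a, id_b): best bit score}, storing each unordered pair once."""
    best = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue
        record = dict(zip(M8_COLUMNS, row))
        query, target = record["query"], record["target"]
        if query == target:
            continue  # self hits carry no clustering information
        pair = tuple(sorted((query, target)))
        bits = float(record["bits"])
        if bits > best.get(pair, float("-inf")):
            best[pair] = bits
    return best


# Hypothetical rows in m8 layout, not real MMseqs2 output.
toy_m8 = (
    "s1\ts2\t0.910\t100\t9\t0\t1\t100\t1\t100\t1.0E-50\t180\n"
    "s2\ts1\t0.910\t100\t9\t0\t1\t100\t1\t100\t1.0E-50\t175\n"
    "s1\ts1\t1.000\t100\t0\t0\t1\t100\t1\t100\t1.0E-60\t210\n"
)
scores = read_m8_scores(io.StringIO(toy_m8))
print(scores)
```

For a real run, the same function can be given an open file, e.g. `with open("allvsall.m8", newline="") as fh: scores = read_m8_scores(fh)`. Keeping only the best score per unordered pair makes the similarity symmetric, which most clustering methods expect.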

Thank you! Sincerely, P.

paoslaos avatar May 29 '24 09:05 paoslaos