Why do divergent sequences cluster together with mmseqs2 easy-cluster?

Open CarlosSantanaMolina opened this issue 2 years ago • 1 comments

Dear,

I am using mmseqs2 to remove redundant sequences and isoforms from eukaryotic proteomes. However, we obtained some unexpected and undesired clusters, and we would like to understand what is going on. We would like to know whether this result is actually expected (which I doubt, because the two clustered sequences are divergent), whether it is a bug, or whether we overlooked something on the command line.

We used the proteome of Trichoplax adhaerens (https://www.uniprot.org/proteomes/UP000009022).

## Expected Behavior

As an example, we focus on two sequences that should be in different clusters, B3RQY5 and B3S420. Their pairwise similarity, aligned with needle, is the following:

  • Length: 643
  • Identity: 146/643 (22.7%)
  • Similarity: 199/643 (30.9%)
  • Gaps: 358/643 (55.7%)
  • Score: 682.5
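All three percentages above use the full alignment length (643 columns, gaps included) as the denominator. The sketch below recomputes them and, purely as an added illustration, also computes identity over the gap-free columns only. That last figure is not reported by needle or by MMseqs2; it is an assumption about what an identity restricted to aligned residues would look like.

```python
# Figures copied from the needle summary above.
alignment_length = 643
identical = 146
similar = 199
gap_columns = 358


def pct(part, whole):
    """Percentage rounded to one decimal place, as in the needle summary."""
    return round(100 * part / whole, 1)


identity_global = pct(identical, alignment_length)
similarity_global = pct(similar, alignment_length)
gap_fraction = pct(gap_columns, alignment_length)

# Illustration only (not a needle or MMseqs2 output): keep just the columns
# in which both sequences have a residue, i.e. drop every gap column.
aligned_columns = alignment_length - gap_columns
identity_ungapped = pct(identical, aligned_columns)

print(f"identity over all columns:      {identity_global}%")
print(f"similarity over all columns:    {similarity_global}%")
print(f"gap columns:                    {gap_fraction}%")
print(f"identity over gap-free columns: {identity_ungapped}% ({aligned_columns} columns)")
```

The gap-free identity is far higher than 22.7%, which may be part of why a local-alignment-based tool treats the pair as related. How MMseqs2 itself computes identity depends on its alignment settings and is not reproduced here.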

B3S420

DHVFANENDNSDVYQKVASPIVTAAMEGFNGTIFAYGQTSSGKTHTMMGNHNDPGVIPLA
VNEIFRYINQKPNREFLLRVSYMEIYNEVITDLLNPSNTNLKIHENQKKEVYVGSLTENI
VNSPSQILTIMTQGETHRHTGGTNMNERSSRSHTIFRMIIESREQNQDQNEADQDTAVKV
SALNLVDLAGSERVSQTGSEGIRLKEGGFINKSLLTLGSVIAKLSEGEGNFIPFRDSKLT
RILQSSLGGNALTAIICTVTPVSLDETSSTLKFASRAKKIKNKPEVNEVVDDE

B3RQY5

MNSEDACNIRVVCRVRPLNSAETHAGSEFIPKFPNEGQIVLSGKSFSFDHVLNSSTNQQS
MYDIAAKPIVKDVLAGYNGTIFAYGQTSSGKTHTMEGVIGDPEWQGIIPRIIGDIFAYIY
TMDENLEFHIKVSYFEIYMDKIRDLLDVTKTNLAVHEDKNRIPYVKNITERFVSSPEEVF
EIIDEGKSNRHVAVTNMNEHSSRSHSIFLIHIKQENVETHKSVHGKLYLVDLAGSEKVSK
TGAEGMVLDEAKNINKSLSALGNVISALSEATKSHVPYRDSKLTRILQESLGGNARTTII
ICCSPSSINESETKTTLQFGARAKTIKNSVKVNEELPAEEWKRRYEKEREKSSRIKRVLE
NYELELKKWRDGENVPVNEQAGGKDEGKLTSNHSTSKINIADALGESERVQFDEERNRLY
EQIDEKDDELNNRNTLIEQLRRQLEDKDEEFHLIKNESTRRQAQINALEDELQDSKDEVK
EVLNALEELYVNFDEKSRQLEVKSQEYEKNLEELMGIKSKLSNMEENFEETKDTEKRYKR
RVTESIKNLLQDMHEIGDVLQDEELKTAIAKDSEKVSDEELTLARLHFGKIKGELKILVS
RNHTIESERAELEKKLNVSEANLSENQLLLTEACF

## Current Behavior

Both sequences are in the same cluster when using easy-cluster.

## Steps to Reproduce (for bugs)

We used this command to obtain the result:

`mmseqs easy-cluster TAHD.fasta. TAHD tmp --cluster-mode 2 --cov-mode 1 -c 1`

## Context

We also tried clustering with easy-linclust, and that works 'fine', i.e. the two sequences end up in different clusters.
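One way to check whether two accessions ended up in the same cluster, for either workflow, is to read the two-column `<prefix>_cluster.tsv` file (representative, then member) that the easy workflows write. The following is a minimal sketch: the function names are hypothetical, the example rows are made up rather than taken from the actual run, and it assumes the identifiers appear in the TSV as bare accessions such as `B3RQY5`.

```python
import csv
import io


def load_clusters(handle):
    """Map each member ID to its representative ID from a two-column TSV."""
    member_to_rep = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        representative, member = row[0], row[1]
        member_to_rep[member] = representative
    return member_to_rep


def same_cluster(member_to_rep, id_a, id_b):
    """True only if both IDs are present and share a representative."""
    rep_a = member_to_rep.get(id_a)
    rep_b = member_to_rep.get(id_b)
    return rep_a is not None and rep_a == rep_b


# Made-up rows for illustration (not real output); X00001 is a hypothetical ID.
example_tsv = "B3RQY5\tB3RQY5\nB3RQY5\tB3S420\nX00001\tX00001\n"
clusters = load_clusters(io.StringIO(example_tsv))

print(same_cluster(clusters, "B3RQY5", "B3S420"))
print(same_cluster(clusters, "B3RQY5", "X00001"))
```

For a real run, the in-memory string would be replaced by something like `open("TAHD_cluster.tsv", newline="")`, assuming the output prefix `TAHD` from the command above.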

## Your Environment

  • MMseqs2 version: 113e3212c137d026e297c7540e1fcd039f6812b1
  • Using the mmseqs binary from eggNOG-mapper
  • Running on an HPC cluster (the same result is replicated on other systems)

Thank you very much in advance, Carlos Santana Molina

CarlosSantanaMolina, Mar 01 '22 16:03

This was a while ago, but for clustering you should pretty much always supply a sequence identity threshold with --min-seq-id.

The cascaded clustering of MMseqs2 can still put together sequences that fall outside the given thresholds. We have a separate parameter, --cluster-reassign 1, to "fix" the clustering after the cascaded clustering.
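Put together, the two suggestions amount to adding `--min-seq-id` and `--cluster-reassign 1` to the original command. The sketch below only assembles that command line and runs it if `mmseqs` and the input file are available. The helper name is hypothetical, the 0.9 threshold is an arbitrary placeholder rather than a recommended value, and the input is assumed to be `TAHD.fasta` (treating the trailing period in the reported command as a typo).

```python
import shutil
import subprocess
from pathlib import Path


def build_easy_cluster_cmd(fasta, prefix, tmp_dir, min_seq_id, reassign=True):
    """Assemble an easy-cluster call with the options from the original report
    plus an identity threshold and, optionally, cluster reassignment."""
    cmd = [
        "mmseqs", "easy-cluster", fasta, prefix, tmp_dir,
        "--cluster-mode", "2",
        "--cov-mode", "1",
        "-c", "1",
        "--min-seq-id", str(min_seq_id),
    ]
    if reassign:
        cmd += ["--cluster-reassign", "1"]
    return cmd


# 0.9 is a placeholder threshold chosen for illustration only.
cmd = build_easy_cluster_cmd("TAHD.fasta", "TAHD", "tmp", min_seq_id=0.9)
print(" ".join(cmd))

# Run only when the binary and the input file are actually present.
if shutil.which("mmseqs") and Path("TAHD.fasta").is_file():
    subprocess.run(cmd, check=True)
```

Which identity threshold is appropriate depends on how aggressively isoforms and near-duplicates should be collapsed, so the value is best chosen per dataset.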

milot-mirdita, Jul 05 '22 02:07