Is MMseqs2 required before Foldseek clustering?
Hello Foldseek team @milot-mirdita @martin-steinegger
Thank you for the great tool and the very insightful Nature 2023 paper.
I have a few questions regarding the clustering workflow used on the AlphaFoldDB:
In the paper, MMseqs2 was used to cluster the 214M AFDB sequences at 50% identity and 90% overlap, producing 52M representative structures (AFDB50), which were then structurally clustered with Foldseek.
My case: I have around 500,000 new predicted protein structures that are not present in AFDB and are non-redundant with existing structures.
My questions are:
Should I concatenate my 500K structures with the full 214M AlphaFoldDB, run MMseqs2 to obtain updated representatives, and then perform Foldseek clustering?
Or can I concatenate my 500K structures directly with AFDB50 (the 52M representatives) and run Foldseek clustering on the combined set?
In other words, is MMseqs2 a mandatory step to ensure clustering quality or structural coverage, or was it mainly a practical choice to reduce computation by removing redundant sequences?
I want to be sure I’m following the best approach for integrating and clustering my custom dataset.
Thank you very much for your help!
We used MMseqs2 for two main reasons: (1) it is faster than clustering based purely on Foldseek, and (2) it allows us to select better-predicted representative structures (those with higher pLDDT) for structural clustering. Overall, running MMseqs2 first is more of a gain than a compromise: at 50% sequence identity and high coverage, the predicted structures are expected to be similar anyway.
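To illustrate reason (2), here is a minimal Python sketch of choosing, within each sequence cluster, the member with the highest mean pLDDT as the representative. It is only an illustration, not the actual pipeline: the function name `pick_representatives` and the two input dictionaries are hypothetical, and it assumes you have already parsed the cluster assignments and per-structure mean pLDDT values yourself.

```python
from collections import defaultdict


def pick_representatives(cluster_of, mean_plddt):
    """Return {cluster id: member id with the highest mean pLDDT}.

    cluster_of  -- hypothetical dict: member id -> sequence-cluster id
    mean_plddt  -- hypothetical dict: member id -> mean pLDDT (float)
    """
    # Group member ids by the cluster they belong to.
    members = defaultdict(list)
    for member, cluster in cluster_of.items():
        members[cluster].append(member)

    # In each cluster, keep the best-predicted member
    # (ties go to the member seen first).
    return {
        cluster: max(ids, key=lambda m: mean_plddt[m])
        for cluster, ids in members.items()
    }


if __name__ == "__main__":
    cluster_of = {"a": "c1", "b": "c1", "c": "c2"}
    mean_plddt = {"a": 70.0, "b": 90.0, "c": 50.0}
    print(pick_representatives(cluster_of, mean_plddt))
```

The structural clustering step would then run only on the selected representatives.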
Thanks a lot for the clarification! @martin-steinegger
Just a quick follow-up: let’s say we cluster the full 214M AlphaFoldDB structures directly with Foldseek and remove redundancy only afterward, for example with a Python script that compares sequences and removes identical entries.
I noticed that the full AFDB (214M structures) contains around 29 million identical sequences, so I was wondering: would this post-clustering redundancy removal affect the quality or biological relevance of the clusters (e.g., diversity within clusters) compared to removing redundancy first with MMseqs2 clustering? Thanks again!
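For concreteness, the kind of post-clustering deduplication I have in mind is roughly the sketch below. The function name `drop_identical_sequences` and the `(entry_id, sequence)` input format are just assumptions for illustration. It only removes exact duplicates (case-insensitive) and keeps the first entry seen for each sequence, so it is much cruder than identity/coverage-based clustering.

```python
def drop_identical_sequences(records):
    """Return the ids to keep, one per distinct sequence.

    records -- iterable of (entry_id, sequence) pairs (hypothetical format)
    The first entry seen for each sequence is kept; later exact
    duplicates are dropped.
    """
    seen = set()  # normalized sequences already encountered
    kept = []     # ids of retained entries, in input order
    for entry_id, sequence in records:
        key = sequence.strip().upper()
        if key not in seen:
            seen.add(key)
            kept.append(entry_id)
    return kept


if __name__ == "__main__":
    records = [("A", "MKT"), ("B", "mkt"), ("C", "MKV")]
    print(drop_identical_sequences(records))  # ['A', 'C']
```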