
Is MMseqs2 required before Foldseek clustering?

Open YFeriel opened this issue 9 months ago • 2 comments

Hello Foldseek team @milot-mirdita @martin-steinegger

Thank you for the great tool and the very insightful Nature 2023 paper.

I have a few questions regarding the clustering workflow used on the AlphaFoldDB:

In the paper, MMseqs2 was used to cluster the 214M AFDB sequences at 50% identity and 90% overlap, producing 52M representative structures (AFDB50), which were then structurally clustered with Foldseek.
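For readers who want to see the two stages side by side, here is a minimal sketch that only builds the command lines and does not run them. It assumes the `easy-cluster` workflows of MMseqs2 and Foldseek and maps "50% identity and 90% overlap" onto `--min-seq-id 0.5 -c 0.9`. The helper names and file paths are hypothetical, and the exact options used for the paper are not stated in this thread.

```python
from typing import List, Optional


def mmseqs_cluster_cmd(fasta: str, prefix: str, tmp: str,
                       min_seq_id: float = 0.5,
                       coverage: float = 0.9) -> List[str]:
    """Build (but do not run) an MMseqs2 easy-cluster command line.

    The defaults mirror the thresholds quoted above; they are an
    assumption about how the thresholds translate into flags.
    """
    return [
        "mmseqs", "easy-cluster", fasta, prefix, tmp,
        "--min-seq-id", str(min_seq_id),
        "-c", str(coverage),
    ]


def foldseek_cluster_cmd(structures: str, prefix: str, tmp: str,
                         extra_args: Optional[List[str]] = None) -> List[str]:
    """Build (but do not run) a Foldseek easy-cluster command line.

    No thresholds are set here because the thread does not give them;
    pass any options you need through extra_args.
    """
    return ["foldseek", "easy-cluster", structures, prefix, tmp] + list(extra_args or [])


if __name__ == "__main__":
    # Hypothetical file and directory names, for illustration only.
    print(" ".join(mmseqs_cluster_cmd("all_sequences.fasta", "seq_clu", "tmp")))
    print(" ".join(foldseek_cluster_cmd("representatives/", "struct_clu", "tmp")))
```

The resulting lists can be passed to `subprocess.run` once the inputs exist. Keeping command construction separate from execution makes it easy to inspect the parameters first.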

My case: I have around 500,000 new predicted protein structures that are not present in AFDB and are non-redundant with existing structures.

My questions are:

Should I concatenate my 500K structures with the full 214M AlphaFoldDB, run MMseqs2 to obtain updated representatives, and then perform Foldseek clustering?

Or can I concatenate my 500K structures directly with AFDB50 (the 52M representatives) and run Foldseek clustering on the combined set?

In other words, is MMseqs2 a mandatory step to ensure clustering quality or structural coverage, or was it mainly a practical choice to reduce computation by removing redundant sequences?

I want to be sure I’m following the best approach for integrating and clustering my custom dataset.

Thank you very much for your help!

YFeriel, Apr 02 '25 19:04

We used MMseqs2 for two main reasons: (1) it is faster than clustering based purely on Foldseek, and (2) it allows us to select better-predicted representative structures (those with higher pLDDT) for structural clustering. Overall, running MMseqs2 first is more of a gain than a compromise: at 50% sequence identity and high coverage, the predicted structures are expected to be similar anyway.
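As an illustration of point (2), the following is a minimal sketch of picking the best-predicted member of each sequence cluster. It assumes cluster membership is available as (representative, member) pairs, as in the two-column TSV written by MMseqs2 clustering, and that a mean pLDDT per structure has been computed elsewhere. The function name and the example IDs are hypothetical, and this is not necessarily how the representatives were chosen for the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pick_representatives(pairs: Iterable[Tuple[str, str]],
                         plddt: Dict[str, float]) -> Dict[str, str]:
    """Map each sequence-cluster ID to its member with the highest mean pLDDT.

    pairs: (representative, member) rows, one per cluster member.
    plddt: mean pLDDT per structure ID; members missing a score rank last.
    """
    clusters = defaultdict(list)
    for rep, member in pairs:
        clusters[rep].append(member)
    return {
        rep: max(members, key=lambda m: plddt.get(m, float("-inf")))
        for rep, members in clusters.items()
    }


if __name__ == "__main__":
    # Toy data with made-up IDs and scores.
    rows = [("A", "A"), ("A", "B"), ("C", "C")]
    scores = {"A": 70.0, "B": 92.5, "C": 80.0}
    print(pick_representatives(rows, scores))
```

The structures selected this way would then be the input to the structural clustering step.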

martin-steinegger, Apr 04 '25 02:04

Thanks a lot for the clarification! @martin-steinegger

Just a quick follow-up: Let’s say we cluster the full 214M AlphaFoldDB structures directly with Foldseek and remove redundancy only afterward, manually, for example with a Python script that compares sequences and removes identical entries.

I noticed that the full AFDB (214M structures) contains around 29 million identical sequences, so I was wondering: would this post-clustering redundancy removal affect the quality or biological relevance of the clusters (e.g., diversity within clusters) compared to removing redundancy first with mmseqs2 cluster? Thanks again!
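To make the "Python script" idea concrete, here is a minimal sketch of what I mean by removing identical entries. It assumes plain FASTA input and treats two entries as redundant only when their sequences match exactly (case-insensitive). The function names are hypothetical, and holding every sequence in memory may not be practical for the full 214M entries.

```python
from typing import Dict, Iterable, List, Tuple


def read_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse FASTA lines into (id, sequence) pairs; the id is the first header token."""
    records: List[Tuple[str, str]] = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            tokens = line[1:].split()
            header, chunks = (tokens[0] if tokens else ""), []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def group_identical(records: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group entry IDs by exact sequence, preserving first-seen order."""
    groups: Dict[str, List[str]] = {}
    for name, seq in records:
        groups.setdefault(seq.upper(), []).append(name)
    return groups


def keep_first(groups: Dict[str, List[str]]) -> List[str]:
    """Return one ID per distinct sequence (the first one encountered)."""
    return [ids[0] for ids in groups.values()]


if __name__ == "__main__":
    # Toy input with made-up IDs.
    fasta = [">a example", "MKT", "AAA", ">b", "MKTAAA", ">c", "GGG"]
    groups = group_identical(read_fasta(fasta))
    print(keep_first(groups))
```

This keeps the first entry of each group. It does not take pLDDT into account when choosing which copy to keep.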

YFeriel, Apr 04 '25 17:04