Discrepancy in Foldseek clustering results compared to AFDB clusters
Hello,
I concatenated my database containing approximately 600k protein structures with the AlphaFold database I downloaded using foldseek databases. I then performed clustering on the concatenated database using default parameters, resulting in approximately 2.7 million clusters.
foldseek concatdbs /data/foldseek/openprot_proteins_db /data/foldseek/sp /data/foldseek/concat_db
foldseek concatdbs /data/foldseek/openprot_proteins_db_ca /data/foldseek/sp_ca /data/foldseek/concat_db_ca
foldseek concatdbs /data/foldseek/openprot_proteins_db_h /data/foldseek/sp_h /data/foldseek/concat_db_h
foldseek concatdbs /data/foldseek/openprot_proteins_db_ss /data/foldseek/sp_ss /data/foldseek/concat_db_ss
foldseek cluster /data/foldseek/concat_db /data/cluster_results $1/tmp_clusters -k 7 --threads 64
When comparing my clusters with yours, I noticed that only 270k clusters (about 10%) have the same representative protein as yours. The remaining clusters have different representative proteins. Do you know why there is this discrepancy? I am aware that in your analysis, the representative protein is chosen based on the highest pLDDT, and this is done by MMseqs2. However, I did not use MMseqs2 for clustering; I directly used Foldseek. Could this explain the difference in results? If so, on what criteria does Foldseek base its choice of representative protein?
Next, I took a closer look at the 270k clusters that are common with your results. I annotated these clusters as "annotated" or "non-annotated" (dark clusters) based on your file 2-repId_isDark_nMem_repLen_avgLen_repPlddt_avgPlddt_LCAtaxId.tsv.gz. I then wanted to directly add the Pfam and GO annotations you provided to associate them with my common clusters, using the files 4-domain-clustering.zip and 3-sapId_sapGO_repId_cluFlag_LCAtaxId.tsv.gz. However, I could not find all of your clusters marked as annotated. Could you please explain in more detail the content of these files? It seems that at the end of your analysis, only 700k clusters out of the total 2.3M clusters were considered dark clusters. However, I cannot find the remaining annotated clusters in your data at afdb-cluster.steineggerlab.workers.dev.
Thank you in advance for your help!
Hello @martin-steinegger @milot-mirdita ,
I hope this message finds you well. I wanted to follow up on an issue I posted earlier regarding the clustering results and annotations. I understand that you may be busy, but I would greatly appreciate your insights on the following points:
Cluster Discrepancy:
I concatenated my database of ~600k protein structures with the AlphaFold database using Foldseek and performed clustering with default parameters, resulting in ~2.7M clusters. However, when comparing my clusters with yours, only ~270k clusters (10%) share the same representative protein. Since I used Foldseek directly for clustering (not MMseqs2), could this explain the difference? If so, on what criteria does Foldseek base its choice of representative protein?
Annotation Files:
I examined the 270k clusters that overlap with your results and annotated them as "annotated" or "non-annotated" (dark clusters) based on your file 2-repId_isDark_nMem_repLen_avgLen_repPlddt_avgPlddt_LCAtaxId.tsv.gz. I then tried to add Pfam and GO annotations using the files 4-domain-clustering.zip and 3-sapId_sapGO_repId_cluFlag_LCAtaxId.tsv.gz. However, I couldn't find all of your clusters marked as annotated. Could you clarify the content of these files and how they relate to the final annotated clusters?
Missing Annotated Clusters:
It seems that only ~700k clusters out of the total 2.3M clusters were classified as dark clusters in your analysis. However, I couldn't locate the remaining annotated clusters in your data at afdb-cluster.steineggerlab.workers.dev. Could you provide more details on how to access or interpret this information?
Your guidance would be invaluable in helping me understand these discrepancies and properly annotate my clusters. Thank you in advance for your time and support!
Best regards,
Cluster Discrepancy: I concatenated my database of ~600k protein structures with the AlphaFold database using Foldseek and performed clustering with default parameters, resulting in ~2.7M clusters. However, when comparing my clusters with yours, only ~270k clusters (10%) share the same representative protein. Since I used Foldseek directly for clustering (not MMseqs2), could this explain the difference? If so, on what criteria does Foldseek base its choice of representative protein?
During clustering, Foldseek uses the Linclust algorithm. In its first round of comparisons between entries, Linclust selects as representatives the nodes with the most aligned hits; this is described in the MMseqs2 wiki. Concatenating the databases can therefore change the representatives, because it changes the number of hits each entry receives.
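To make the effect concrete, here is a toy Python sketch of a greedy "most hits wins" selection. It is only an illustration under simplifying assumptions (a tiny symmetric hit graph, ties broken alphabetically), not Foldseek's or MMseqs2's actual implementation. The entry names `A`, `B`, `C`, `X1`..`X3` and both helper functions are made up for the example.

```python
from collections import defaultdict


def build_hits(edges):
    """Turn a list of (a, b) alignment pairs into a symmetric neighbour map."""
    hits = defaultdict(set)
    for a, b in edges:
        hits[a].add(b)
        hits[b].add(a)
    return dict(hits)


def greedy_set_cover(hits):
    """Repeatedly pick the entry with the most unassigned neighbours as representative."""
    remaining = set(hits)
    clusters = {}
    while remaining:
        # Iterating in sorted order makes ties resolve to the alphabetically first entry.
        rep = max(sorted(remaining), key=lambda e: len(hits[e] & remaining))
        members = (hits[rep] & remaining) | {rep}
        clusters[rep] = members
        remaining -= members
    return clusters


# Hypothetical "reference only" database: A aligns to B and C.
afdb_edges = [("A", "B"), ("A", "C")]
# Hypothetical extra entries from a second database that all align to B.
own_edges = [("B", "X1"), ("B", "X2"), ("B", "X3")]

before = greedy_set_cover(build_hits(afdb_edges))
after = greedy_set_cover(build_hits(afdb_edges + own_edges))

print("representatives before concatenation:", sorted(before))
print("representatives after concatenation:", sorted(after))
```

In this toy case `A` represents the single cluster before concatenation, while afterwards `B` collects more hits and takes over, leaving `C` as a singleton. The same kind of shift can happen at scale when ~600k extra structures are added to the database.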
Annotation Files: I examined the 270k clusters that overlap with your results and annotated them as "annotated" or "non-annotated" (dark clusters) based on your file 2-repId_isDark_nMem_repLen_avgLen_repPlddt_avgPlddt_LCAtaxId.tsv.gz. I then tried to add Pfam and GO annotations using the files 4-domain-clustering.zip and 3-sapId_sapGO_repId_cluFlag_LCAtaxId.tsv.gz. However, I couldn't find all of your clusters marked as annotated. Could you clarify the content of these files and how they relate to the final annotated clusters?
As I understand it, you are trying to label your clusters as annotated or non-annotated based on our dark clusters. But as mentioned above, your representative proteins can differ from ours. File 4 contains the all-vs-all alignment results, and file 3 contains Gene Ontology annotations for human proteins only, so it does not cover all AFDB entries.
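If it helps, a lookup against file 2 could be sketched in Python roughly as follows. This assumes the file is tab-separated without a header row, that its columns follow the order given in the file name, and that `isDark` is encoded as `1` for dark clusters; please check all three against the actual file. The function names and the example rows (`REP1`, `REP2`, `MYPROT1`) are made up for illustration.

```python
import csv
import gzip

# Column order taken from the file name; assumed, not verified against the file itself.
COLUMNS = ["repId", "isDark", "nMem", "repLen", "avgLen",
           "repPlddt", "avgPlddt", "LCAtaxId"]


def parse_cluster_table(lines):
    """Map each representative ID to its row, given an iterable of TSV lines."""
    reader = csv.reader(lines, delimiter="\t")
    return {row[0]: dict(zip(COLUMNS, row)) for row in reader if row}


def load_cluster_table(path):
    """Read the gzipped TSV from disk (not called in this sketch)."""
    with gzip.open(path, "rt", newline="") as handle:
        return parse_cluster_table(handle)


def split_by_annotation(my_reps, table):
    """Split representatives into annotated, dark, and not present in the table."""
    shared = [r for r in my_reps if r in table]
    dark = [r for r in shared if table[r]["isDark"] == "1"]  # assumed encoding
    annotated = [r for r in shared if table[r]["isDark"] != "1"]
    missing = [r for r in my_reps if r not in table]
    return annotated, dark, missing


# Made-up rows standing in for the real file contents.
example_lines = [
    "REP1\t0\t12\t300\t280.5\t91.2\t85.0\t9606",
    "REP2\t1\t3\t150\t149.0\t78.4\t75.1\t2",
]
table = parse_cluster_table(example_lines)
annotated, dark, missing = split_by_annotation(["REP1", "REP2", "MYPROT1"], table)
print(annotated, dark, missing)
```

Representatives that end up in `missing` are the ones where your clustering picked a different representative than ours, so they cannot be labeled by a direct ID match.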
Missing Annotated Clusters: It seems that only ~700k clusters out of the total 2.3M clusters were classified as dark clusters in your analysis. However, I couldn't locate the remaining annotated clusters in your data at afdb-cluster.steineggerlab.workers.dev. Could you provide more details on how to access or interpret this information?
The remaining clusters, i.e. all those not marked as dark, are the clusters with at least one annotation (Pfam, TIGRFAM).