
Optimal clustering parameters

Open Sidduppal opened this issue 1 year ago • 1 comments

Hey, I'm trying protein clustering for the first time and need some advice. I would like to cluster bacterial proteins from 20 different metagenomes to obtain representative clusters across all samples. As a starting point, I noticed that both the AlphaFold2 BFD data and your preprint used a clustering approach with 30% sequence identity and 90% coverage. I am considering using these parameters as well, but I wanted to know whether they are suitable for my analysis or whether you would recommend other values.

Additionally, could you explain what "coverage" means in the context of clustering? From my understanding, it is the minimum fraction of a protein's sequence that must be aligned for that protein to be included in a cluster. Is that correct?
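To make the question concrete, here is a minimal sketch of the definition I have in mind. It is my assumption, not something taken from the tool's documentation: coverage is the aligned span divided by the sequence length, and the threshold may be applied to the query, the target, or both. The function names and the `mode` values are hypothetical.

```python
def alignment_coverage(aln_start: int, aln_end: int, seq_len: int) -> float:
    """Fraction of a sequence spanned by the alignment.

    Positions are 1-based and inclusive (an assumption for this sketch).
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    if not (1 <= aln_start <= aln_end <= seq_len):
        raise ValueError("alignment positions must lie within the sequence")
    return (aln_end - aln_start + 1) / seq_len


def passes_coverage(query_cov: float, target_cov: float,
                    min_cov: float = 0.9, mode: str = "both") -> bool:
    """Check a coverage threshold under three hypothetical modes."""
    if mode == "both":      # both sequences must be covered
        return query_cov >= min_cov and target_cov >= min_cov
    if mode == "query":     # only the query must be covered
        return query_cov >= min_cov
    if mode == "target":    # only the target must be covered
        return target_cov >= min_cov
    raise ValueError(f"unknown mode: {mode}")


# Example: a 300-residue query aligned over residues 11-290,
# against a 600-residue target aligned over residues 1-280.
q_cov = alignment_coverage(11, 290, 300)
t_cov = alignment_coverage(1, 280, 600)
print(passes_coverage(q_cov, t_cov, min_cov=0.9, mode="both"))
print(passes_coverage(q_cov, t_cov, min_cov=0.9, mode="query"))
```

If this is roughly right, then with a 90% threshold applied to both sequences, a short protein that matches only one domain of a longer protein would not end up in the same cluster.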

Thanks!

Sidduppal, Jun 08 '23 23:06