
RCSB PDB like Sequence Clustering

Open zeynepabali opened this issue 4 years ago • 5 comments

Hi, I am not sure if this is the right place to ask this, but is there a set of options that recreates the clustering from the weekly sequence clustering of the PDB? For example, the one at this link: https://cdn.rcsb.org/resources/sequence/clusters/bc-100.out

zeynepabali, May 15 '21 16:05

AFAIK, the PDB is using an MMseqs2-based workflow, but I don't really know what they are doing. @martin-steinegger added some features at the request of the PDB team, so he might be able to put you in contact with the right people.

milot-mirdita, May 17 '21 12:05

Thank you very much. I will try to get in contact with him.

zeynepabali, May 21 '21 08:05

I was in contact for quite some time with Zukang Feng (https://www.rcsb.org/pages/team) from the PDB. I am actually not sure exactly what parameters they use at the moment. Maybe it would be good to contact him.

However, I remember that they replaced blastclust. blastclust uses connected component clustering, so you need to use --cluster-mode 1 in mmseqs:

mmseqs cluster pdb_seq_pr pdb_seq_pr_clu_s8_maxseqs1000 tmp_clu7 --cov-mode 0 -c 0.90 --min-seq-id 0.3 -s 7 --max-seqs 1000 --cluster-mode 1 -a
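To illustrate what connected component clustering means here, the following is a minimal Python sketch, not MMseqs2's actual implementation. It assumes you already have a list of sequence pairs that passed the identity and coverage thresholds; the chain IDs and the function name are hypothetical. Any two sequences linked by a chain of such pairs end up in the same cluster, even if they were never directly matched to each other.

```python
from collections import defaultdict

def connected_component_clusters(ids, edges):
    """Group ids so that any chain of edges puts its members in one cluster."""
    parent = {i: i for i in ids}  # union-find forest, each id starts as its own root

    def find(x):
        # Walk up to the root, shortening the path as we go (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    clusters = defaultdict(list)
    for i in ids:
        clusters[find(i)].append(i)
    # Largest clusters first, members sorted for a stable result.
    return sorted((sorted(m) for m in clusters.values()),
                  key=lambda m: (-len(m), m))

# Hypothetical chains and hypothetical above-threshold pairs.
ids = ["1ABC_A", "2DEF_A", "3GHI_B", "4JKL_A"]
edges = [("1ABC_A", "2DEF_A"), ("2DEF_A", "3GHI_B")]
clusters = connected_component_clusters(ids, edges)
print(clusters)
```

In this toy input, 1ABC_A and 3GHI_B share a cluster only through 2DEF_A, which is the transitive behavior that distinguishes this mode from representative-based clustering.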

martin-steinegger, May 21 '21 08:05

Hello, have you maybe figured this out?

ZanHP, Jan 05 '22 14:01

This is what is used internally at RCSB PDB (with a few different thresholds for sequence identity):

mmseqs easy-cluster pdb_protein_sequence.fasta-A.gz session --min-seq-id 0.3 -c 0.9 -s 8 --max-seqs 1000 --cluster-mode 1
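For comparing the result against a file such as bc-100.out, a rough Python sketch along these lines might help. It assumes that easy-cluster writes a two-column, tab-separated file named after the output prefix (here that would be session_cluster.tsv) with a representative and a member on each row, and that the RCSB files list one cluster per line with space-separated members, largest cluster first. Both layouts are assumptions worth checking against your own files, and the function name and example IDs are hypothetical.

```python
import io

def tsv_to_cluster_lines(handle):
    """Turn representative<TAB>member rows into one line per cluster."""
    clusters = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        rep, member = line.split("\t")[:2]
        clusters.setdefault(rep, []).append(member)
    # Assumed ordering: largest clusters first (sorted() keeps ties in input order).
    ordered = sorted(clusters.values(), key=len, reverse=True)
    return [" ".join(members) for members in ordered]

# Tiny in-memory stand-in for the TSV; replace with open("session_cluster.tsv")
# if your output file follows the assumed naming.
example = io.StringIO("A_1\tA_1\nA_1\tB_1\nC_1\tC_1\n")
lines = tsv_to_cluster_lines(example)
print("\n".join(lines))
```

Because the choice of representative and the member order can differ between runs, comparing clusters as sets of members is likely more robust than diffing the two files line by line.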

josemduarte, Jan 05 '22 18:01