How to extract whole-protein structural representations with multi-chain input?
Dear authors,
Thank you for your excellent work on SaProt!
I have a question regarding how to extract the structural representation of a multi-chain protein using your model. Specifically, if a protein contains both chain A and chain B, should I concatenate the sequences of chain A and chain B directly as one input sequence?
I would greatly appreciate your guidance on the recommended way to handle multi-chain proteins for whole-structure representation extraction, especially for tasks such as protein-level classification or embedding.
Looking forward to your advice.
Hi,
I think the approach you mentioned, concatenating the sequences of the different chains into one and feeding it into SaProt, is feasible. Alternatively, you can obtain the embedding of each chain individually and then combine them at the embedding level (in this case you would additionally have to fine-tune the model for prediction).
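To illustrate the embedding-level option, here is a minimal sketch of one possible way to combine per-chain embeddings into a single protein-level vector. It assumes you have already obtained per-residue embeddings for each chain (for example from the model's last hidden layer) as plain lists of floats. The function names `mean_pool` and `combine_chains` are hypothetical and not part of SaProt, and length-weighted averaging is only one choice among several (concatenation or a learned pooling layer are others).

```python
from typing import Dict, List

Vector = List[float]


def mean_pool(residue_embeddings: List[Vector]) -> Vector:
    """Average per-residue vectors into one fixed-size chain vector."""
    n = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(vec[i] for vec in residue_embeddings) / n for i in range(dim)]


def combine_chains(chain_embeddings: Dict[str, List[Vector]],
                   weight_by_length: bool = True) -> Vector:
    """Combine per-chain embeddings into one protein-level vector.

    chain_embeddings maps a chain ID to its per-residue embeddings.
    With weight_by_length=True, longer chains contribute more, which is
    equivalent to averaging over all residues of all chains.
    """
    pooled = []
    weights = []
    for residues in chain_embeddings.values():
        pooled.append(mean_pool(residues))
        weights.append(len(residues) if weight_by_length else 1)
    total = sum(weights)
    dim = len(pooled[0])
    return [
        sum(w * vec[i] for w, vec in zip(weights, pooled)) / total
        for i in range(dim)
    ]


# Toy example: chain A has two residues, chain B has one (2-dim embeddings).
toy = {"A": [[1.0, 2.0], [3.0, 4.0]], "B": [[5.0, 6.0]]}
protein_vector = combine_chains(toy)
```

The resulting vector would then be passed to a prediction head trained on your task, which is why this route requires additional fine-tuning.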
Thank you very much for your response and for providing the pre-trained models. I’ve downloaded the pre-training data you shared, which I understand are all single-chain proteins predicted by AlphaFold2.
I noticed that you also provide a model trained on protein data from the PDB (SaProt_650M_PDB). May I ask how you handled proteins with multiple chains in the PDB dataset during pre-processing? Specifically, did you merge the chains, select a particular one, or use another strategy?
Thank you in advance for your time!
We split each multi-chain protein into single-chain proteins and then applied the same training strategy :)
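For anyone who wants to reproduce this kind of splitting on their own PDB files, below is a minimal sketch using only the Python standard library. It is not the authors' actual pre-processing script. It relies on the fixed-column PDB format, where the chain ID is in column 22 of each ATOM record, and keeps only the first model. The function name `split_pdb_by_chain` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def split_pdb_by_chain(pdb_text: str) -> Dict[str, List[str]]:
    """Group ATOM records of a PDB file by chain ID.

    Returns a dict mapping each chain ID to the list of its ATOM lines.
    Only the first model is read; HETATM records (ligands, waters) are skipped.
    """
    chains: Dict[str, List[str]] = defaultdict(list)
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):
            break  # stop after the first model of a multi-model file
        if line.startswith("ATOM") and len(line) > 21:
            chain_id = line[21]  # column 22 (1-based) holds the chain ID
            chains[chain_id].append(line)
    return dict(chains)


def write_chain_files(pdb_text: str, prefix: str) -> List[str]:
    """Write one single-chain PDB file per chain, e.g. '<prefix>_A.pdb'."""
    paths = []
    for chain_id, lines in split_pdb_by_chain(pdb_text).items():
        path = f"{prefix}_{chain_id}.pdb"
        with open(path, "w") as handle:
            handle.write("\n".join(lines) + "\nEND\n")
        paths.append(path)
    return paths
```

Each resulting single-chain file can then be processed in the same way as a single-chain structure.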
Thank you!