AlphaFold2 protein embedding Notebook as example

Open ptynecki opened this issue 4 years ago • 5 comments

Good morning,

I would like to propose that you share another extremely useful example of AF2 usage. Many scientists use protein embeddings for downstream tasks (e.g. function prediction). An AF2 issue describes the part of the codebase that gives access to the protein embedding vector, but many users are not able to handle it by themselves.

I hope you will consider my idea: demonstrate how to load and prepare a minimal AF2 setup that executes only the embedding part of the workflow, on Colab or on a local machine. The most useful example would probably take an amino acid (AA) sequence as input and return a fixed-length numerical vector as output (the per-residue vectors averaged over the sequence).
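The "averaged residue vector" described above can be sketched as mean pooling over the residue axis. This is a minimal illustration, not AF2 or ColabFold code: the helper name `mean_pool` is hypothetical, the random arrays are stand-ins for real per-residue embeddings, and the channel count of 384 is only a placeholder.

```python
import numpy as np


def mean_pool(per_residue: np.ndarray) -> np.ndarray:
    """Average an (L, C) per-residue array over L to get a length-C vector."""
    if per_residue.ndim != 2 or per_residue.shape[0] == 0:
        raise ValueError("expected a non-empty array of shape (L, C)")
    return per_residue.mean(axis=0)


# Stand-in data (assumption): two "proteins" of different length, same channel count.
rng = np.random.default_rng(0)
short_protein = rng.normal(size=(50, 384))
long_protein = rng.normal(size=(700, 384))

short_vec = mean_pool(short_protein)
long_vec = mean_pool(long_protein)
print(short_vec.shape, long_vec.shape)  # both vectors have the same length
```

Because the average runs over the residue axis, sequences of any length map to vectors of the same size, which is what most downstream classifiers expect.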

Warm regards, Piotr

ptynecki avatar Aug 09 '21 09:08 ptynecki

Hi @ptynecki. I have solved this. The notebook for an older version of ColabFold is available on my GitHub.

I have also edited some code in the current notebooks to extract the embeddings and to do a batch run (generate embeddings for thousands of sequences at once). I will upload them soon.

Also, I would recommend using ESM models for the embeddings. In my experience, AlphaFold embeddings are better without any doubt, but the time taken to generate them is a bottleneck. For my downstream task, ESM doesn't lag much behind AF2.

Thanks, Shashank

xinformatics avatar Aug 13 '21 14:08 xinformatics

Hello, is there any update on the script for doing a batch run (generating embeddings for thousands of sequences at once)? I would really appreciate the effort. Thanks in advance.

MdSaifulIslamSajol avatar Aug 06 '24 18:08 MdSaifulIslamSajol

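Hello, is there any update on the script for batch runs (generating embeddings for thousands of sequences at once)? I really appreciate your effort. Thanks in advance. Hello, has this problem been solved?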

GUMI-QXP avatar Oct 12 '24 03:10 GUMI-QXP

Hi @xinformatics, would you have an update on your comment about the runtime needed to get the AF2 embeddings? I have noticed, although not in a consistent benchmark, that the runtime of ColabFold_batch when exposing the embedding (--save-single-representations) is considerably larger than that of a run that does not save them. Is there a reason for that behavior? Are there suggestions for speeding up the calculation of the embeddings? Thanks!
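For context, once per-residue arrays have been written to disk they can be reduced to fixed-length vectors in one pass. The sketch below rests on assumptions: it supposes each saved representation is a NumPy `.npy` file holding an array of shape (L, C), and both the glob pattern `*single_repr*.npy` and the helper name `pool_directory` are hypothetical, so adjust them to whatever your run actually writes. It does not address the runtime question itself.

```python
from pathlib import Path

import numpy as np


def pool_directory(result_dir, pattern="*single_repr*.npy"):
    """Mean-pool every matching (L, C) array found in result_dir.

    The default pattern is a guess at the file naming, not a documented contract.
    Returns a dict mapping each file stem to its length-C vector.
    """
    pooled = {}
    for path in sorted(Path(result_dir).glob(pattern)):
        array = np.load(path)
        if array.ndim != 2:
            continue  # skip anything that is not a per-residue matrix
        pooled[path.stem] = array.mean(axis=0)
    return pooled
```

A call such as `vectors = pool_directory("results")` (directory name hypothetical) would then give one vector per saved file, ready to stack into a feature matrix.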

cvyaru avatar Jun 26 '25 19:06 cvyaru

Hi @cvyaru, I do not actively work in this area, so anything I say might be old news. I suspect that turning on the return-representations option forces the O(L²) tensors in the representations to stay in memory and be copied off-device. That may be the reason it is slow.
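To make the O(L²) point concrete, here is a rough back-of-the-envelope calculation. The channel widths (128 for the pair representation, 384 for the single representation) and the 4-byte float32 values are assumptions, and the helper name `representation_bytes` is hypothetical. It only estimates tensor sizes and does not measure ColabFold itself.

```python
def representation_bytes(length, pair_channels=128, single_channels=384,
                         bytes_per_value=4):
    """Return (pair_bytes, single_bytes) for a sequence of the given length.

    Channel widths and float32 storage are assumptions, not measured values.
    """
    pair = length * length * pair_channels * bytes_per_value  # grows as L^2
    single = length * single_channels * bytes_per_value       # grows as L
    return pair, single


for length in (100, 500, 1000, 2000):
    pair, single = representation_bytes(length)
    print(f"L={length:5d}  pair={pair / 1e6:9.1f} MB  single={single / 1e6:6.2f} MB")
```

Under these assumptions the pair tensor for L = 1000 is about 512 MB against roughly 1.5 MB for the single one, and doubling L quadruples the pair tensor, so any extra copy of it would be far more expensive than copying the single representation.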

xinformatics avatar Jul 12 '25 14:07 xinformatics