AlphaFold2 protein embedding Notebook as example
Good morning,
I would like to propose sharing another extremely useful example of AF2 usage. Many scientists use protein embeddings for downstream tasks (e.g. function prediction). An AF2 issue described the code path that gives access to the protein embedding vector, but many users are not able to handle it by themselves.
I hope you will consider my idea of demonstrating how to load and prepare a minimal AF2 setup that executes the embedding part of the workflow on Colab or a local machine. The most useful example would take an AA sequence as input and return a fixed-length numerical vector (the averaged residue vector) as output.
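The requested input/output shape can be sketched in plain NumPy; here a random array stands in for the per-residue representation that the real AF2 pipeline would produce (384 is AF2's single-representation channel count):

```python
import numpy as np

# Stand-in for AF2's per-residue "single" representation of a length-L
# sequence; in the real pipeline this comes out of the Evoformer stack.
L = 120   # number of residues (illustrative)
C = 384   # AF2 single-representation channel count
per_residue = np.random.rand(L, C).astype(np.float32)

# Fixed-length protein embedding: average over the residue axis.
protein_embedding = per_residue.mean(axis=0)
```

The result has shape `(384,)` regardless of sequence length, which is exactly what downstream classifiers need.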
Warm regards, Piotr
Hi @ptynecki. I have solved this. A notebook for an older version of ColabFold is available on my GitHub.
I have also edited some code in the current notebooks to extract the embeddings and to do batch runs (generating embeddings for thousands of sequences at once). I will upload them soon.
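The bookkeeping side of such a batch run can be sketched as below; `embed_sequence` is a hypothetical placeholder for whatever single call produces the per-residue representation (AF2, ColabFold, or ESM), and the loop just reads a FASTA file and saves one `.npy` vector per record:

```python
import os
import numpy as np

def read_fasta(path):
    """Minimal FASTA parser yielding (header, sequence) pairs."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            elif line:
                seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def embed_sequence(seq):
    # Hypothetical placeholder: swap in the real model call that returns
    # an (L, C) per-residue representation for the sequence.
    return np.zeros((len(seq), 384), dtype=np.float32)

def batch_embed(fasta_path, out_dir):
    """Embed every FASTA record and save one fixed-length vector
    (mean over residues) per sequence as a .npy file."""
    os.makedirs(out_dir, exist_ok=True)
    for header, seq in read_fasta(fasta_path):
        vec = embed_sequence(seq).mean(axis=0)
        np.save(os.path.join(out_dir, header.split()[0] + ".npy"), vec)
```

Saving each vector as it is produced keeps memory flat even for thousands of sequences.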
Also, I would recommend using ESM models for the embeddings. In my experience, AlphaFold embeddings are better without any doubt, but the time taken to generate them is a bottleneck. For my downstream task, ESM doesn't lag far behind AF2.
Thanks, Shashank
Hello, is there any update on the script for doing batch runs (generating embeddings for thousands of sequences at once)? I would really appreciate the effort. Thanks in advance.
Hello, is there any update on the script for batch runs (generating embeddings for thousands of sequences at once)? I really appreciate your effort. Thanks in advance. Hello, has this problem been solved?
Hi @xinformatics, do you have an update on your comment about the runtime needed to get the AF2 embeddings? I have noticed, although not in a consistent benchmark, that the runtime of colabfold_batch while exposing the embeddings (--save-single-representations) is considerably larger than a run without saving them. Is there a reason for that behavior? Are there suggestions to speed up the calculation of the embeddings? Thanks!
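For context, consuming the saved files afterwards is cheap; a minimal sketch, assuming the run wrote per-residue `.npy` arrays of shape `(L, C)` and that the filenames contain `single_repr` (the exact pattern is an assumption and may differ between ColabFold versions):

```python
import glob
import os
import numpy as np

def pool_saved_representations(out_dir):
    """Mean-pool each saved single representation in a colabfold_batch
    output directory into one fixed-length vector per file.

    Assumes files matching *single_repr*.npy, each an (L, C) per-residue
    array; the filename pattern may vary between ColabFold versions.
    """
    pooled = {}
    for path in sorted(glob.glob(os.path.join(out_dir, "*single_repr*.npy"))):
        reps = np.load(path)  # (L, C) per-residue representation
        pooled[os.path.basename(path)] = reps.mean(axis=0)
    return pooled
```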
Hi @cvyaru, I do not actively work in this area, so anything I say might be out of date. I suspect that enabling the return-representations option forces those O(L²) tensors in the representation to stay in memory and be copied off-device, which may be why it is slow.
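A back-of-the-envelope check of that O(L²) footprint (illustrative numbers, assuming AF2's pair representation with 128 channels stored in float32):

```python
# Size of the pair representation for a single length-L protein:
# L * L * channels * bytes_per_float
L = 1000          # residues
channels = 128    # AF2 pair-representation channel count
pair_bytes = L * L * channels * 4  # float32 = 4 bytes

print(pair_bytes / 2**20)  # ≈ 488 MiB kept resident and copied off-device
```

Roughly half a gigabyte per 1000-residue protein, before any batching, is consistent with saving representations being noticeably slower.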