protein-sequence-diffusion-model

Collaborate

Open Amelie-Schreiber opened this issue 2 years ago • 7 comments

I'm very interested in replicating your work and would like to train a diffusion model to generate protein binding partners similar to what RFDiffusion accomplishes, but I would like to use ESM-2 models as you have done. If you are open to collaborating, feel free to reach out if you have the time. Also, would you be able to create a tutorial similar to this?

Amelie-Schreiber avatar Sep 01 '23 00:09 Amelie-Schreiber

Hi there! I am open to collaborating on interesting work. Would you like to discuss your ideas and implementation details with me?

best, zhangzhi

pengzhangzhi avatar Sep 01 '23 04:09 pengzhangzhi

Hi, I am relatively new to training diffusion models. I have only fine-tuned ESM-2 models for sequence classification and for token classification. Are you using EsmForProteinFolding as the backbone in your diffusion model? If so, I don't believe I have access to a good enough GPU to train it. My GPUs are too small unless a smaller ESM-2 model can be used as the backbone; otherwise I am stuck and unable to train. I hope that I am wrong. I am also having trouble understanding your code and was hoping we might work on writing a notebook similar to this: https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/training_example.ipynb

Thanks for responding! Amelie

Amelie-Schreiber avatar Sep 01 '23 07:09 Amelie-Schreiber

Hi, the training is pretty cheap. I can fit the model on a 10 GB GPU. Regarding the documentation, please follow the README to install the packages and train the model. Please let me know which parts confuse you.

best, Zhangzhi

pengzhangzhi avatar Sep 01 '23 14:09 pengzhangzhi

Could you find me on Discord? Also, could I use Hugging Face Accelerate for data parallelism, to split training across two 8 GB GPUs? If so, that might work...

EDIT: I've tried training on a P100 GPU (using a Colab instance) and it doesn't seem to work. My training script must not be set up correctly or something.

Amelie-Schreiber avatar Sep 01 '23 22:09 Amelie-Schreiber

Hi,

  • I don't have Discord, sorry.
  • I have not tested the code on 8 GB GPUs. Reducing the batch size should lower memory consumption enough to fit in the 8 GB you have.
  • I use Accelerate a lot; it is very simple and easy to use. Data parallelism may work in that case.
  • You can call Python scripts and functions from a notebook.
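
A minimal sketch of the batch-size trade-off mentioned above, not code from this repository. When the per-GPU batch is reduced to fit in memory, gradient accumulation can keep the number of samples per optimizer step roughly constant. The function name `accumulation_steps` and all numbers are hypothetical, chosen only for illustration.

```python
import math


def accumulation_steps(target_batch: int, per_device_batch: int, num_devices: int = 1) -> int:
    """Gradient-accumulation steps needed to reach target_batch samples per optimizer step."""
    if min(target_batch, per_device_batch, num_devices) < 1:
        raise ValueError("all arguments must be positive integers")
    # Samples processed in one forward/backward pass across all devices.
    per_step = per_device_batch * num_devices
    # Round up so the effective batch is never smaller than the target.
    return max(1, math.ceil(target_batch / per_step))


# Hypothetical setup: target 32 samples per update, 8 samples on each of two GPUs.
steps = accumulation_steps(target_batch=32, per_device_batch=8, num_devices=2)
print(steps)
```

With PyTorch Lightning, the resulting value could be passed as the Trainer's accumulate_grad_batches argument; whether pl_train.py exposes such an option is an assumption to check against the script itself.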

pengzhangzhi avatar Sep 03 '23 21:09 pengzhangzhi

Hi! I tried following the install instructions and I am having some issues. First, there seems to be a mistake in the install instructions. I believe you need

cd protein-sequence-diffusion-model

instead of

cd denoising_diffusion_protein_sequence

Also, once everything is installed, I am getting the following error:

(esm2d) C:\Users\OWO\Desktop\amelie_vscode\esmd\protein-sequence-diffusion-model\denoising_diffusion_pytorch>python pl_train.py --max_epochs 1 --fas_dpath seq_data/fas
C:\Users\OWO\anaconda3\envs\esm2d\lib\site-packages\Bio\pairwise2.py:278: BiopythonDeprecationWarning: Bio.pairwise2 has been deprecated, and we intend to remove it in a future release of Biopython. As an alternative, please consider using Bio.Align.PairwiseAligner as a replacement, and contact the Biopython developers if you still need the Bio.pairwise2 module.
  warnings.warn(
C:\Users\OWO\anaconda3\envs\esm2d\lib\site-packages\torchaudio\backend\utils.py:74: UserWarning: No audio backend is available.
  warnings.warn("No audio backend is available.")
seq_data/fas\seqs.a3m already exists.
Traceback (most recent call last):
  File "C:\Users\OWO\Desktop\amelie_vscode\esmd\protein-sequence-diffusion-model\denoising_diffusion_pytorch\pl_train.py", line 205, in <module>
    train(args)
  File "C:\Users\OWO\Desktop\amelie_vscode\esmd\protein-sequence-diffusion-model\denoising_diffusion_pytorch\pl_train.py", line 187, in train
    trainer = pl.Trainer(
  File "C:\Users\OWO\anaconda3\envs\esm2d\lib\site-packages\pytorch_lightning\utilities\argparse.py", line 70, in insert_env_defaults
    return fn(self, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'gpus'

Amelie-Schreiber avatar Sep 06 '23 04:09 Amelie-Schreiber

I think the error occurs because your PyTorch Lightning version is newer and no longer accepts gpus as a Trainer argument. Please set accelerator="auto" instead: https://lightning.ai/docs/pytorch/stable/common/trainer.html

For example, use trainer = pl.Trainer(max_epochs=20, accelerator="auto"). Ref: https://stackoverflow.com/a/76193000
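
A rough sketch of how a training script could choose Trainer arguments that work on either side of this change. The helper `trainer_kwargs` is hypothetical and is not part of pl_train.py; it only builds a dictionary, so it runs without PyTorch Lightning installed. It assumes the installed version string starts with the major version number.

```python
def trainer_kwargs(pl_version: str, max_epochs: int = 20) -> dict:
    """Build keyword arguments for pl.Trainer based on the installed major version."""
    major = int(pl_version.split(".")[0])
    kwargs = {"max_epochs": max_epochs}
    if major >= 2:
        # Newer releases reject `gpus`; let Lightning pick the hardware instead.
        kwargs["accelerator"] = "auto"
        kwargs["devices"] = "auto"
    else:
        # Older 1.x releases still accept the `gpus` argument.
        kwargs["gpus"] = 1
    return kwargs


print(trainer_kwargs("2.0.8", max_epochs=1))

# Intended use inside a training script (requires pytorch_lightning):
#   import pytorch_lightning as pl
#   trainer = pl.Trainer(**trainer_kwargs(pl.__version__, max_epochs=1))
```

If only recent versions need to be supported, passing accelerator="auto" directly, as suggested above, is simpler.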

pengzhangzhi avatar Sep 06 '23 14:09 pengzhangzhi