The training speed is very slow
Hi, I'm trying to train OpenFold without the distillation datasets, but I found that the training speed is very slow. The table below shows my measurements.
| Hardware | Training time (one epoch) | Steps per epoch |
|---|---|---|
| 8 × A100 (40GB) | 4 hours | 1250 |
| 4 × 8 A100 (40GB) | 1 hour | 313 |
The total number of epochs for "initial_training" is 50,000. Assuming a linear speedup, the training time would still be more than 500 days even with 128 cards (rough arithmetic below), so I think I may have made some mistakes. My startup script and DeepSpeed configuration are below; I haven't made any changes to the code.
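For reference, here is the back-of-the-envelope arithmetic behind that estimate, assuming perfectly linear scaling from 8 to 128 GPUs:

```bash
# 4 h/epoch on 8 GPUs -> 0.25 h/epoch on 128 GPUs (assuming linear scaling), 50000 epochs total
echo $(( 50000 * 4 * 8 / 128 / 24 ))  # ~520 days
```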
- Increase `train_epoch_len` from the default to avoid running validation all the time.
- You're training in full precision. To train in half precision, make sure to pass `--precision bf16` in addition to specifying that in the DeepSpeed config.
- Look into using the `alignment_index` data format instead of hundreds of thousands of alignment files on disk (see `scripts/alignment_dbs` and the accompanying note in the README).
- Make sure you're allocating many (8-16) CPU cores per GPU to maximize data throughput.
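Concretely, an invocation along those lines might look something like this (an illustrative sketch, not a verified command; double-check the flag names against `python3 train_openfold.py --help`, and the `--train_epoch_len` value here is just an example):

```bash
# Illustrative sketch only -- verify flags against `python3 train_openfold.py --help`.
# --train_epoch_len: raise it above the default so validation runs less often.
# --precision bf16: half precision (also set bf16 in deepspeed_config.json).
python3 train_openfold.py mmcif_dir/ alignment_dir/ template_mmcif_dir/ output_dir/ \
    2021-10-10 \
    --config_preset initial_training \
    --deepspeed_config_path deepspeed_config.json \
    --precision bf16 \
    --train_epoch_len 10000 \
    --gpus 8
```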
LMK how that goes.
Hello @gahdritz, could you provide a sample training command for multi-machine, multi-GPU distributed training with OpenFold? I use torch.distributed.launch with the PyTorch DDP framework, whereas in this issue @llwx593 uses DeepSpeed to launch.
This is the performance I get training on 4 × 8 A100 80GB with the PyTorch DDP framework:
Thanks. I found that I miscalculated the total number of epochs.
@zhangyuxuan1996 Have you modified any hyperparameters? I ran with the default settings using 8 A100 80GB but got this:
I would truly appreciate it if you could give me some advice.
@player1321 Could you share your training command, training data, etc.?
@gahdritz I'm using RODA data, and here is the training command:
```bash
mmcif_dir=/alphafold_data/pdb_mmcif/mmcif_files/
alignment_dir=/openfold_data/pdb/
template_mmcif_dir=/alphafold_data/pdb_mmcif/mmcif_files/
validation_data_dir=/cameo_data/data_dir/
validation_alignment_dir=/cameo_data/alignment_dir/
distillation_data_dir=/openfold_data/uniclust30_data/
distillation_alignment_dir=/openfold_data/uniclust30_alignment/
output_dir=results/

python3 train_openfold.py ${mmcif_dir} ${alignment_dir} ${template_mmcif_dir} ${output_dir} \
    2021-10-10 \
    --template_release_dates_cache_path mmcif_cache.json \
    --precision bf16 \
    --gpus 8 \
    --replace_sampler_ddp=True \
    --seed 4242022 \
    --deepspeed_config_path deepspeed_config.json \
    --checkpoint_every_epoch \
    --train_chain_data_cache_path chain_data_cache.json \
    --obsolete_pdbs_file_path /data/biomed/alphafold/pdb_mmcif/obsolete.dat \
    --log_lr \
    --config_preset initial_training \
    --val_data_dir ${validation_data_dir} \
    --val_alignment_dir ${validation_alignment_dir} \
    --distillation_data_dir ${distillation_data_dir} \
    --distillation_alignment_dir ${distillation_alignment_dir} \
    --distillation_chain_data_cache_path uniclust30_chain_data_cache.json
```
`uniclust30_chain_data_cache.json` was generated by:

```bash
python3 scripts/generate_chain_data_cache.py \
    /openfold_data/uniclust30_data/ \
    uniclust30_chain_data_cache.json \
    --no_workers 128
```
Could you share the file structure of `/openfold_data/pdb/`?
@gahdritz the file structure is

```
pdb
├── 7bkp_A
│   ├── a3m
│   │   ├── bfd_uniclust_hits.a3m
│   │   ├── mgnify_hits.a3m
│   │   └── uniref90_hits.a3m
│   └── hhr
│       └── pdb70_hits.hhr
...
```
In RODA it's s3://openfold/pdb/7bkp_A
@gahdritz One more question: I noticed that you use less dropout in the model.
The original AlphaFold 2 applies a dropout op before each residual addition, while you only use dropout after the triangle update modules, the triangle self-attention modules, and the row-wise gated self-attention; that is, you removed four dropout ops per Evoformer layer. Is this correct? If so, would this contribute to faster convergence?
I see what's happening. Check out my response on #258. You need to reformat the OpenProteinSet data a little bit in order to feed it to OpenFold. Right now you're effectively training without MSAs, which explains the low performance.
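If I'm reading #258 correctly, the reformatting amounts to moving the alignment files out of the per-chain `a3m/` and `hhr/` subdirectories so that they sit directly under each chain directory (e.g. `pdb/7bkp_A/uniref90_hits.a3m`). A rough sketch, using the paths from earlier in this thread; test it on a copy of the data first:

```bash
# Flatten each chain directory: move *.a3m and *.hhr up one level, then drop
# the now-empty a3m/ and hhr/ subdirectories.
for chain_dir in /openfold_data/pdb/*/; do
    mv "${chain_dir}a3m/"*.a3m "${chain_dir}" 2>/dev/null
    mv "${chain_dir}hhr/"*.hhr "${chain_dir}" 2>/dev/null
    rmdir "${chain_dir}a3m" "${chain_dir}hhr" 2>/dev/null
done
```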