
The training speed is very slow


Hi, I'm trying to train OpenFold without the distillation datasets, but I found that the training speed is very slow. The table below shows my measurements.

| Hardware | Training time (one epoch) | Steps per epoch |
| --- | --- | --- |
| 8 A100 (40GB) | 4 hours | 1250 |
| 4 * 8 A100 (40GB) | 1 hour | 313 |

The total number of "initial_training" epochs is 50,000. Assuming a linear speedup, the training time would still be more than 500 days with 128 cards, so I think I may have made some mistakes. The following are my startup script and DeepSpeed configuration; I haven't made any changes to the code. [screenshots: startup script and DeepSpeed configuration]

llwx593 · Aug 11 '22

  1. Increase the train_epoch_len from the default to avoid running validation all the time
  2. You're training in full precision. To train in half precision, make sure to pass --precision bf16 in addition to specifying that in the DeepSpeed config.
  3. Look into using the alignment_index data format instead of hundreds of thousands of alignment files on disk (see scripts/alignment_dbs and the accompanying note in the README).
  4. Make sure you're allocating many (8-16) CPU cores per GPU to maximize data throughput.

LMK how that goes.
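
For reference, a minimal sketch of how points 1 and 2 might be applied on the command line, assuming train_openfold.py exposes a --train_epoch_len argument and the usual Lightning --precision flag; the paths and the epoch length are placeholders, and bf16 also has to be enabled in deepspeed_config.json:

# Illustrative only: lengthen the "epoch" so validation runs less often,
# and request bf16 mixed precision on top of the DeepSpeed config.
python3 train_openfold.py /path/to/mmcif_dir /path/to/alignment_dir \
    /path/to/template_mmcif_dir /path/to/output_dir 2021-10-10 \
    --config_preset initial_training \
    --deepspeed_config_path deepspeed_config.json \
    --precision bf16 \
    --train_epoch_len 10000 \
    --gpus 8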

gahdritz · Aug 11 '22

Hello @gahdritz, could you provide a sample training command for multi-machine, multi-GPU distributed training with OpenFold? I use torch.distributed.launch with the PyTorch DDP framework, whereas in this issue @llwx593 launches with DeepSpeed. [screenshot] The performance I get training on 4 * 8 A100 (80GB) under the PyTorch DDP framework: [screenshot]
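
For context, a generic multi-node torch.distributed.launch invocation looks roughly like the following; it has to be run once on every node, train_ddp.py is a hypothetical stand-in for the actual DDP entry point, and the rendezvous address and port are placeholders:

# Hypothetical sketch of a 4-node x 8-GPU launch; run on each node with its own NODE_RANK.
# torch.distributed.launch passes --local_rank to the script unless --use_env is given.
python3 -m torch.distributed.launch \
    --nproc_per_node=8 \
    --nnodes=4 \
    --node_rank=${NODE_RANK} \
    --master_addr=${MASTER_ADDR} \
    --master_port=29500 \
    train_ddp.py <training args>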

zhangyuxuan1996 · Aug 13 '22


Thanks @gahdritz. I found that I had miscalculated the total number of epochs.

llwx593 · Sep 27 '22

@zhangyuxuan1996 have you modified any hyperparameters? I ran with the default settings using 8 A100 (80GB) but got this: [screenshots] I would truly appreciate it if you could give me some advice.

player1321 · Jan 26 '23

@player1321 Could you share your training command, training data, etc.?

gahdritz · Jan 29 '23

@gahdritz I'm using RODA data, and here is the training command:

mmcif_dir=/alphafold_data/pdb_mmcif/mmcif_files/
alignment_dir=/openfold_data/pdb/
template_mmcif_dir=/alphafold_data/pdb_mmcif/mmcif_files/
validation_data_dir=/cameo_data/data_dir/
validation_alignment_dir=/cameo_data/alignment_dir/
distillation_data_dir=/openfold_data/uniclust30_data/
distillation_alignment_dir=/openfold_data/uniclust30_alignment/
output_dir=results/

python3 train_openfold.py ${mmcif_dir} ${alignment_dir} ${template_mmcif_dir} ${output_dir} \
    2021-10-10 \
    --template_release_dates_cache_path mmcif_cache.json \
    --precision bf16 \
    --gpus 8 \
    --replace_sampler_ddp=True \
    --seed 4242022 \
    --deepspeed_config_path deepspeed_config.json \
    --checkpoint_every_epoch \
    --train_chain_data_cache_path chain_data_cache.json \
    --obsolete_pdbs_file_path /data/biomed/alphafold/pdb_mmcif/obsolete.dat \
    --log_lr \
    --config_preset initial_training \
    --val_data_dir ${validation_data_dir} \
    --val_alignment_dir ${validation_alignment_dir} \
    --distillation_data_dir ${distillation_data_dir} \
    --distillation_alignment_dir ${distillation_alignment_dir} \
    --distillation_chain_data_cache_path uniclust30_chain_data_cache.json

uniclust30_chain_data_cache.json was generated by:

python3 scripts/generate_chain_data_cache.py \
   /openfold_data/uniclust30_data/ \
   uniclust30_chain_data_cache.json \
   --no_workers 128

player1321 · Jan 30 '23

Could you share the file structure of /openfold_data/pdb/?

gahdritz · Jan 31 '23

@gahdritz the file structure is

pdb
├── 7bkp_A
│   ├── a3m
│   │   ├── bfd_uniclust_hits.a3m
│   │   ├── mgnify_hits.a3m
│   │   └── uniref90_hits.a3m
│   └── hhr
│       └── pdb70_hits.hhr
...

In RODA it's s3://openfold/pdb/7bkp_A
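
For anyone mirroring the same layout locally, a sketch of pulling one chain's alignments from that bucket with the AWS CLI (assuming unsigned public access is permitted, as it generally is for RODA buckets):

# Hypothetical example: copy the 7bkp_A alignment directory into a local pdb/ tree.
aws s3 sync --no-sign-request s3://openfold/pdb/7bkp_A ./pdb/7bkp_A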

player1321 · Jan 31 '23

@gahdritz One more question: I found that you use less dropout in the model: [screenshot] The original AlphaFold 2 applies dropout before each residual addition, while you only apply it after the triangle update modules, the triangle self-attention modules, and row-wise gated self-attention; that is, you removed four dropout ops per Evoformer layer. Is that correct? If so, would it contribute to faster convergence?

player1321 · Jan 31 '23

I see what's happening. Check out my response on #258. You need to reformat the OpenProteinSet data a little bit in order to feed it to OpenFold. Right now you're effectively training without MSAs, which explains the low performance.

gahdritz · Feb 02 '23