
About Initial Training and Finetune Training Speed

GuoxiaWang opened this issue on May 16, 2022 · 3 comments

Environment:

pytorch-lightning==1.5.10
torch==1.10.0+cu113
deepspeed==0.5.9
CUDA Version: 11.2.142
Driver Version: 460.32.03
cuDNN: 8.2.0 (torch.backends.cudnn.version() == 8200)
NCCL: 2.10.3 (torch.cuda.nccl.version() == (2, 10, 3))

Git commit: 24118e804535515398e77010e11b1312f6ff1b62 (Fri May 13 13:59:38 2022 -0400)

Install command:

git clone https://github.com/aqlaboratory/openfold.git
python setup.py develop

Training Scripts:

# --config_preset initial_training or finetuning
# only one sample in training dataset
# note: template_mmcif_dir is equal to mmcif_dir
python train_openfold.py data/mmcif/ data/alignment/ data/mmcif/ output_dir \
    2021-10-10 \
    --config_preset finetuning \
    --template_release_dates_cache_path data/mmcif_cache.json \
    --precision bf16 \
    --deepspeed_config_path deepspeed_config.json \
    --gpus 1 \
    --seed 42 \
    --checkpoint_every_epoch \
    --train_chain_data_cache_path data/chain_data_cache.json \
    --train_mapping_path data/protein_id.mapping

The performance results:

| config_preset | DType | GPUs (40G) | Step | Time (s/step) | Min Memory | Max Memory |
| --- | --- | --- | --- | --- | --- | --- |
| initial_training | BF16 | 1*A100 | 1000 | 10.28 | 5162MiB / 40536MiB | 10960MiB / 40536MiB |
| finetuning | BF16 | 1*A100 | 340 | 40.82 | 28040MiB / 40536MiB | 39628MiB / 40536MiB |

initial_training screenshot: [image not included]

finetuning screenshot: [image not included]

Can you confirm whether these speeds are expected?

GuoxiaWang, May 16 '22 14:05

These times are a lot slower than what I'm seeing on our A100s. I'm getting ~8.9 seconds per initial_training iteration during our current 40 GPU training run. Single-GPU times are faster, closer to 8 seconds per iteration. Single-GPU finetuning iterations take about 20-25 seconds each.

I'm effectively using the same command as you, just with the --benchmark flag, which enables cuDNN benchmarking and purportedly speeds up the network a little.
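
For reference, cuDNN benchmarking is a one-line PyTorch switch; here is a minimal sketch of what a flag like --benchmark typically toggles (I haven't traced the exact wiring in train_openfold.py):

```python
import torch

# Let cuDNN try several convolution algorithms on the first call and cache
# the fastest one; this helps when input shapes stay mostly static.
torch.backends.cudnn.benchmark = True
```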

I have

pytorch-lightning==1.5.10
torch==1.10.1+cu113
deepspeed==0.5.9
CUDA Version: 11.4
Driver Version: 470.82.01
cuDNN: 8.2.4.15
NCCL: 2.10.3

Others have reported faster times still; see e.g. https://arxiv.org/pdf/2203.00854.pdf, where they claim times of 6.2s/20.7s for the two respective config settings on 128 A100s. A separate third party has also reported distributed A100 times between 6 and 7 seconds to me. I've personally been unable to reach such times.

Some questions:

  1. ~~Are you using the most recent OpenFold commit?~~ (just noticed you gave a hash)
  2. How many dataloader workers are you using? Data processing seems to be the closest thing to a bottleneck for us.
  3. If you have the time, could you run some simple profiles? E.g. the forward pass time, the time for generating each batch, or the default torch profiler (see the sketch below).
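
Something along these lines would be plenty (a minimal sketch; `model`, `batch`, and `dataloader` are placeholders for whatever your training setup produces, not OpenFold names):

```python
import time
import torch
from torch.profiler import profile, ProfilerActivity

# 1. Forward pass time; synchronize so the measurement covers the actual
# GPU work rather than just the kernel launches.
torch.cuda.synchronize()
start = time.perf_counter()
out = model(batch)
torch.cuda.synchronize()
print(f"forward: {time.perf_counter() - start:.2f}s")

# 2. Time to generate one batch, taken in isolation.
start = time.perf_counter()
batch = next(iter(dataloader))
print(f"batch generation: {time.perf_counter() - start:.2f}s")

# 3. The default torch profiler, for a kernel-level breakdown.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    out = model(batch)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```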

gahdritz, May 16 '22 19:05

@gahdritz

To answer your questions:

  2. I use the default config, so num_workers=16. I also tried saving a new_batch.pickle and using data_module = DummyDataLoader("new_batch.pickle") (see the sketch below), and got the same speed.
  3. I am new to OpenFold, so I need some time to add profiling code. Once I produce profiler files, I will paste them here.
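
For anyone reproducing this, roughly how I produced the pickled batch (a sketch; I'm assuming the data module's standard LightningDataModule interface, and that DummyDataLoader simply unpickles and replays one batch):

```python
import pickle

# Grab one real batch from the normal data module and freeze it to disk so
# later runs can replay it, taking dataloading out of the measurement.
# (data_module.setup() may need to be called first.)
batch = next(iter(data_module.train_dataloader()))
with open("new_batch.pickle", "wb") as f:
    pickle.dump(batch, f)
```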

I tried to reproduce your reported performance ("closer to 8 seconds per iteration" for initial_training, and 20-25 seconds per iteration for finetuning on a single GPU).

When I set max_recycling_iters=0 in config.py to disable recycling (sketch below), I got results much closer to yours; see the table that follows. Did you perhaps set max_recycling_iters=0 when measuring performance?
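
Concretely, the override looks like this (a sketch; I'm assuming the key path data.common.max_recycling_iters, which is the value I edited in openfold/config.py):

```python
from openfold.config import model_config

# Build the finetuning preset and zero out recycling for a timing run;
# equivalent to editing max_recycling_iters in openfold/config.py.
config = model_config("finetuning")
config.data.common.max_recycling_iters = 0
```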

| config_preset | DType | GPUs (40G) | max_recycling_iters | Step | Time (s/step) | Min Memory | Max Memory |
| --- | --- | --- | --- | --- | --- | --- | --- |
| initial_training | BF16 | 1*A100 | 3 | 1000 | 10.28 | 5162MiB / 40536MiB | 10960MiB / 40536MiB |
| finetuning | BF16 | 1*A100 | 3 | 340 | 40.82 | 28040MiB / 40536MiB | 39628MiB / 40536MiB |
| initial_training | BF16 | 1*A100 | 0 | 100 | 8.04 | - | - |
| finetuning | BF16 | 1*A100 | 0 | 100 | 27.60 | - | - |

initial_training screenshot: [image not included]

finetuning screenshot: [image not included]

GuoxiaWang, May 17 '22 08:05

Reducing the number of recycling iterations will make it faster, but my figures were obtained with the default setting, i.e. with max_recycling_iters = 3 and uniform_recycling=True (that second option randomly chooses a number of recycling iterations between 0 and the maximum, like AlphaFold does during training). How many CPU cores do you have available per GPU? I've noticed that my performance degrades a little when I have fewer CPUs than dataloader workers.
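
In pseudocode, that sampling behaves roughly like this (a sketch of the idea, not OpenFold's actual implementation; the function name is mine):

```python
import random

def sample_num_recycles(max_recycling_iters: int = 3,
                        uniform_recycling: bool = True) -> int:
    """Pick the number of recycling iterations for one training batch."""
    if uniform_recycling:
        # Uniform over {0, 1, ..., max_recycling_iters}, as AlphaFold does
        # during training.
        return random.randint(0, max_recycling_iters)
    return max_recycling_iters
```

With the default maximum of 3, that averages 1.5 recycles per step, i.e. 2.5 passes through the network instead of the 4 you get with a fixed max_recycling_iters=3, which helps explain why a fixed-recycling run comes out slower than figures measured with uniform recycling.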

gahdritz, May 25 '22 17:05