
About Initial Training and Finetune Training Speed

GuoxiaWang opened this issue on May 16, 2022 · 3 comments

Environment:

pytorch-lightning==1.5.10
torch==1.10.0+cu113
deepspeed==0.5.9
CUDA Version: 11.2.142
Driver Version: 460.32.03
cuDNN: 8.2.0 (torch.backends.cudnn.version() == 8200)
NCCL: 2.10.3 (torch.cuda.nccl.version() == (2, 10, 3))

Git commit: 24118e804535515398e77010e11b1312f6ff1b62 (Fri May 13 13:59:38 2022 -0400)

Install command:

git clone https://github.com/aqlaboratory/openfold.git
python setup.py develop

Training Scripts:

# --config_preset initial_training or finetuning
# only one sample in training dataset
# note: template_mmcif_dir is equal to mmcif_dir
python train_openfold.py data/mmcif/ data/alignment/ data/mmcif/ output_dir \
    2021-10-10 \
    --config_preset finetuning \
    --template_release_dates_cache_path data/mmcif_cache.json \
    --precision bf16 \
    --deepspeed_config_path deepspeed_config.json \
    --gpus 1 \
    --seed 42 \
    --checkpoint_every_epoch \
    --train_chain_data_cache_path data/chain_data_cache.json \
    --train_mapping_path data/protein_id.mapping

The performance results:

| config_preset | DType | GPUs (40G) | Step | Time (s/step) | Min Memory | Max Memory |
| --- | --- | --- | --- | --- | --- | --- |
| initial_training | BF16 | 1*A100 | 1000 | 10.28 | 5162MiB / 40536MiB | 10960MiB / 40536MiB |
| finetuning | BF16 | 1*A100 | 340 | 40.82 | 28040MiB / 40536MiB | 39628MiB / 40536MiB |

initial_training screenshot: [image not included]

finetuning screenshot: [image not included]

Can you confirm whether these speeds are expected?

GuoxiaWang, May 16 '22 14:05

These times are a lot slower than what I'm seeing on our A100s. I'm getting ~8.9 seconds per initial_training iteration during our current 40 GPU training run. Single-GPU times are faster, closer to 8 seconds per iteration. Single-GPU finetuning iterations take about 20-25 seconds each.

I'm effectively using the same command as you, just with the --benchmark flag, which enables cuDNN benchmarking and purportedly speeds up the network a little.
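
For reference, cuDNN benchmarking is a one-line PyTorch switch; here is a minimal sketch of what a flag like --benchmark typically toggles (I haven't traced the exact wiring in train_openfold.py):

```python
import torch

# Let cuDNN try several convolution algorithms on the first call and cache
# the fastest one; this helps when input shapes stay mostly static.
torch.backends.cudnn.benchmark = True
```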

I have

pytorch-lightning==1.5.10
torch==1.10.1+cu113
deepspeed==0.5.9
CUDA Version: 11.4
Driver Version: 470.82.01
cuDNN: 8.2.4.15
NCCL: 2.10.3

Others have reported faster times still; see e.g. https://arxiv.org/pdf/2203.00854.pdf, where they claim times of 6.2s/20.7s for the two respective config settings on 128 A100s. A separate third party has also reported distributed A100 times between 6 and 7 seconds to me. I've personally been unable to reach such times.

Some questions:

  1. ~~Are you using the most recent OpenFold commit?~~ (just noticed you gave a hash)
  2. How many dataloader workers are you using? Data processing seems to be the closest thing to a bottleneck for us.
  3. If you have the time, could you run some simple profiles? E.g. the forward pass time, the time for generating each batch, or the default torch profiler (see the sketch below).
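
Something along these lines would be plenty (a minimal sketch; `model`, `batch`, and `dataloader` are placeholders for whatever your training setup produces, not OpenFold names):

```python
import time
import torch
from torch.profiler import profile, ProfilerActivity

# 1. Forward pass time; synchronize so the measurement covers the actual
# GPU work rather than just the kernel launches.
torch.cuda.synchronize()
start = time.perf_counter()
out = model(batch)
torch.cuda.synchronize()
print(f"forward: {time.perf_counter() - start:.2f}s")

# 2. Time to generate one batch, taken in isolation.
start = time.perf_counter()
batch = next(iter(dataloader))
print(f"batch generation: {time.perf_counter() - start:.2f}s")

# 3. The default torch profiler, for a kernel-level breakdown.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    out = model(batch)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```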

gahdritz, May 16 '22 19:05

@gahdritz

To answer your questions:

  2. I use the default config, so num_workers=16. I also tried saving a new_batch.pickle and using data_module = DummyDataLoader("new_batch.pickle") (see the sketch below), and got the same speed.
  3. I am new to OpenFold, so I need some time to add profiling code. Once I produce profiler files, I will paste them here.
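
For anyone reproducing this, roughly how I produced the pickled batch (a sketch; I'm assuming the data module's standard LightningDataModule interface, and that DummyDataLoader simply unpickles and replays one batch):

```python
import pickle

# Grab one real batch from the normal data module and freeze it to disk so
# later runs can replay it, taking dataloading out of the measurement.
# (data_module.setup() may need to be called first.)
batch = next(iter(data_module.train_dataloader()))
with open("new_batch.pickle", "wb") as f:
    pickle.dump(batch, f)
```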

I tried to reproduce your reported performance ("closer to 8 seconds per iteration" for initial_training, and 20-25 seconds per iteration for finetuning on a single GPU).

When I set max_recycling_iters=0 in config.py to disable recycling (sketch below), I got results much closer to yours; see the table that follows. Did you perhaps set max_recycling_iters=0 when measuring performance?
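
Concretely, the override looks like this (a sketch; I'm assuming the key path data.common.max_recycling_iters, which is the value I edited in openfold/config.py):

```python
from openfold.config import model_config

# Build the finetuning preset and zero out recycling for a timing run;
# equivalent to editing max_recycling_iters in openfold/config.py.
config = model_config("finetuning")
config.data.common.max_recycling_iters = 0
```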

| config_preset | DType | GPUs (40G) | max_recycling_iters | Step | Time (s/step) | Min Memory | Max Memory |
| --- | --- | --- | --- | --- | --- | --- | --- |
| initial_training | BF16 | 1*A100 | 3 | 1000 | 10.28 | 5162MiB / 40536MiB | 10960MiB / 40536MiB |
| finetuning | BF16 | 1*A100 | 3 | 340 | 40.82 | 28040MiB / 40536MiB | 39628MiB / 40536MiB |
| initial_training | BF16 | 1*A100 | 0 | 100 | 8.04 | - | - |
| finetuning | BF16 | 1*A100 | 0 | 100 | 27.60 | - | - |

initial_training screenshot: [image not included]

finetuning screenshot: [image not included]

GuoxiaWang, May 17 '22 08:05

Reducing the number of recycling iterations will make it faster, but my figures were obtained with the default setting, i.e. with max_recycling_iters = 3 and uniform_recycling=True (that second option randomly chooses a number of recycling iterations between 0 and the maximum, like AlphaFold does during training). How many CPU cores do you have available per GPU? I've noticed that my performance degrades a little when I have fewer CPUs than dataloader workers.
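
In pseudocode, that sampling behaves roughly like this (a sketch of the idea, not OpenFold's actual implementation; the function name is mine):

```python
import random

def sample_num_recycles(max_recycling_iters: int = 3,
                        uniform_recycling: bool = True) -> int:
    """Pick the number of recycling iterations for one training batch."""
    if uniform_recycling:
        # Uniform over {0, 1, ..., max_recycling_iters}, as AlphaFold does
        # during training.
        return random.randint(0, max_recycling_iters)
    return max_recycling_iters
```

With the default maximum of 3, that averages 1.5 recycles per step, i.e. 2.5 passes through the network instead of the 4 you get with a fixed max_recycling_iters=3, which helps explain why a fixed-recycling run comes out slower than figures measured with uniform recycling.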

gahdritz, May 25 '22 17:05