
Distributed training and `max_recycling_iters`

Open pujaltes opened this issue 4 months ago • 0 comments

I have a question regarding the number of recycling iterations used during training. In the AF2 paper they mention that the number of recycling iterations is a "shared value across the batch". However, from what I can tell, batch-level attributes during distributed training are actually defined at the micro-batch level here:

https://github.com/aqlaboratory/openfold/blob/ef0c9face788001b1624b3b2dbfa951072841835/openfold/data/data_modules.py#L800-L836

From my understanding, in both DDP and DeepSpeed each batch is split into micro-batches that are each sent to one GPU. The issue is that the batch splitting occurs in the `DistributedSampler` before the data even reaches the `OpenFoldDataLoader`. Ergo, all of these properties that should be fixed at the batch level are actually defined at the micro-batch level, meaning that each GPU process could be running a different number of recycling iterations. Please let me know if I am reading this incorrectly, but apart from not matching the paper, wouldn't this be extremely wasteful, since all GPUs would have to wait for the micro-batch with the largest `recycling_iters`?
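To illustrate the concern, here is a minimal sketch (not OpenFold code) of what happens when each rank samples its recycling count independently per micro-batch. The `sample_recycling_iters` helper and the seeding scheme are hypothetical stand-ins for the per-worker sampling in the dataloader; the point is that with synchronized gradient steps, every rank pays for the slowest micro-batch:

```python
import random

MAX_RECYCLING_ITERS = 3  # AF2 samples uniformly from {0, 1, 2, 3}

def sample_recycling_iters(rank: int, step: int) -> int:
    """Hypothetical per-rank sampling: each rank seeds its own RNG,
    mimicking independent sampling inside each dataloader process."""
    rng = random.Random(hash((rank, step)))
    return rng.randint(0, MAX_RECYCLING_ITERS)

world_size = 4
for step in range(3):
    iters = [sample_recycling_iters(r, step) for r in range(world_size)]
    # Gradient synchronization means every rank waits for the slowest
    # one, so the effective per-step cost is max(iters), not mean(iters).
    print(f"step {step}: per-rank iters {iters}, effective cost {max(iters)}")
```

In expectation, the maximum over several ranks is noticeably larger than the mean of a single draw, which is the wasted time this issue describes.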

For DDP we could simply use the broadcast API to send `recycling_iters` from rank 0 to the rest of the processes. Looking at Lightning's `DeepSpeedStrategy` code, it inherits from the `DDPStrategy` class, along with the `broadcast` method. The inherited method is actually used throughout the `DeepSpeedStrategy` class, so we should be fine using it for both distributed training strategies.
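A possible shape for that fix, sketched with a hypothetical helper (the name `sync_recycling_iters` is mine, not OpenFold's), using `torch.distributed.broadcast_object_list` so the same call works under both DDP and DeepSpeed. It falls back to the local value when no process group is running, so single-device debugging is unaffected:

```python
# Hedged sketch: synchronize the sampled recycling count so every rank
# uses rank 0's value. Assumes the caller invokes this once per batch,
# after the local sample and before the model forward pass.
try:
    import torch.distributed as dist
except ImportError:  # torch not installed; single-process fallback below
    dist = None

def sync_recycling_iters(local_iters: int) -> int:
    """Return rank 0's recycling count on every rank.

    When no process group is initialized (single GPU / CPU runs),
    the locally sampled value is returned unchanged.
    """
    if dist is None or not dist.is_available() or not dist.is_initialized():
        return local_iters
    payload = [local_iters]
    # broadcast_object_list overwrites payload in place with rank 0's value
    dist.broadcast_object_list(payload, src=0)
    return payload[0]
```

An alternative with the same effect would be to seed the recycling sampler identically on every rank per global step, which avoids a collective call entirely.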

Thanks for your help in advance :)

pujaltes avatar Mar 06 '24 18:03 pujaltes