How to handle gradient accumulation with multiple models?
To do gradient accumulation with accelerate, we wrap the model in the accelerator.accumulate context. But what would be the right way to achieve this when multiple models are involved?
For example, when training latent diffusion models we have three separate models, a VAE, a text encoder, and a UNet, as you can see in this script. Of these, only the text_encoder is being trained (but the others could be trained as well).
The obvious way to do this would be to create a wrapper model, but I'm curious to know whether this can be achieved without one.
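For reference, the single-model pattern is roughly this (a minimal sketch following the accelerate docs, with a toy model standing in for the real network):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)

model = nn.Linear(10, 1)  # toy model standing in for the real network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataloader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 1)), batch_size=4)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    # The accumulate() context only syncs gradients and steps the optimizer
    # once every gradient_accumulation_steps batches.
    with accelerator.accumulate(model):
        loss = nn.functional.mse_loss(model(inputs), targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```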
cc @muellerzr
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Re-opening this issue again. For doing grad accum with accelerator.accumulate with two models (both being trained), can we use two context managers like this?

with accumulate(model1), accumulate(model2):
    training_step()
Does this currently work? Or is that a feature request, meaning it currently wouldn't work, but would work in the future? Sorry, I got a bit confused by the feature request tag and your comment.
Very interested in this. I'm training two models at once and can only use batch sizes of less than 5 on my machine, so gradient accumulation would be great.
I "solved" it by creating one Accelerator
per model. If you use only one and register the models via accelerator.prepare(model1, ..., modelN)
at least one of the models is not learning anything. This might be a bug.
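A rough sketch of that setup (toy models and placeholder hyperparameters; whether sharing state across two Accelerator instances is fully supported is exactly what's in question here):

```python
import torch
from torch import nn
from accelerate import Accelerator

# One Accelerator per model; each model/optimizer pair is prepared by its own instance.
accel_1 = Accelerator(gradient_accumulation_steps=4)
accel_2 = Accelerator(gradient_accumulation_steps=4)

model1, model2 = nn.Linear(8, 8), nn.Linear(8, 1)  # toy models standing in for the real networks
optimizer1 = torch.optim.AdamW(model1.parameters())
optimizer2 = torch.optim.AdamW(model2.parameters())

model1, optimizer1 = accel_1.prepare(model1, optimizer1)
model2, optimizer2 = accel_2.prepare(model2, optimizer2)

# In the training loop the two accumulate contexts can then be nested
# (see the later comment in this thread), e.g.:
#   with accel_1.accumulate(model1), accel_2.accumulate(model2):
#       ...
```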
I think in this case writing the accumulation yourself may be more flexible. Accelerator.accumulate() is not necessary. Just write code like:

loss = loss / gradient_accumulation_steps
accelerator.backward(loss)
if (index + 1) % gradient_accumulation_steps == 0:
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
This is pretty simple actually: just use "with accelerator.accumulate(model1), accelerator.accumulate(model2):". That is how "with" works; the following code runs inside both contexts, so simply put them together with a comma.
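In other words (a snippet reusing the names from this thread; the comma form is just shorthand for nesting the two context managers):

```python
# These two forms are equivalent ways to enter both accumulate contexts:
with accelerator.accumulate(model1), accelerator.accumulate(model2):
    training_step()

with accelerator.accumulate(model1):
    with accelerator.accumulate(model2):
        training_step()
```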
This apparently is not working. I printed AdamW statistics of parameter groups from different models, and one of them will go out of sync between GPUs with this setup, which from my point of view should not happen in DDP.
Quoting from the torch forum:
That said, if a single call to backward involves gradient accumulation for more than 1 DDP wrapped module, then you’ll have to use a different process group for each of them to avoid interference.
(Format of the logged statistics: Sync [GPU ID], followed by the min / max / mean of each parameter group's exp_avg_sq.sqrt(); the full per-GPU logs are omitted here.)
After wrapping the models in a SuperModel module, they no longer go out of sync.
TL;DR: don't do gradient accumulation with multiple separately prepared models. Wrap them in a wrapper model, do the accelerator calls on that, and move the relevant forward logic inside the wrapper.
Edit: creating an Accelerator for each model as @LvanderGoten suggests could also work. Personally I prefer the wrapper model.
@eliphatfs would you please show your solution (wrapping the models together) in pseudo code? I am working on training ControlNet + SD modules together.
Basically, if you have this in your main training loop:
states = text_encoder(input_ids)
pred = unet(noisy_latents, states, timesteps)
loss = F.mse_loss(pred, targets)
# now loss.backward() will corrupt gradients if you are using accumulation on multi-gpu
Change it into:
class SuperModel(nn.Module):
    def __init__(self, unet: UNet2DConditionModel, text_encoder: nn.Module) -> None:
        super().__init__()
        self.unet = unet
        self.text_encoder = text_encoder

    def forward(self, input_ids, noisy_latents, timesteps):
        states = self.text_encoder(input_ids)
        return self.unet(noisy_latents, states, timesteps)
When constructing the models, build a SuperModel once you are done setting up the individual modules. When calling accelerator.prepare, only pass the SuperModel. Do the same with the optimizer and gradient clipping (or maybe these are not important).
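For example, the wiring could look roughly like this (a sketch, not taken from the original script; it assumes a single AdamW optimizer over both submodules and an already-built train_dataloader):

```python
supermodel = SuperModel(unet, text_encoder)

# One optimizer over the wrapper, so DDP sees a single module and a single backward graph.
optimizer = torch.optim.AdamW(supermodel.parameters(), lr=1e-5)

supermodel, optimizer, train_dataloader = accelerator.prepare(
    supermodel, optimizer, train_dataloader
)

# Gradient clipping likewise targets the wrapper's parameters (inside the training loop):
accelerator.clip_grad_norm_(supermodel.parameters(), max_norm=1.0)
```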
And in the main loop, replace the two lines above with a single call to the SuperModel forward:
pred = supermodel(input_ids, noisy_latents, timesteps)
loss = F.mse_loss(pred, targets)
You may also need to change the final saving:
supermodel: SuperModel = accelerator.unwrap_model(supermodel)
supermodel.text_encoder.save_pretrained(os.path.join(args.output_dir, 'text_encoder'))
supermodel.unet.save_pretrained(os.path.join(args.output_dir, 'unet'))
I don't have a good idea yet how to handle LoRA layers. It seems that LoRA layers on multiple modules cause more problems, since only the AttnProcLayers get prepare-d.
I "solved" it by creating one
Accelerator
per model. If you use only one and register the models viaaccelerator.prepare(model1, ..., modelN)
at least one of the models is not learning anything. This might be a bug.
You mean create two Accelerator objects and use nested accumulate contexts in the training loop?

with accel_1.accumulate(model1):
    with accel_2.accumulate(model2):
        training_step()
Does it now support gradient accumulation for multiple models?
I think #1708 should fix it, according to the comments there.
Can we use gradient accumulation for multiple models in distributed training?
Yes, just wrap them all in the accumulate function, as shown in the PR linked earlier.
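A minimal sketch of that usage (toy models and data, not code from the PR itself):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)

# Toy models standing in for the real ones, and a single optimizer over both.
model1, model2 = nn.Linear(8, 8), nn.Linear(8, 1)
optimizer = torch.optim.AdamW(list(model1.parameters()) + list(model2.parameters()))
dataloader = DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=4)

model1, model2, optimizer, dataloader = accelerator.prepare(model1, model2, optimizer, dataloader)

for inputs, targets in dataloader:
    # accumulate() accepts several models; gradient sync is skipped for all of them
    # until gradient_accumulation_steps batches have been processed.
    with accelerator.accumulate(model1, model2):
        loss = nn.functional.mse_loss(model2(model1(inputs)), targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```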
I think in this case writing the accumulation yourself may be more flexible. Accelerator.accumulate() is not necessary. Just write code like:

loss = loss / gradient_accumulation_steps
accelerator.backward(loss)
if (index + 1) % gradient_accumulation_steps == 0:
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
If you go the manual route, don't forget to also delete the with accelerator.accumulate(unet): block and the gradient_accumulation_steps=args.gradient_accumulation_steps argument passed to Accelerator(...), guys.
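Putting that together, the fully manual version would look roughly like this (a sketch; placeholder names such as compute_loss, unet, and the step count are illustrative, not from any particular script):

```python
from accelerate import Accelerator

gradient_accumulation_steps = 4  # placeholder value

# Manual accumulation: no gradient_accumulation_steps passed to Accelerator,
# and no accelerator.accumulate(...) context around the step.
accelerator = Accelerator()
unet, optimizer, scheduler, dataloader = accelerator.prepare(unet, optimizer, scheduler, dataloader)

for index, batch in enumerate(dataloader):
    loss = compute_loss(unet, batch)          # hypothetical loss function
    loss = loss / gradient_accumulation_steps
    accelerator.backward(loss)
    if (index + 1) % gradient_accumulation_steps == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```

Note that without accumulate()/no_sync, DDP will all-reduce gradients on every micro-batch; that is still correct, just a bit more communication per step.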