When using gradient accumulation, does the order of optimizer.zero_grad() affect training?
If I use accelerate + deepspeed to train a model, and my config contains:

```yaml
deepspeed_config:
  gradient_accumulation_steps: 8
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: false
  zero_stage: 2
```
does the order of backward(), step(), and zero_grad() affect training?
For example:
```python
for batch in training_dataloader:
    with accelerator.accumulate(model):
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```
and
```python
for batch in training_dataloader:
    with accelerator.accumulate(model):
        optimizer.zero_grad()
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
```
I want to know whether the two orderings yield the same result. During gradient accumulation training, on the step where the parameters are actually updated (accelerator.sync_gradients is True), will the second ordering's optimizer.zero_grad() clear the accumulated gradients, making the accumulation incorrect so that the update effectively sees only the last micro-batch?
@polestarss Our gradient accumulation is implemented with accelerate, so there's no need to worry; it works as expected. When we call accelerator.backward(loss), the gradients are accumulated and managed by accelerate and won't be cleared by optimizer.zero_grad(). Here is the doc:
https://huggingface.co/docs/accelerate/usage_guides/gradient_accumulation
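If you want to see what actually happens with the second ordering, here is a minimal sanity-check sketch using plain Accelerate on a single process (no DeepSpeed; the toy model, data, and accumulation factor are made up for illustration). It runs zero_grad() at the top of the accumulate block and prints accelerator.sync_gradients together with the gradient norm right before optimizer.step(), so you can see on which micro-steps zero_grad() actually takes effect:

```python
# Toy sanity check: the model, data, and step counts below are made up for illustration.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

from accelerate import Accelerator

ACC_STEPS = 4  # hypothetical accumulation factor, stands in for the 8 in the config above

accelerator = Accelerator(gradient_accumulation_steps=ACC_STEPS)

model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataset = TensorDataset(torch.randn(32, 8), torch.randn(32, 1))
dataloader = DataLoader(dataset, batch_size=4)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
loss_fn = nn.MSELoss()

for step, (inputs, targets) in enumerate(dataloader):
    with accelerator.accumulate(model):
        optimizer.zero_grad()  # ordering 2: zero_grad() before backward()
        loss = loss_fn(model(inputs), targets)
        accelerator.backward(loss)

        # Look at the gradient that optimizer.step() is about to consume.
        weight = accelerator.unwrap_model(model).weight
        grad_norm = weight.grad.norm().item()
        print(
            f"micro-step {step}: sync_gradients={accelerator.sync_gradients}, "
            f"grad norm before step() = {grad_norm:.4f}"
        )

        optimizer.step()
```

While sync_gradients is False, the optimizer returned by accelerator.prepare skips the real zero_grad(), so the printed norm keeps accumulating; the line to watch is the one where sync_gradients is True, which tells you whether the gradient being applied is the accumulated one or just the last micro-batch's. With the DeepSpeed config from the question, accumulation is driven by the DeepSpeed engine instead, which is the case the reply above refers to.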
OK, thanks!