When using gradient accumulation, does the order of optimizer.zero_grad() affect training?
If I use accelerate + deepspeed to train a model, and my config contains:

```yaml
deepspeed_config:
  gradient_accumulation_steps: 8
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: false
  zero_stage: 2
```
does the order of backward(), step(), and zero_grad() affect training?
For example:
```python
for batch in training_dataloader:
    with accelerator.accumulate(model):
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```
and
```python
for batch in training_dataloader:
    with accelerator.accumulate(model):
        optimizer.zero_grad()
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
```
I want to know whether the two orderings yield the same result. During gradient accumulation training, on the step where the parameters are actually updated (accelerator.sync_gradients is True), will the second ordering's optimizer.zero_grad() clear the accumulated gradients, making the accumulation incorrect so that the update effectively sees only the last micro-batch?
@polestarss Our gradient accumulation is implemented with accelerate, so there's no need to worry; it works as expected. When we call accelerator.backward(loss), the gradients are accumulated and managed by accelerate and won't be cleared by optimizer.zero_grad(). Here is the doc:
https://huggingface.co/docs/accelerate/usage_guides/gradient_accumulation
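If you want to see what actually happens with the second ordering, here is a minimal sanity-check sketch using plain Accelerate on a single process (no DeepSpeed; the toy model, data, and accumulation factor are made up for illustration). It runs zero_grad() at the top of the accumulate block and prints accelerator.sync_gradients together with the gradient norm right before optimizer.step(), so you can see on which micro-steps zero_grad() actually takes effect:

```python
# Toy sanity check: the model, data, and step counts below are made up for illustration.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

from accelerate import Accelerator

ACC_STEPS = 4  # hypothetical accumulation factor, stands in for the 8 in the config above

accelerator = Accelerator(gradient_accumulation_steps=ACC_STEPS)

model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataset = TensorDataset(torch.randn(32, 8), torch.randn(32, 1))
dataloader = DataLoader(dataset, batch_size=4)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
loss_fn = nn.MSELoss()

for step, (inputs, targets) in enumerate(dataloader):
    with accelerator.accumulate(model):
        optimizer.zero_grad()  # ordering 2: zero_grad() before backward()
        loss = loss_fn(model(inputs), targets)
        accelerator.backward(loss)

        # Look at the gradient that optimizer.step() is about to consume.
        weight = accelerator.unwrap_model(model).weight
        grad_norm = weight.grad.norm().item()
        print(
            f"micro-step {step}: sync_gradients={accelerator.sync_gradients}, "
            f"grad norm before step() = {grad_norm:.4f}"
        )

        optimizer.step()
```

While sync_gradients is False, the optimizer returned by accelerator.prepare skips the real zero_grad(), so the printed norm keeps accumulating; the line to watch is the one where sync_gradients is True, which tells you whether the gradient being applied is the accumulated one or just the last micro-batch's. With the DeepSpeed config from the question, accumulation is driven by the DeepSpeed engine instead, which is the case the reply above refers to.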
OK, thanks!