Ask-Anything
Using Gradient Accumulation
Hello, thanks for your great work. I need to use gradient accumulation over batches due to memory constraints. The training loop iterates over two modalities, and I am unsure about the implications of gradient accumulation in this setting. Is it possible and recommended to use gradient accumulation with multiple modalities in one iterator? Here is the accumulation logic I am currently using:
with torch.cuda.amp.autocast(enabled=config.fp16):
    loss_dict = model(image, text)
    # Divide the loss so the accumulated gradient matches one large batch.
    loss = sum(loss_dict.values()) / config.accumulate_grad_batches

scaler.scale(loss).backward()
accumulated_batches += 1

if accumulated_batches % config.accumulate_grad_batches == 0:
    if config.optimizer.max_grad_norm > 0:
        # Unscale before clipping so the norm is measured on the true gradients.
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), config.optimizer.max_grad_norm)
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()  # Reset gradients only after the optimizer step
    scheduler.step()       # Step the scheduler once per optimizer step
    accumulated_batches = 0
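
Regarding the two-modality concern: gradient accumulation works the same way as long as every backward() call between optimizer steps uses the same loss divisor and the step/zero_grad/scheduler logic is shared across both iterators. Below is a minimal sketch, assuming the two modalities come from two hypothetical dataloaders (image_loader and video_loader) and reusing the model/optimizer/scheduler/scaler/config names from the snippet above; this is not the repository's actual training loop, only an illustration of one way to interleave the modalities.

import torch

# Assumed to exist already: model, optimizer, scheduler, config,
# image_loader, video_loader (hypothetical names for the two modality iterators).
scaler = torch.cuda.amp.GradScaler(enabled=config.fp16)
optimizer.zero_grad()
accumulated_batches = 0

# Interleave the two modalities so each optimizer step accumulates
# gradients from both before the weights are updated.
for (image, image_text), (video, video_text) in zip(image_loader, video_loader):
    for inputs, text in ((image, image_text), (video, video_text)):
        with torch.cuda.amp.autocast(enabled=config.fp16):
            loss_dict = model(inputs, text)
            # Same divisor for every micro-batch between optimizer steps.
            loss = sum(loss_dict.values()) / config.accumulate_grad_batches
        scaler.scale(loss).backward()
        accumulated_batches += 1

        if accumulated_batches % config.accumulate_grad_batches == 0:
            if config.optimizer.max_grad_norm > 0:
                scaler.unscale_(optimizer)
                torch.nn.utils.clip_grad_norm_(model.parameters(),
                                               config.optimizer.max_grad_norm)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
            scheduler.step()
            accumulated_batches = 0

One thing to double-check is what config.accumulate_grad_batches is supposed to count: in the sketch it counts micro-batches (so each loop iteration contributes two toward the threshold, one per modality). If you instead want it to count loop iterations, the divisor and the modulo test both need to reflect the total number of backward passes per optimizer step, so that the effective gradient still matches a single large batch.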