Ask-Anything
Using Gradient Accumulation
Hello, thanks for your great work. I need to use gradient accumulation over batches due to memory constraints. The training loop iterates over two modalities, and I am unsure about the implications of gradient accumulation in this setting. Is it possible and recommended to use gradient accumulation with multiple modalities in one iterator? Here is the accumulation logic I am currently using:
with torch.cuda.amp.autocast(enabled=config.fp16):
    loss_dict = model(image, text)
    # Divide the loss so the accumulated gradient matches one large batch.
    loss = sum(loss_dict.values()) / config.accumulate_grad_batches

scaler.scale(loss).backward()
accumulated_batches += 1

if accumulated_batches % config.accumulate_grad_batches == 0:
    if config.optimizer.max_grad_norm > 0:
        # Unscale before clipping so the norm is measured on the true gradients.
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), config.optimizer.max_grad_norm)
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()  # Reset gradients only after the optimizer step
    scheduler.step()       # Step the scheduler once per optimizer step
    accumulated_batches = 0
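
Regarding the two-modality concern: gradient accumulation works the same way as long as every backward() call between optimizer steps uses the same loss divisor and the step/zero_grad/scheduler logic is shared across both iterators. Below is a minimal sketch, assuming the two modalities come from two hypothetical dataloaders (image_loader and video_loader) and reusing the model/optimizer/scheduler/scaler/config names from the snippet above; this is not the repository's actual training loop, only an illustration of one way to interleave the modalities.

import torch

# Assumed to exist already: model, optimizer, scheduler, config,
# image_loader, video_loader (hypothetical names for the two modality iterators).
scaler = torch.cuda.amp.GradScaler(enabled=config.fp16)
optimizer.zero_grad()
accumulated_batches = 0

# Interleave the two modalities so each optimizer step accumulates
# gradients from both before the weights are updated.
for (image, image_text), (video, video_text) in zip(image_loader, video_loader):
    for inputs, text in ((image, image_text), (video, video_text)):
        with torch.cuda.amp.autocast(enabled=config.fp16):
            loss_dict = model(inputs, text)
            # Same divisor for every micro-batch between optimizer steps.
            loss = sum(loss_dict.values()) / config.accumulate_grad_batches
        scaler.scale(loss).backward()
        accumulated_batches += 1

        if accumulated_batches % config.accumulate_grad_batches == 0:
            if config.optimizer.max_grad_norm > 0:
                scaler.unscale_(optimizer)
                torch.nn.utils.clip_grad_norm_(model.parameters(),
                                               config.optimizer.max_grad_norm)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
            scheduler.step()
            accumulated_batches = 0

One thing to double-check is what config.accumulate_grad_batches is supposed to count: in the sketch it counts micro-batches (so each loop iteration contributes two toward the threshold, one per modality). If you instead want it to count loop iterations, the divisor and the modulo test both need to reflect the total number of backward passes per optimizer step, so that the effective gradient still matches a single large batch.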