
Add gradient accumulation support

Open · VHellendoorn opened this issue 2 years ago

Enables training with larger effective batch sizes by accumulating gradients over multiple forward/backward passes before each optimizer update. I've always found this useful, since batch size correlates strongly with performance even for small models (per Kaplan '20). Feel free to decline, of course.
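
For context, the core of the idea is roughly the following (a simplified sketch with placeholder names like `get_batch`, `model`, and `optimizer`, not the actual diff):

```python
# Gradient accumulation sketch: run several micro-batches per iteration and only
# step the optimizer once, so the effective batch size is
# micro_batch_size * gradient_accumulation_steps.
gradient_accumulation_steps = 8  # illustrative value

for iter_num in range(max_iters):
    optimizer.zero_grad(set_to_none=True)
    for micro_step in range(gradient_accumulation_steps):
        X, Y = get_batch('train')  # one micro-batch
        logits, loss = model(X, Y)
        # scale so the accumulated gradient equals the mean over the full effective batch
        (loss / gradient_accumulation_steps).backward()
    optimizer.step()
```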

VHellendoorn · Jan 09 '23 20:01

Ty, I've been meaning to look into adding this; will review shortly.

karpathy · Jan 10 '23 01:01

Sounds good. FWIW, I just noticed that this PR messes with the printed loss here, because each micro-step's loss is normalized by the number of accumulation steps. One obvious fix is to scale that printed loss back up by gradient_accumulation_steps, but that makes it a pretty noisy estimate if the micro-batch size is small. What I'd normally do is average metrics like this over several batches (e.g., keeping a running sum of loss.item() and resetting it every log_interval steps), but that might defeat the purpose of being lightweight & fast in this case. Happy to push either solution.
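
Something like this is what I mean by the second option (illustrative names, not the actual train.py):

```python
# Running-average logging sketch: each micro-step loss is already divided by
# gradient_accumulation_steps, so summing them over one iteration gives that
# iteration's mean loss; average those means over log_interval iterations.
running_loss = 0.0
for iter_num in range(max_iters):
    optimizer.zero_grad(set_to_none=True)
    iter_loss = 0.0
    for micro_step in range(gradient_accumulation_steps):
        X, Y = get_batch('train')
        _, loss = model(X, Y)
        loss = loss / gradient_accumulation_steps
        iter_loss += loss.item()
        loss.backward()
    optimizer.step()
    running_loss += iter_loss
    if (iter_num + 1) % log_interval == 0:
        print(f"iter {iter_num}: train loss {running_loss / log_interval:.4f}")
        running_loss = 0.0
```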

VHellendoorn · Jan 10 '23 18:01

I changed things around a bit wrt semantics and implementation and ended up here: https://github.com/karpathy/nanoGPT/commit/cf9991488629b1b072c49bf261d04b0c8a3207a3

any thoughts? ty for opening the PR.

karpathy · Jan 15 '23 17:01

Great, glad I could help! Minor note: I realized on my own end that the number of eval steps is also affected by this, in that it now refers to micro-batches instead of minibatches. That shouldn't matter much for the default config (eval_iters = 200), but since I typically use fewer, larger batches, I ended up just multiplying eval_iters by gradient_accumulation_steps here to get a more stable loss estimate.
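
Concretely, something like this in the config overrides (illustrative, using the default of 200):

```python
# eval_iters now counts micro-batches rather than full accumulated batches, so
# scale it up to keep the amount of data used for eval roughly constant
eval_iters = 200 * gradient_accumulation_steps
```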

VHellendoorn · Jan 15 '23 17:01

ok! just for the record, some rough timings I saw:

  • simple 1GPU training: 350ms
  • DDP 4GPU training: 520ms (quite a bit of overhead from DDP, ~1.5X? :( ...)
  • grad accum 10 steps without the use of no_sync(): 5200ms (i.e. 10X, as expected)
  • grad accum 10 steps with no_sync(): 3500ms (i.e. around 6.7X of the DDP time; syncing just once on the last micro-step saves quite a bit, and the per-step cost approaches the original 1GPU time, with the DDP overhead going towards zero on a per-step basis, as expected)

This was just to confirm that the no_sync() behavior was actually kicking in, given that I implemented it by toggling DDP's internal variable rather than using the context manager directly. @VHellendoorn note that your original PR did not have it, just a heads up if you have your own fork you're working with.
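
For reference, the toggle is roughly the following (a simplified sketch rather than the exact commit; `ddp` here is just a flag for whether the model is DDP-wrapped):

```python
# Sync gradients across ranks only on the last micro-step. Setting DDP's
# require_backward_grad_sync flag before backward() skips the all-reduce on the
# other micro-steps, which is what the no_sync() context manager does under the hood.
for micro_step in range(gradient_accumulation_steps):
    if ddp:
        model.require_backward_grad_sync = (micro_step == gradient_accumulation_steps - 1)
    X, Y = get_batch('train')
    _, loss = model(X, Y)
    (loss / gradient_accumulation_steps).backward()
optimizer.step()
optimizer.zero_grad(set_to_none=True)
```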

karpathy · Jan 15 '23 18:01

This is useful info! I hadn't used DDP yet (I've been training a sweep of smaller models instead), but it's nice that the sync overhead becomes negligible with more accumulation steps. I tend to see the same with multi-node training of larger models (e.g. on Megatron) -- it's almost embarrassingly parallel as long as each node spends at least 10-20s doing its own stuff.

VHellendoorn · Jan 15 '23 18:01