nanoGPT
Add gradient accumulation support
Enables training with larger effective batch sizes by running multiple forward/backward passes (micro-steps) between gradient updates. I've always found this useful since batch size correlates strongly with performance even for small models (per Kaplan et al., 2020). Feel free to decline, of course.
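For reference, the core loop is roughly the following (a minimal sketch in train.py terms, not the exact diff; `get_batch`, `model`, and `optimizer` as in train.py):

```python
# Minimal sketch of gradient accumulation: run several "micro" forward/backward
# passes before each optimizer update. Dividing each loss by
# gradient_accumulation_steps makes the accumulated gradient match what one
# large batch of batch_size * gradient_accumulation_steps samples would give.
for micro_step in range(gradient_accumulation_steps):
    X, Y = get_batch('train')
    logits, loss = model(X, Y)
    loss = loss / gradient_accumulation_steps  # normalize per micro-batch
    loss.backward()                            # gradients accumulate in .grad
optimizer.step()                               # single update for the whole effective batch
optimizer.zero_grad(set_to_none=True)
```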
Ty, I've been meaning to look into adding this, will review shortly.
Sounds good. FWIW, I just noticed that this PR messes with the printed loss here because each loss term is normalized.
One obvious fix is to scale that loss back up by gradient_accumulation_steps, but that can make it a pretty noisy estimate if the micro-batch size is small. What I'd normally do is average metrics like this over several batches (e.g., storing the sum of loss.item() in a variable and resetting after every log_interval steps), but that might defeat the purpose of being lightweight & fast in this case. Happy to push either solution.
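A rough sketch of that second option, continuing the loop sketched above (`train_one_iter` is a hypothetical stand-in for the micro-step loop, returning the sum of the normalized micro-losses; other names follow train.py):

```python
# Average the printed loss over the logging window instead of printing a single
# (noisy) micro-batch loss, then reset the accumulator after each log.
running_loss, n_since_log = 0.0, 0
for iter_num in range(max_iters):
    lossf = train_one_iter()  # hypothetical helper: the micro-step loop above,
                              # returning the sum of the normalized micro-losses
    running_loss += lossf
    n_since_log += 1
    if iter_num % log_interval == 0:
        print(f"iter {iter_num}: loss {running_loss / n_since_log:.4f}")
        running_loss, n_since_log = 0.0, 0
```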
I changed things around a bit wrt semantics and implementation and ended up here: https://github.com/karpathy/nanoGPT/commit/cf9991488629b1b072c49bf261d04b0c8a3207a3
any thoughts? ty for opening the PR.
Great, glad I could help! Minor note: I realized on my own end that the number of eval steps is also affected by this, in that it now refers to micro-batches instead of mini-batches. That shouldn't matter much for the default config (eval_iters = 200), but since I typically use fewer, larger batches, I ended up just multiplying eval_iters by gradient_accumulation_steps here to get a more stable loss.
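Roughly what that looks like on my end (paraphrasing train.py's estimate_loss from memory, not an exact diff; assumes the usual train.py globals like `model`, `eval_iters`, and `get_batch`):

```python
# Evaluate over eval_iters * gradient_accumulation_steps micro-batches so the
# eval loss is averaged over roughly the same number of samples as before.
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    n_eval = eval_iters * gradient_accumulation_steps  # micro-batches, not optimizer steps
    for split in ['train', 'val']:
        losses = torch.zeros(n_eval)
        for k in range(n_eval):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out
```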
ok! just for the record, some rough timings I saw:
- simple 1GPU training: 350ms
- DDP 4GPU training: 520ms (quite a bit of overhead from DDP, ~1.5X? :( ... )
- grad accum 10 steps without the use of no_sync(): 5200ms (i.e. 10X, as expected)
- grad accum 10 steps with the use of no_sync(): 3500ms (i.e. around 6.7X of DDP time, so we save quite a bit of time by syncing just once on the last step and approach the original 1GPU time, with the DDP overhead going to zero on a per-step basis, as expected)
This was just to confirm that no_sync() was actually doing something, given the way I implemented it (toggling the internal variable directly rather than using the context manager). @VHellendoorn note that your original PR did not have it, just a heads up in case you have your own fork you're working with.
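for reference, the toggle amounts to roughly this (paraphrasing, not the exact commit) -- DDP's no_sync() context manager just flips require_backward_grad_sync, so setting it directly means the gradient all-reduce only fires on the last micro-step:

```python
# Sync gradients across ranks only on the final micro-step; earlier micro-steps
# skip the all-reduce, which is what wrapping them in no_sync() would do anyway.
for micro_step in range(gradient_accumulation_steps):
    if ddp:
        model.require_backward_grad_sync = (micro_step == gradient_accumulation_steps - 1)
    X, Y = get_batch('train')
    logits, loss = model(X, Y)
    loss = loss / gradient_accumulation_steps
    loss.backward()  # the all-reduce happens here only when the flag is True
optimizer.step()
optimizer.zero_grad(set_to_none=True)
```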
This is useful info! I hadn't used DDP yet (I've been training a sweep of smaller models instead), but it's nice that the sync overhead becomes negligible with more accumulation steps. I tend to see the same with multi-node training of larger models (e.g. on Megatron) -- it's almost embarrassingly parallel as long as each node spends at least 10-20s doing its own work.