gradient accumulation preview / wip
I can't seem to get this working tonight, something is off.
The Python part works, i.e. we have the following. Running the default Python script reproduces the old behavior before this PR:
```
python train_gpt2.py --write_tensors=0
```
This uses B=4, T=64, so each fwd/bwd processes 4 * 64 = 256 tokens. Asking for a total batch size of 512 will therefore calculate that 2 gradient accumulation steps are needed, so we get 256 * 2 = 512 tokens per update (as desired). So doing:
```
python train_gpt2.py --write_tensors=0 --total_batch_size=512
```
Will now do an inner loop of 2: the first iteration does only fwd/bwd, and the second does fwd/bwd/update. The losses that get printed are (correctly!) identical to the ones above, because our data batch is fixed and identical, so we are simply doing two fwd/bwd passes on the exact same data, computing the same gradient, and ending up with the exact same losses, all as expected.
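The bookkeeping above can be sketched as follows. This is a hypothetical NumPy toy (a linear model with an analytic gradient, not the actual train_gpt2.py code), just to show that summing per-micro-batch gradients scaled by 1/grad_accum_steps reproduces the full-batch gradient:

```python
import numpy as np

B, T = 4, 64
total_batch_size = 512
tokens_per_fwdbwd = B * T                                   # 256 tokens per fwd/bwd
assert total_batch_size % tokens_per_fwdbwd == 0
grad_accum_steps = total_batch_size // tokens_per_fwdbwd    # 2 inner-loop steps

# toy model: y_hat = w * x, loss = mean((w*x - y)^2), so dL/dw = mean(2*(w*x - y)*x)
rng = np.random.default_rng(0)
x = rng.standard_normal(total_batch_size)
y = rng.standard_normal(total_batch_size)
w = 0.5

# gradient over the full batch in one go
grad_full = np.mean(2 * (w * x - y) * x)

# gradient accumulated over grad_accum_steps micro-batches,
# each micro-batch gradient scaled by 1/grad_accum_steps so they average
grad_acc = 0.0
for step in range(grad_accum_steps):
    lo, hi = step * tokens_per_fwdbwd, (step + 1) * tokens_per_fwdbwd
    grad_acc += np.mean(2 * (w * x[lo:hi] - y[lo:hi]) * x[lo:hi]) / grad_accum_steps

assert np.isclose(grad_full, grad_acc)
```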
Now, to reproduce the Python "vanilla" setup, we can do in llm.c:
```
make train_gpt2cu PRECISION=FP32 && ./train_gpt2cu -b 4 -t 64 -l 1e-4 -v 200 -s 200 -a 1 -x 10 -f 0
```
This (correctly) reproduces the old behavior and outputs the exact same losses. But now we ask for a total batch size of 512 by appending the new flag -z 512:
```
make train_gpt2cu PRECISION=FP32 && ./train_gpt2cu -b 4 -t 64 -l 1e-4 -v 200 -s 200 -a 1 -x 10 -f 0 -z 512
```
This will correctly calculate that it needs 2 grad accum steps, but the losses come out all wrong, and I'm not sure why. I tried to mirror the approach in both files very closely, including the variable names, etc.
Thinking through:
Issue 1: I noticed that some of the kernels don't correctly += into the parameter gradients. I fixed the few that I found, but I'm not sure that I got all of them, and I don't feel confident that my changes are ok, so to speak.
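For intuition on why a missing += breaks gradient accumulation, here is a hypothetical NumPy sketch (illustrative values, not the actual kernel code): a kernel that writes with = instead of += silently drops all but the last micro-batch's gradient.

```python
import numpy as np

g1 = np.array([0.5, 1.0])    # gradient from micro-batch 1
g2 = np.array([0.25, 2.0])   # gradient from micro-batch 2

# correct: the kernel accumulates, i.e. dweight[i] += ...
dweight = np.zeros(2)
for g in (g1, g2):
    dweight += g
print(dweight)               # [0.75 3.  ] -- sum over both micro-steps

# buggy: the kernel overwrites, i.e. dweight[i] = ...
dweight_bug = np.zeros(2)
for g in (g1, g2):
    dweight_bug = g.copy()   # earlier micro-step's contribution is lost
print(dweight_bug)           # [0.25 2.  ] -- only the last micro-batch survives
```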
Issue 2: I have not yet tested the multi-GPU setup, and I don't have confidence that the current implementation handles it correctly.
To be continued...
This isn't the problem you're debugging, as it won't even affect the FP32 path (I couldn't find anything wrong, sadly, and couldn't spot any kernel not doing += for the parameter gradients!), but I noticed we're doing this everywhere:
```c
dbias[i] += (floatX)scratch_dbias[i];
dweight[i] += (floatX)scratch_dweight[i];
```
That will result in:
- Convert scratch_dbias from FP32 to BF16/FP16
- Do the addition of two BF16/FP16 values with a BF16/FP16 output
- Write BF16/FP16 output to dbias
But what we actually want to do is:
- Convert dbias to FP32
- Do the addition of two FP32 values with a FP32 output
- Round FP32 to BF16/FP16 (either round-to-nearest-even or ideally stochastic rounding)
- Write BF16/FP16 output to dbias
Unfortunately, just removing the (floatX) and making the conversion implicit doesn't help, so we actually need to explicitly cast dbias to FP32:
```c
dbias[i] = (float)dbias[i] + scratch_dbias[i];
dweight[i] = (float)dweight[i] + scratch_dweight[i];
```
Or, even more explicitly:
```c
dbias[i] = (floatX)((float)dbias[i] + scratch_dbias[i]);
dweight[i] = (floatX)((float)dweight[i] + scratch_dweight[i]);
```
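To see concretely why the extra rounding step matters, here is a small NumPy sketch that simulates bfloat16 round-to-nearest-even by bit manipulation of float32 (a hypothetical illustration, not the kernel code). Rounding the FP32 scratch value down to BF16 before the add, and then rounding the BF16 add's result again, is a double rounding, and there are inputs where it lands a full BF16 ulp away from the single-rounding path:

```python
import numpy as np

def rne_bf16(x: np.ndarray) -> np.ndarray:
    """Round float32 to bfloat16 (round-to-nearest-even), kept in a float32 container."""
    bits = x.astype(np.float32).view(np.uint32)
    # standard bias trick: add 0x7FFF plus the lowest kept bit, then truncate to 16 bits
    bias = np.uint32(0x7FFF) + ((bits >> np.uint32(16)) & np.uint32(1))
    return ((bits + bias) & np.uint32(0xFFFF0000)).view(np.float32)

# dbias holds a BF16-representable value; scratch is a full-precision FP32 gradient
dbias   = np.array([1.0078125], dtype=np.float32)           # 1 + 2^-7, exact in BF16
scratch = np.array([2.0**-8 - 2.0**-17], dtype=np.float32)  # not BF16-representable

# current code: dbias[i] += (floatX)scratch_dbias[i]
# -> scratch is rounded to BF16 first, then the BF16+BF16 add result is rounded again
wrong = rne_bf16(dbias + rne_bf16(scratch))

# proposed fix: dbias[i] = (floatX)((float)dbias[i] + scratch_dbias[i])
# -> one exact FP32 add, one rounding at the end
right = rne_bf16(dbias + scratch)

print(wrong[0], right[0])  # 1.015625 vs 1.0078125 -- one full BF16 ulp apart
```

In this crafted case the pre-rounded scratch lands exactly on the halfway point of dbias's BF16 ulp, so the second rounding tips the result up, while the single-rounding path correctly stays below it.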
This also goes back to what I was saying yesterday: it would be possible to change Packed128 to always read/write floats rather than the native type, using some "very C++" idioms that I really don't love. That would be fully hidden inside the Packed128 class, with no impact on the kernels, and would protect us against this kind of thing; it could potentially also "hide" stochastic rounding inside Packed128 (although that has its own set of trade-offs). As long as we do the "right thing" manually in every kernel, it might not be worth the complexity, though.
Anyway, for now just doing the latter would be good as a baseline before this gets merged.
Merged the alternative PR to this one, closing.