fp16 buffers for ADAM
First proof-of-concept implementation
Instead of having a single scale factor per tensor, we have scales for individual groups of 32 values. This is less about getting more accuracy (though it might help with that), and more about ensuring that we don't need any form of cross-warp communication to handle the scales. I'd expect the group size of 32 to increase once we switch to vectorized adam kernels anyway.
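Roughly, the idea is: each group of 32 consecutive values gets its own scale (the group's absmax), the values are divided by that scale before being stored in half precision, and multiplied back when read. A minimal CPU sketch of that logic, with plain floats standing in for the `__half` storage and all names being mine, not from this PR:

```c
#include <assert.h>
#include <math.h>

#define GROUP_SIZE 32

/* Scale each group of GROUP_SIZE values by the group's absmax so every
 * scaled entry lands in [-1, 1], well inside fp16 range. One scale is
 * written per group; no information from other groups is needed, which
 * is what lets each warp handle its own scale on the GPU. */
static void scale_groups(const float *in, float *scaled, float *scales, int n) {
    for (int g = 0; g * GROUP_SIZE < n; g++) {
        int start = g * GROUP_SIZE;
        int end = start + GROUP_SIZE < n ? start + GROUP_SIZE : n;
        float absmax = 0.f;
        for (int i = start; i < end; i++) {
            float a = fabsf(in[i]);
            if (a > absmax) absmax = a;
        }
        float scale = absmax > 0.f ? absmax : 1.f;
        scales[g] = scale;
        for (int i = start; i < end; i++)
            scaled[i] = in[i] / scale;  /* would be stored as __half */
    }
}

/* Inverse transform: look up the group's scale and multiply back. */
static void unscale_groups(const float *scaled, const float *scales,
                           float *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = scaled[i] * scales[i / GROUP_SIZE];
}
```

In the real kernel the absmax over 32 values maps onto a single warp, so the reduction can be done with warp shuffles and no shared memory or cross-warp synchronization.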
I think I'm missing a bit of context on this PR. Is this following some paper / approach?
It comes from the appendix of "Efficient Large Scale Language Modeling with Mixtures of Experts", which in turn cites "Jukebox: A Generative Model for Music".
However, this is not actually a 1:1 implementation of that. If you want one scaling factor per tensor, the adam kernel needs to know which tensor it is currently processing (my other draft adam PR). It also requires synchronization: you need a pass over the entire tensor to determine the max before you can scale the values and write them back to memory.
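To make the contrast concrete, here is what per-tensor scaling requires, again as a hypothetical CPU sketch (names are mine): a full first pass over the tensor to find the global absmax, and only then the scaled write-back. On the GPU, pass 1 becomes a cross-block reduction followed by a synchronization point (or a second kernel launch).

```c
#include <assert.h>
#include <math.h>

/* Pass 1: global reduction over the whole tensor. */
static float tensor_absmax(const float *t, int n) {
    float m = 0.f;
    for (int i = 0; i < n; i++) {
        float a = fabsf(t[i]);
        if (a > m) m = a;
    }
    return m;
}

/* Per-tensor scaling: the scale depends on *all* values, so the
 * write-back (pass 2) cannot start until pass 1 has finished. */
static float scale_tensor(const float *in, float *scaled, int n) {
    float scale = tensor_absmax(in, n);
    if (scale == 0.f) scale = 1.f;
    for (int i = 0; i < n; i++)
        scaled[i] = in[i] / scale;
    return scale;
}
```

With per-group scales, that global dependency disappears: each group of 32 can be scaled and written independently.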
Having one scale factor per group requires more memory, though the amount should still be negligible (one extra scale per 32 elements is a few percent at most, and I assume the group size will increase when we use vector loads here).
Rebased on the latest changes from master.
I used #288 to generate a gpt2-large model.
Without this patch, training at batch size 1 requires 12658MiB; with the fp16 buffers, this goes down to 9892MiB.
Sadly, it's not enough to allow me to test the gpt2-xl on my 16GB card, even with batch size one.