llm.c

fp16 buffers for ADAM

Open ngc92 opened this issue 1 year ago • 4 comments

First proof-of-concept implementation

ngc92 avatar Apr 29 '24 14:04 ngc92

Instead of having a single scale factor per tensor, we have scales for individual groups of 32 values. This is less about getting more accuracy (though it might help with that), and more about ensuring that we don't need any form of cross-warp communication to handle the scales. I'd expect the group size of 32 to increase once we switch to vectorized Adam kernels anyway.

ngc92 avatar Apr 29 '24 15:04 ngc92

I think I'm missing a bit of context on this PR. Is this following some paper / approach?

karpathy avatar Apr 29 '24 16:04 karpathy

It comes from the appendix of "Efficient Large Scale Language Modeling with Mixtures of Experts", which in turn cites "Jukebox: A Generative Model for Music".

However, this is not actually a 1:1 implementation of that. If you want to have one scaling factor per tensor, the Adam kernel needs to know which tensor it is in (see my other draft Adam PR). It also requires synchronization: you need to process the entire tensor, determine the max, scale things accordingly, and write to memory.

Having one scale factor per block requires more memory (though the amount should still be negligible, especially since I assume the block size will increase once we use vector loads here).

ngc92 avatar Apr 29 '24 17:04 ngc92

Rebased on the latest changes from master. I used #288 to generate a gpt2-large model. Without this patch, training at batch size 1 requires 12658MiB; with the fp16 buffers, this goes down to 9892MiB.

Sadly, it's not enough to allow me to test the gpt2-xl on my 16GB card, even with batch size one.

ngc92 avatar Apr 29 '24 20:04 ngc92