llm.c
only save missing bits to reconstruct fp32 master weights
I think I managed to get the bit-fiddling right, and this will effectively give us fp31 master parameters at the cost of only 16 additional bits (instead of the current 32).
Before merging, the code really needs a critical review. I'm especially unsure about the stochastic rounding code: why do we have two calls to the RNG for each rounding operation?
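To make the bit-level idea concrete, here is a minimal host-side C sketch of the simplest variant: the bf16 copy is the truncated upper half of the fp32 word, and the 16 dropped bits are stored separately. The helper names (`split_master`, `reconstruct_master`) are hypothetical and this is not the code from the PR. In particular, it assumes truncation; when the bf16 copy is rounded (stochastically or otherwise), the upper half can be off by one step, and the reconstruction needs extra handling that this sketch does not cover.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers for illustration only; not the PR's implementation. */

/* Reinterpret a float as its raw 32-bit pattern (memcpy avoids aliasing UB). */
static uint32_t float_to_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Reinterpret a raw 32-bit pattern as a float. */
static float bits_to_float(uint32_t u) {
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Split an fp32 master weight into:
     bf16_bits: the upper 16 bits (sign, exponent, top 7 mantissa bits),
                i.e. the bf16 pattern obtained by truncation (assumption);
     low_bits:  the 16 low mantissa bits that bf16 drops. */
static void split_master(float master, uint16_t *bf16_bits, uint16_t *low_bits) {
    uint32_t u = float_to_bits(master);
    *bf16_bits = (uint16_t)(u >> 16);
    *low_bits  = (uint16_t)(u & 0xFFFFu);
}

/* Glue the two halves back together to recover the original fp32 value. */
static float reconstruct_master(uint16_t bf16_bits, uint16_t low_bits) {
    return bits_to_float(((uint32_t)bf16_bits << 16) | (uint32_t)low_bits);
}
```

With truncation the round trip is bit-exact, so the extra optimizer state per parameter is the 16-bit `low_bits` value rather than a full 32-bit copy.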
I assume the two calls are there because we don't want every thread in the kernel to do stochastic rounding with the same seed.
At least that was the idea, but we'll end up with the same seed either way because it's pseudorandom. :')
Thus I'd say it's likely a bug.
EDIT: actually, the unique position of the thread (threadIdx.x and blockIdx.x) is passed to Get2dNoiseUint inside the stochastic rounding function, so each thread gets a different random value even though the seed stays constant. The other, external Get2dNoiseUint call also includes the Y information, which helps strided kernels get a unique rounding seed.
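The point about position-keyed noise can be sketched on the host in plain C. `noise_2d` below is a hypothetical stand-in for Get2dNoiseUint (it uses a MurmurHash3-style finalizer, not the actual llm.c function), and `stochastic_round_bf16` is an assumed, simplified rounding routine that ignores NaN and infinity. The sketch shows that one constant seed still yields a different random value per (x, y) position, and that the random value only decides whether the dropped 16 bits round up or down.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a position-keyed noise function.
   Not the real Get2dNoiseUint; the mixing steps are an assumption. */
static uint32_t noise_2d(uint32_t x, uint32_t y, uint32_t seed) {
    uint32_t h = x + 198491317u * y;  /* fold (x, y) into one index via a large prime */
    h += seed * 0x9E3779B9u;          /* offset by the seed, which may stay constant */
    h ^= h >> 16;                     /* MurmurHash3-style 32-bit finalizer: */
    h *= 0x85EBCA6Bu;                 /* every step is invertible, so distinct */
    h ^= h >> 13;                     /* indices map to distinct outputs */
    h *= 0xC2B2AE35u;
    h ^= h >> 16;
    return h;
}

/* Stochastically round an fp32 value to a bf16 bit pattern.
   The 16 dropped bits are compared against a 16-bit random threshold, so the
   chance of rounding up (away from zero) is dropped / 65536.
   Simplification: NaN and infinity inputs are not handled. */
static uint16_t stochastic_round_bf16(float in, uint32_t random) {
    uint32_t bits;
    memcpy(&bits, &in, sizeof bits);
    uint32_t threshold = random & 0xFFFFu;  /* uniform in [0, 65535] */
    uint32_t dropped   = bits & 0xFFFFu;    /* bits that bf16 cannot hold */
    uint32_t upper     = bits >> 16;        /* truncated bf16 pattern */
    if (dropped > threshold) {
        upper += 1;                         /* step to the next bf16 value */
    }
    return (uint16_t)upper;
}
```

In a kernel, `x` and `y` would play the role of the thread and block indices, so a call such as `stochastic_round_bf16(value, noise_2d(x, y, seed))` gives each position its own rounding decision without changing the seed between threads.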
Actually just see this: https://github.com/karpathy/llm.c/pull/597
@ngc92 - this is the other PR you were referring to for testing?