llm.c only save missing bits to reconstruct fp32 master weights

Open ngc92 opened this issue 1 year ago • 2 comments

I think I managed to get the bit-fiddling right, and this will effectively give us fp31 master parameters at the cost of only 16 additional bits (instead of the current 32).
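For readers following along, here is a minimal sketch of how 16 extra bits can give back 31 of the 32 master-weight bits. It is not taken from the PR: the names `split_master_t`, `split_master`, and `merge_master` are hypothetical, and the layout (15 low mantissa bits plus one flag recording whether the bf16 value was rounded up) is an assumption about one way such a scheme could work.

```c
#include <stdint.h>
#include <string.h>

// Hypothetical container: the bf16 parameter plus the 16 extra bits kept next to it.
typedef struct {
    uint16_t param_bf16;  // bit pattern of the (possibly rounded-up) bf16 parameter
    uint16_t extra_bits;  // 15 low mantissa bits + 1 rounding-direction flag
} split_master_t;

static uint32_t float_to_bits(float f) { uint32_t b; memcpy(&b, &f, sizeof b); return b; }
static float bits_to_float(uint32_t b) { float f; memcpy(&f, &b, sizeof f); return f; }

// round_up is 1 if rounding moved the bf16 value to the next bit pattern, else 0.
static split_master_t split_master(float master, int round_up) {
    uint32_t bits = float_to_bits(master);
    uint16_t hi = (uint16_t)(bits >> 16);      // the bits a truncated bf16 would keep
    uint16_t lo = (uint16_t)(bits & 0xFFFFu);  // the bits bf16 drops
    split_master_t s;
    s.param_bf16 = (uint16_t)(hi + (round_up ? 1 : 0));
    // Sacrifice the lowest mantissa bit to remember the rounding direction.
    s.extra_bits = (uint16_t)((lo & 0xFFFEu) | (round_up ? 1u : 0u));
    return s;
}

// Undo the rounding with the flag, then glue the two halves back together.
static float merge_master(split_master_t s) {
    uint16_t hi = (uint16_t)(s.param_bf16 - (s.extra_bits & 1u));
    uint32_t bits = ((uint32_t)hi << 16) | (uint32_t)(s.extra_bits & 0xFFFEu);
    return bits_to_float(bits);
}
```

Under these assumptions, `merge_master(split_master(w, r))` returns `w` with only its lowest mantissa bit cleared, whichever way the bf16 value was rounded, which is where the "fp31" comes from.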

Before merging, the code really needs to be critically reviewed. I'm especially unsure about the stochastic rounding code (why do we have two calls to the RNG for each rounding operation?).

May 19 '24 13:05 ngc92

I assume two calls are due to the fact that we don't want each thread in the kernel to do stochastic rounding with the same seed.

At least that was the idea, but we'll end up having the same seed either way because it's pseudorandom. :')

Thus I'd say it's likely a bug.

EDIT: actually, the unique position of the thread (threadIdx.x and blockIdx.x) inside the stochastic rounding function modulates the Get2dNoiseUint call there, so the seed can remain constant. The other, external Get2dNoiseUint call also includes Y information, which helps strided kernels get a unique rounding seed.
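To illustrate the point about position-dependent noise, here is a rough host-side sketch in plain C. It is not the llm.c code: `toy_noise_2d` is a hypothetical stand-in hash (not the real Get2dNoiseUint), and `thread_idx` / `block_idx` are ordinary arguments standing in for threadIdx.x and blockIdx.x. It only covers the inner call; the external call that mixes in Y information is left out.

```c
#include <stdint.h>
#include <string.h>

// Hypothetical stand-in for a position-based noise function (NOT Get2dNoiseUint):
// mixes two coordinates and a seed into a pseudorandom 32-bit value.
static uint32_t toy_noise_2d(uint32_t x, uint32_t y, uint32_t seed) {
    uint32_t h = seed ^ (x * 0x9E3779B1u) ^ (y * 0x85EBCA77u);
    h ^= h >> 16; h *= 0x7FEB352Du;
    h ^= h >> 15; h *= 0x846CA68Bu;
    h ^= h >> 16;
    return h;
}

// Stochastically round an fp32 value to a bf16 bit pattern.
// The seed is shared by every thread; the thread position makes the threshold differ.
static uint16_t stochastic_round_bf16(float in, uint32_t thread_idx,
                                      uint32_t block_idx, uint32_t seed) {
    uint32_t bits;
    memcpy(&bits, &in, sizeof bits);
    uint32_t threshold = toy_noise_2d(thread_idx, block_idx, seed) & 0xFFFFu;
    uint32_t dropped = bits & 0xFFFFu;  // the 16 bits bf16 cannot represent
    uint32_t hi = bits >> 16;           // the truncated bf16 bit pattern
    // Round up with probability roughly dropped / 65536.
    return (uint16_t)(hi + (dropped > threshold ? 1u : 0u));
}
```

With a fixed seed, two threads at different positions draw different thresholds, so the same input can round down in one thread and up in another, while a value that is exactly representable in bf16 never changes.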

Actually just see this: https://github.com/karpathy/llm.c/pull/597

Jun 15 '24 08:06 gordicaleksa

@ngc92 - this is the other PR you were referring to for testing?

Jun 18 '24 07:06 rosslwheeler
