llm.c
[todo] Accumulate in double instead of float
I have not kept good hygiene on using double for accumulators everywhere that we have local register variables. Accumulation should be done in double, with reads/writes in float. TODO: fix this everywhere, part by part. Doing so most likely gives much better precision/accuracy at near-zero additional compute/memory cost.
Example of an accumulator that should have been a double instead of a float:

float m = 0.0f;
for (int i = 0; i < C; i++) {
    m += x[i];
}
m = m / C;
Are you talking about these warnings?

train_gpt2.c(232,17): warning C4244: 'initializing': conversion from 'double' to 'float', possible loss of data
train_gpt2.c(304,31): warning C4244: 'argument': conversion from 'int' to 'float', possible loss of data
train_gpt2.c(304,17): warning C4244: 'initializing': conversion from 'double' to 'float', possible loss of data
train_gpt2.c(362,43): warning C4305: 'argument': truncation from 'double' to 'float'
train_gpt2.c(370,26): warning C4305: 'argument': truncation from 'double' to 'float'
train_gpt2.c(374,77): warning C4305: 'argument': truncation from 'double' to 'float'
train_gpt2.c(934,47): warning C4244: 'argument': conversion from 'int' to 'float', possible loss of data
train_gpt2.c(935,47): warning C4244: 'argument': conversion from 'int' to 'float', possible loss of data
train_gpt2.cu(728): warning C4244: 'argument': conversion from 'int' to 'float', possible loss of data
train_gpt2.cu(728): warning C4244: 'initializing': conversion from 'double' to 'float', possible loss of data
Kahan summation is a well-established technique for reducing numerical error in summation:
https://en.m.wikipedia.org/wiki/Kahan_summation_algorithm
It's rather simple to implement.
Putting here a summary of the discussion resulting from this CR: https://github.com/karpathy/llm.c/pull/221. On gaming GPUs there is a significant slowdown, on the order of 32x, for double arithmetic compared to float, and using doubles instead of floats for the summation in this location led to a 2x slowdown even though the variance of the error was reduced. This will have to be decided case by case, weighing the tradeoff between compute-bound and memory-bound kernels.