llm.c
Gain another 10-20%+ on CPU performance on gcc by moving -fno-finite-math-only to only gelu_backwards
More targeted flag optimizations for gcc
It's the tanhf function in gelu_backwards that causes the model to fail with -ffast-math on gcc on Linux.
Before:
$ grep name /proc/cpuinfo |head -1
model name : Intel(R) Core(TM) i3-9100F CPU @ 3.60GHz
step 0: train loss 5.356086 (took 6167.853384 ms)
step 1: train loss 4.300644 (took 5460.413776 ms)
step 2: train loss 4.623082 (took 5276.372294 ms)
vs
step 0: train loss 5.356185 (took 5714.622339 ms)
step 1: train loss 4.301033 (took 4814.820671 ms)
step 2: train loss 4.623316 (took 4813.711103 ms)
$ grep name /proc/cpuinfo |head -1
model name : AMD Ryzen 5 3600 6-Core Processor
step 0: train loss 5.356085 (took 3397.901288 ms)
step 1: train loss 4.300644 (took 2810.743621 ms)
step 2: train loss 4.623083 (took 2813.287769 ms)
vs
step 0: train loss 5.356185 (took 2639.362407 ms)
step 1: train loss 4.301032 (took 2258.179942 ms)
step 2: train loss 4.623315 (took 2261.548428 ms)
Timings obtained with:
( kill -STOP -1 # Stop all processes, NB don't run this outside a script!
timeout 40s ./train_gpt2
kill -CONT -1 )
Also noted:
~$ gcc -Ofast -Q --help=optimizers|grep enabled > a
~$ gcc -O3 -Ofast -Q --help=optimizers|grep enabled > b
~$ diff a b
Also resolves #19 for good I think
So maybe this is ok to merge... 1) it looks a little funny, is there no way to combine the double nested if into one condition? 2) i think a comment explaining this would go a long way
Please don't forget about MSVC/Windows. MSVC uses a pragma to turn off the optimization:
#pragma optimize( "", off )
/* unoptimized code section */
#pragma optimize( "", on )
This is really ugly. I know.
My issue with adding pragmas to source files (OpenMP excluded) is that you will keep adding more per platform/compiler. One suggestion was to split this function off into its own file; then you can use the Makefile to compile it with whatever flags suit the platform/compiler. Makefiles typically have platform dependencies in them. It might be easier from a maintenance standpoint to keep the source code as clean as possible?
@dagelf i knew we could still go further with the cpu, thanks! looking into it
yes, you can write this @dagelf:
#if defined(__GNUC__) && !defined(__clang__)
__attribute__((optimize("no-finite-math-only")))
#endif
@karpathy ifdefs squashed and comment added
Does it bug out on MSVC with -Ofast too?
yep
Tested to work with and speed up msvc too.
I'm sorry this is too weird and ugly to merge I think. Can someone try alternative strategies? For example tanh can be written as a function of exp quite trivially, maybe calling it that way makes it ok?
Tried that, will need to do both tanhf and expf, busy with the latter... but it might be even uglier. It's really the msvc part that makes it ugly IMHO 😄
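For reference, the rewrite being discussed above relies on the identity tanh(x) = (e^(2x) - 1) / (e^(2x) + 1). A minimal sketch of what that looks like (tanhf_via_expf is a made-up name, not anything in the repo, and it still leans on expf, so -ffast-math can bite it in the same way):
#include <math.h>

// Sketch only: tanhf expressed through a single expf call.
static inline float tanhf_via_expf(float x) {
    // Saturate early: tanhf is +/-1 to float precision long before |x| = 20,
    // and this avoids expf overflowing to inf.
    if (x > 20.0f) return 1.0f;
    if (x < -20.0f) return -1.0f;
    float e2x = expf(2.0f * x);
    return (e2x - 1.0f) / (e2x + 1.0f);
}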
Simply adding:
__attribute__((optimize("no-finite-math-only")))
fixes it for gcc. clang always works, but is slow. msvc needs the pragmas before and after. The #ifdefs are just there to eliminate warnings for foreign pragmas when compiling.
For now I'm just going to remove the ifdefs to get this down to only two lines, to keep it clean.
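To make the shape of the change concrete, here is a rough sketch of the guard pattern being described, wrapped around a stand-in function (demo_tanh_grad is hypothetical; in the actual patch the guards go around gelu_backward, clang needs nothing, and the final version drops the #ifdefs as noted above):
#include <math.h>

#ifdef _MSC_VER
#pragma optimize( "", off )   /* msvc: turn optimization off for this region */
#endif
#if defined(__GNUC__) && !defined(__clang__)
__attribute__((optimize("no-finite-math-only")))  /* gcc: undo -Ofast's finite-math assumption for this function only */
#endif
float demo_tanh_grad(float x) {
    float t = tanhf(x);
    return 1.0f - t * t;      /* d/dx tanh(x) = 1 - tanh(x)^2 */
}
#ifdef _MSC_VER
#pragma optimize( "", on )    /* msvc: restore optimization afterwards */
#endif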
Going down the route of performant custom math functions means breaking cross-platform compatibility, unless we start exploring lookup tables for CPU inference, which I will explore next.
There sure is more performance to be gained. I quickly realized that a faster activation function might lead to slower convergence and more training steps, negating the benefits. This is my cue to learn more about what makes the activation function work so that I can develop a better intuition for it. (Any pointers appreciated!)
For the record, it's actually the exponential in the coshf that has the biggest influence on whatever makes gelu_backward break the model. Looking at the activation function graphs above, I think I can see why :smile:
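For context, a tanh-approximation GELU backward looks roughly like the sketch below (written from the standard formula GELU(x) ≈ 0.5*x*(1 + tanh(s*(x + 0.044715*x^3))) with s = sqrt(2/pi), so it may not match the repo's code line for line). The sech^2 term, computed through coshf, is where the exponential enters:
#include <math.h>

// Sketch of the tanh-approximation GELU backward pass; not necessarily
// identical to the repo's gelu_backward.
void gelu_backward_sketch(float* dinp, const float* inp, const float* dout, int N) {
    const float s = 0.7978845608f;  // sqrt(2/pi)
    for (int i = 0; i < N; i++) {
        float x = inp[i];
        float u = s * (x + 0.044715f * x * x * x);
        float tanh_u = tanhf(u);
        float cosh_u = coshf(u);                   // cosh(u) = (e^u + e^-u) / 2
        float sech2_u = 1.0f / (cosh_u * cosh_u);  // sech^2(u) = 1 - tanh^2(u)
        float local_grad = 0.5f * (1.0f + tanh_u)
                         + 0.5f * x * sech2_u * s * (1.0f + 3.0f * 0.044715f * x * x);
        dinp[i] += local_grad * dout[i];
    }
}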
If anybody else wants to explore platform specific math function optimizations, here is a good start: https://github.com/bminor/glibc/tree/master/sysdeps/x86_64/fpu
Before playing with lookup tables, I'll compare performance of different activation functions.
Lookup tables are a great idea
Some more reference materials...
Activation functions
https://web.archive.org/web/20230324020355/https://chadrick-kwag.net/relu-gelu-swish-mish-activation-function-comparison/
https://web.archive.org/web/20220619144443/https://arxiv.org/vc/arxiv/papers/1908/1908.08681v2.pdf
https://consensus.app/results/?q=neural%20activation%20function
Math function optimizations
https://zenodo.org/records/4685966
https://github.com/bminor/glibc/tree/master/sysdeps/x86_64/fpu
https://news.ycombinator.com/item?id=8828936
https://libc.llvm.org/math/
https://dl.acm.org/doi/fullHtml/10.1145/3624062.3624166
https://stackoverflow.com/questions/47025373/fastest-implementation-of-the-natural-exponential-function-using-sse
https://stackoverflow.com/questions/9799041/efficient-implementation-of-natural-logarithm-ln-and-exponentiation
Update:
Lookup tables are a great idea
So, just tried this. Turns out it might not be such a great idea... modern CPUs are weird! So far they're slower, unless I make them really small. Also, this function hardly gets called. Pushed the lookup table code to my repo.
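For anyone curious what the lookup-table experiment looks like in principle, here is a minimal illustration (my own sketch, not the code pushed to the repo above): sample tanhf over a clamped range and linearly interpolate. Whether this beats a plain tanhf call depends mostly on table size and cache behaviour, which lines up with the observation that only really small tables helped.
#include <math.h>

// Illustration only: tanhf via a small lookup table with linear interpolation.
// Call tanh_lut_init() once before using tanh_lut_lookup().
#define TANH_LUT_SIZE 256
#define TANH_LUT_RANGE 4.0f

static float tanh_lut[TANH_LUT_SIZE];

static void tanh_lut_init(void) {
    for (int i = 0; i < TANH_LUT_SIZE; i++) {
        float x = -TANH_LUT_RANGE + 2.0f * TANH_LUT_RANGE * i / (TANH_LUT_SIZE - 1);
        tanh_lut[i] = tanhf(x);
    }
}

static float tanh_lut_lookup(float x) {
    // Saturate outside the sampled range (approximate: tanh(4) ~= 0.9993).
    if (x <= -TANH_LUT_RANGE) return -1.0f;
    if (x >=  TANH_LUT_RANGE) return  1.0f;
    float pos = (x + TANH_LUT_RANGE) / (2.0f * TANH_LUT_RANGE) * (TANH_LUT_SIZE - 1);
    int idx = (int)pos;
    if (idx >= TANH_LUT_SIZE - 1) return tanh_lut[TANH_LUT_SIZE - 1];
    float frac = pos - (float)idx;
    return tanh_lut[idx] + frac * (tanh_lut[idx + 1] - tanh_lut[idx]);
}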