llm.c
Gain another 10-20%+ on CPU performance on gcc by moving -fno-finite-math-only to only gelu_backwards
More targeted flag optimizations for gcc
It's the tanhf function in gelu_backwards that causes the model to fail with -ffast-math on gcc on Linux.
Before:
$ grep name /proc/cpuinfo |head -1
model name : Intel(R) Core(TM) i3-9100F CPU @ 3.60GHz
step 0: train loss 5.356086 (took 6167.853384 ms)
step 1: train loss 4.300644 (took 5460.413776 ms)
step 2: train loss 4.623082 (took 5276.372294 ms)
vs
step 0: train loss 5.356185 (took 5714.622339 ms)
step 1: train loss 4.301033 (took 4814.820671 ms)
step 2: train loss 4.623316 (took 4813.711103 ms)
$ grep name /proc/cpuinfo |head -1
model name : AMD Ryzen 5 3600 6-Core Processor
step 0: train loss 5.356085 (took 3397.901288 ms)
step 1: train loss 4.300644 (took 2810.743621 ms)
step 2: train loss 4.623083 (took 2813.287769 ms)
vs
step 0: train loss 5.356185 (took 2639.362407 ms)
step 1: train loss 4.301032 (took 2258.179942 ms)
step 2: train loss 4.623315 (took 2261.548428 ms)
Timings obtained with:
( kill -STOP -1 # Stop all processes, NB don't run this outside a script!
timeout 40s ./train_gpt2
kill -CONT -1 )
Also noted:
~$ gcc -Ofast -Q --help=optimizers|grep enabled > a
~$ gcc -O3 -Ofast -Q --help=optimizers|grep enabled > b
~$ diff a b
Also resolves #19 for good I think
So maybe this is ok to merge... 1) it looks a little funny, is there no way to combine the double nested if into one condition? 2) i think a comment explaining this would go a long way
Please don't forget about MSVC/Windows. MSVC uses a pragma to turn off the optimization:
#pragma optimize( "", off )
/* unoptimized code section */
#pragma optimize( "", on )
This is really ugly. I know.
My issue with adding pragmas to source files (OpenMP excluded) is that you will keep adding more per platform/compiler. One suggestion was to split this function off into its own file; then you can use the Makefile to compile it with whatever flags suit the platform/compiler. Makefiles typically have platform dependencies in them. It might be easier from a maintenance standpoint to keep the source code as clean as possible?
@dagelf i knew we could still go further with the cpu, thanks! looking into it
yes, you can write this @dagelf:
#if defined(__GNUC__) && !defined(__clang__)
__attribute__((optimize("no-finite-math-only")))
#endif
@karpathy ifdefs squashed and comment added
Does it bug out on MSVC with -Ofast too?
yep
Tested to work with and speed up msvc too.
I'm sorry this is too weird and ugly to merge I think. Can someone try alternative strategies? For example tanh can be written as a function of exp quite trivially, maybe calling it that way makes it ok?
Tried that, will need to do both tanhf and expf, busy with the latter... but it might be even uglier. It's really the msvc part that makes it ugly IMHO 😄
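For reference, the rewrite being discussed above relies on the identity tanh(x) = (e^(2x) - 1) / (e^(2x) + 1). A minimal sketch of what that looks like (tanhf_via_expf is a made-up name, not anything in the repo, and it still leans on expf, so -ffast-math can bite it in the same way):
#include <math.h>

// Sketch only: tanhf expressed through a single expf call.
static inline float tanhf_via_expf(float x) {
    // Saturate early: tanhf is +/-1 to float precision long before |x| = 20,
    // and this avoids expf overflowing to inf.
    if (x > 20.0f) return 1.0f;
    if (x < -20.0f) return -1.0f;
    float e2x = expf(2.0f * x);
    return (e2x - 1.0f) / (e2x + 1.0f);
}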
Simply adding:
__attribute__((optimize("no-finite-math-only")))
fixes it for gcc. clang always works, but is slow. msvc needs the pragmas before and after. The #ifdefs are just there to eliminate warnings for foreign pragmas when compiling.
For now I'm just going to remove the ifdefs to get this down to only two lines, to keep it clean.
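To make the shape of the change concrete, here is a rough sketch of the guard pattern being described, wrapped around a stand-in function (demo_tanh_grad is hypothetical; in the actual patch the guards go around gelu_backward, clang needs nothing, and the final version drops the #ifdefs as noted above):
#include <math.h>

#ifdef _MSC_VER
#pragma optimize( "", off )   /* msvc: turn optimization off for this region */
#endif
#if defined(__GNUC__) && !defined(__clang__)
__attribute__((optimize("no-finite-math-only")))  /* gcc: undo -Ofast's finite-math assumption for this function only */
#endif
float demo_tanh_grad(float x) {
    float t = tanhf(x);
    return 1.0f - t * t;      /* d/dx tanh(x) = 1 - tanh(x)^2 */
}
#ifdef _MSC_VER
#pragma optimize( "", on )    /* msvc: restore optimization afterwards */
#endif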
Going down the route of performant custom math functions means breaking cross-platform compatibility, unless we start exploring lookup tables for CPU inference, which I will explore next.
There sure is more performance to be gained. I quickly realized that a faster activation function might lead to slower convergence and more training steps, negating the benefits. This is my cue to learn more about what makes the activation function work so that I can develop a better intuition for it. (Any pointers appreciated!)
For the record, it's actually the exponential in the coshf that has the biggest influence on whatever makes gelu_backward break the model. Looking at the activation function graphs above, I think I can see why :smile:
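For context, a tanh-approximation GELU backward looks roughly like the sketch below (written from the standard formula GELU(x) ≈ 0.5*x*(1 + tanh(s*(x + 0.044715*x^3))) with s = sqrt(2/pi), so it may not match the repo's code line for line). The sech^2 term, computed through coshf, is where the exponential enters:
#include <math.h>

// Sketch of the tanh-approximation GELU backward pass; not necessarily
// identical to the repo's gelu_backward.
void gelu_backward_sketch(float* dinp, const float* inp, const float* dout, int N) {
    const float s = 0.7978845608f;  // sqrt(2/pi)
    for (int i = 0; i < N; i++) {
        float x = inp[i];
        float u = s * (x + 0.044715f * x * x * x);
        float tanh_u = tanhf(u);
        float cosh_u = coshf(u);                   // cosh(u) = (e^u + e^-u) / 2
        float sech2_u = 1.0f / (cosh_u * cosh_u);  // sech^2(u) = 1 - tanh^2(u)
        float local_grad = 0.5f * (1.0f + tanh_u)
                         + 0.5f * x * sech2_u * s * (1.0f + 3.0f * 0.044715f * x * x);
        dinp[i] += local_grad * dout[i];
    }
}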
If anybody else wants to explore platform specific math function optimizations, here is a good start: https://github.com/bminor/glibc/tree/master/sysdeps/x86_64/fpu
Before playing with lookup tables, I'll compare performance of different activation functions.
Lookup tables are a great idea
Some more reference materials...
Activation functions
https://web.archive.org/web/20230324020355/https://chadrick-kwag.net/relu-gelu-swish-mish-activation-function-comparison/
https://web.archive.org/web/20220619144443/https://arxiv.org/vc/arxiv/papers/1908/1908.08681v2.pdf
https://consensus.app/results/?q=neural%20activation%20function
Math function optimizations
https://zenodo.org/records/4685966
https://github.com/bminor/glibc/tree/master/sysdeps/x86_64/fpu
https://news.ycombinator.com/item?id=8828936
https://libc.llvm.org/math/
https://dl.acm.org/doi/fullHtml/10.1145/3624062.3624166
https://stackoverflow.com/questions/47025373/fastest-implementation-of-the-natural-exponential-function-using-sse
https://stackoverflow.com/questions/9799041/efficient-implementation-of-natural-logarithm-ln-and-exponentiation
Update:
Lookup tables are a great idea
So, just tried this. Turns out it might not be such a great idea... modern CPUs are weird! So far they're slower, unless I make them really small. Also, this function hardly gets called. Pushed the lookup table code to my repo.
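For anyone curious what the lookup-table experiment looks like in principle, here is a minimal illustration (my own sketch, not the code pushed to the repo above): sample tanhf over a clamped range and linearly interpolate. Whether this beats a plain tanhf call depends mostly on table size and cache behaviour, which lines up with the observation that only really small tables helped.
#include <math.h>

// Illustration only: tanhf via a small lookup table with linear interpolation.
// Call tanh_lut_init() once before using tanh_lut_lookup().
#define TANH_LUT_SIZE 256
#define TANH_LUT_RANGE 4.0f

static float tanh_lut[TANH_LUT_SIZE];

static void tanh_lut_init(void) {
    for (int i = 0; i < TANH_LUT_SIZE; i++) {
        float x = -TANH_LUT_RANGE + 2.0f * TANH_LUT_RANGE * i / (TANH_LUT_SIZE - 1);
        tanh_lut[i] = tanhf(x);
    }
}

static float tanh_lut_lookup(float x) {
    // Saturate outside the sampled range (approximate: tanh(4) ~= 0.9993).
    if (x <= -TANH_LUT_RANGE) return -1.0f;
    if (x >=  TANH_LUT_RANGE) return  1.0f;
    float pos = (x + TANH_LUT_RANGE) / (2.0f * TANH_LUT_RANGE) * (TANH_LUT_SIZE - 1);
    int idx = (int)pos;
    if (idx >= TANH_LUT_SIZE - 1) return tanh_lut[TANH_LUT_SIZE - 1];
    float frac = pos - (float)idx;
    return tanh_lut[idx] + frac * (tanh_lut[idx + 1] - tanh_lut[idx]);
}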