llm.c
fix for Error: must forward with targets before backward [#19]
fixes #19
The fix works on my machine running Ubuntu 22.04 with an Intel 13th-gen CPU, thanks.
Little update: the problem seems to be with -Ofast. Using -O3 and -Ofast together throws error #19, and using only -Ofast also throws the error. With only -O3 it works, on Ubuntu 22.04.
Can confirm this makes tests pass.
Honestly weird this works. Did you just find a compiler bug in gcc?!
I found out that -Ofast enables a lot of aggressive optimizations that can alter the behaviour of the program; in particular it enables the -ffast-math flag, which can mess with floating-point arithmetic. So keeping -O3 and -Ofast but disabling just -ffast-math works (sketched below).
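For reference, a minimal sketch of what that CFLAGS change could look like, assuming the current single-file Makefile layout (the real Makefile may set additional flags such as -march=native or -fopenmp):
# Hypothetical CFLAGS in the Makefile: keep the aggressive -O3/-Ofast levels,
# but turn -ffast-math back off so floating-point behaviour stays IEEE-conformant.
CFLAGS = -O3 -Ofast -fno-fast-math -Wno-unused-result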
@ent0n29 does this make the code slower for you? I wonder if I should merge this to master and potentially get rid of the README comment.
I don't think it makes the code slower, since -O3 and -Ofast are still enabled and only -ffast-math is disabled. I don't know what the contribution of -ffast-math is in terms of speed, but since we are working on floats it is important to maintain machine precision, and this flag creates problems with floating-point arithmetic. I can't test the speed difference myself, since it seems to only work on macOS machines with -ffast-math enabled; maybe you could check whether you see any notable differences?
more tests here -> https://github.com/karpathy/llm.c/issues/19#issuecomment-2049099144
I have collected some data by running OMP_NUM_THREADS=8 ./train_gpt2 on an Apple Silicon M2. Although not the best possible benchmark, it gives a clear picture of fast-math's contribution.
As a side note: both of the following runs were performed under roughly the same overall machine load.
Enabling -O3 -Ofast -Wno-unused-result (the default CFLAGS atm) I got the following result:
[GPT-2]
max_seq_len: 1024
vocab_size: 50257
num_layers: 12
num_heads: 12
channels: 768
num_parameters: 124439808
train dataset num_batches: 1192
val dataset num_batches: 128
num_activations: 73323776
val loss 5.252026
step 0: train loss 5.356190 (took 3303.293000 ms)
step 1: train loss 4.301069 (took 3191.455000 ms)
step 2: train loss 4.623322 (took 2881.769000 ms)
step 3: train loss 4.600470 (took 2835.179000 ms)
step 4: train loss 4.616786 (took 2947.070000 ms)
step 5: train loss 4.231483 (took 2944.147000 ms)
step 6: train loss 3.754234 (took 3088.123000 ms)
step 7: train loss 3.652349 (took 2940.905000 ms)
step 8: train loss 4.183590 (took 2938.175000 ms)
step 9: train loss 4.199315 (took 3039.444000 ms)
val loss 4.425458
By disabling fast-math (with -fno-fast-math) performance drops, roughly doubling the time per step:
[GPT-2]
max_seq_len: 1024
vocab_size: 50257
num_layers: 12
num_heads: 12
channels: 768
num_parameters: 124439808
train dataset num_batches: 1192
val dataset num_batches: 128
num_activations: 73323776
val loss 5.251911
step 0: train loss 5.356084 (took 6769.919000 ms)
step 1: train loss 4.300642 (took 6584.231000 ms)
step 2: train loss 4.623087 (took 6278.325000 ms)
step 3: train loss 4.599361 (took 6234.404000 ms)
step 4: train loss 4.616666 (took 6789.834000 ms)
step 5: train loss 4.231430 (took 7304.708000 ms)
step 6: train loss 3.753160 (took 6959.676000 ms)
step 7: train loss 3.650456 (took 6762.026000 ms)
step 8: train loss 4.182242 (took 7499.524000 ms)
step 9: train loss 4.199580 (took 7104.202000 ms)
val loss 4.426363
Some stats to better evaluate the consistency of these measurements:
fast-math: mean := 3010.956 ms, std-dev := 138.254 ms
no-fast-math: mean := 6828.685 ms, std-dev := 386.642 ms
Yes, -Ofast with -ffast-math enabled seems to be the best option, but it looks like it only works on macOS; -Ofast -fno-fast-math is a compromise, but it should work for everyone.
What about enabling fast-math conditionally, depending on the host OS?
uname -s could provide the means for detecting the OS, e.g. something like the sketch below.
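To illustrate the idea, a minimal GNU Make sketch (variable names other than CFLAGS are illustrative, not from the repo's Makefile):
# Detect the host OS once; on macOS (Darwin) keep full fast-math,
# elsewhere fall back to the safer -fno-fast-math variant.
UNAME_S := $(shell uname -s)
ifeq ($(UNAME_S),Darwin)
  CFLAGS = -O3 -Ofast -Wno-unused-result
else
  CFLAGS = -O3 -Ofast -fno-fast-math -Wno-unused-result
endif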
After some investigation, it seems the error is caused by the combination of the GCC options -fno-math-errno, -funsafe-math-optimizations and -ffinite-math-only introduced by -Ofast / -ffast-math, and it can be solved by adding -fno-finite-math-only instead of -fno-fast-math, for better performance.
Btw, if I change cc to clang on my Ubuntu 22.04, it works well without modifying any other options, and achieves the best running speed.
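For clarity, a minimal sketch of the two alternatives described above, written as Makefile variables (illustrative only; the real Makefile may set additional flags such as -march=native or -fopenmp):
# Option A: stay on cc (gcc), keep -Ofast, but re-disable only the finite-math assumption
CC = cc
CFLAGS = -Ofast -fno-finite-math-only -Wno-unused-result
# Option B: switch the compiler to clang and keep the original flags unchanged
# CC = clang
# CFLAGS = -O3 -Ofast -Wno-unused-result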
@DongbinNie wow, it's amazing, on my ThinkPad using -Ofast with -fno-finite-math-only (and keep using cc) the performance is almost 2x!!! (from 9309ms to 4966ms per step)
@DongbinNie I can confirm that performance is 2x with -fno-finite-math-only instead of -fno-fast-math, thanks for the help! I was doing roughly the same thing, testing the individual optimization flags enabled by the -Ofast / -ffast-math combo.
@karpathy have you tried compiling with -fno-finite-math-only instead of -fno-fast-math?
Sorry @ent0n29 what is the final recommendation here right now? Is it
CFLAGS = -Ofast -fno-finite-math-only -Wno-unused-result -march=native
Yes @karpathy, -fno-finite-math-only instead of -fno-fast-math, for an almost 2x improvement.
I went from ~17 seconds per step with -fno-fast-math, to this with -fno-finite-math-only & -march=native:
step 0: train loss 5.356086 (took 9611.016869 ms)
step 1: train loss 4.300644 (took 8780.770364 ms)
step 2: train loss 4.623083 (took 7137.313333 ms)
step 3: train loss 4.599365 (took 6426.557283 ms)
step 4: train loss 4.616659 (took 6864.495874 ms)
step 5: train loss 4.231428 (took 6635.326672 ms)
step 6: train loss 3.753164 (took 6516.244371 ms)
step 7: train loss 3.650456 (took 6362.145571 ms)
step 8: train loss 4.182243 (took 6333.873539 ms)
step 9: train loss 4.199581 (took 6407.929416 ms)
I'm closing this to open a new PR synced with master.