
fix for Error: must forward with targets before backward [#19]

Open ent0n29 opened this issue 1 year ago • 13 comments

fixes #19

ent0n29 avatar Apr 09 '24 09:04 ent0n29

The fix works on my machine running Ubuntu 22.04 with an Intel 13th-gen CPU, thanks.

DongbinNie avatar Apr 09 '24 13:04 DongbinNie

A little update: the problem seems to be with -Ofast. Using -O3 and -Ofast together throws error #19, and using only -Ofast also throws it; with only -O3 it works, on Ubuntu 22.04.

ent0n29 avatar Apr 09 '24 13:04 ent0n29

Can confirm this makes tests pass.

chsasank avatar Apr 09 '24 16:04 chsasank

Honestly weird this works. Did you just find a compiler bug in gcc?!

chsasank avatar Apr 09 '24 16:04 chsasank

I found out that -Ofast enables a lot of aggressive optimizations that can alter the behaviour of the program; in particular, it enables the flag -ffast-math, which can mess with floating-point arithmetic. So keeping -O3 and -Ofast but disabling just the -ffast-math optimization works.
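In Makefile terms, this is roughly the change being discussed (just a sketch; the exact default CFLAGS in master may differ):

# keep the aggressive -O3/-Ofast optimizations, but opt out of the -ffast-math
# part that -Ofast implies, since it relaxes IEEE floating-point semantics
CFLAGS = -O3 -Ofast -fno-fast-math -Wno-unused-result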

ent0n29 avatar Apr 09 '24 16:04 ent0n29

@ent0n29 does this make the code slower for you? I wonder if I should merge to master and potentially get rid of the README comment

karpathy avatar Apr 10 '24 19:04 karpathy

I don't think it makes the code slower, since -O3 and -Ofast remain enabled and only -ffast-math is disabled. I don't know how much -ffast-math contributes in terms of speed, but since we are working with floats it is important to maintain machine precision, and this flag creates problems with floating-point arithmetic. I can't test the speed difference, since it only seems to work on macOS machines with -ffast-math enabled/disabled; maybe you could check whether you see any notable differences?

more tests here -> https://github.com/karpathy/llm.c/issues/19#issuecomment-2049099144

ent0n29 avatar Apr 10 '24 19:04 ent0n29

I have collected some data by running OMP_NUM_THREADS=8 ./train_gpt2 on an Apple Silicon M2. While not the best possible benchmark, it provides a clear picture of fast-math's contribution.

As a side note: both of the following runs were performed under roughly the same overall machine workload.

With -O3 -Ofast -Wno-unused-result (the default CFLAGS at the moment) I got the following result:

[GPT-2]
max_seq_len: 1024
vocab_size: 50257
num_layers: 12
num_heads: 12
channels: 768
num_parameters: 124439808
train dataset num_batches: 1192
val dataset num_batches: 128
num_activations: 73323776
val loss 5.252026
step 0: train loss 5.356190 (took 3303.293000 ms)
step 1: train loss 4.301069 (took 3191.455000 ms)
step 2: train loss 4.623322 (took 2881.769000 ms)
step 3: train loss 4.600470 (took 2835.179000 ms)
step 4: train loss 4.616786 (took 2947.070000 ms)
step 5: train loss 4.231483 (took 2944.147000 ms)
step 6: train loss 3.754234 (took 3088.123000 ms)
step 7: train loss 3.652349 (took 2940.905000 ms)
step 8: train loss 4.183590 (took 2938.175000 ms)
step 9: train loss 4.199315 (took 3039.444000 ms)
val loss 4.425458

Disabling fast-math (with -fno-fast-math) makes performance drop, roughly doubling the time for each step:

[GPT-2]
max_seq_len: 1024
vocab_size: 50257
num_layers: 12
num_heads: 12
channels: 768
num_parameters: 124439808
train dataset num_batches: 1192
val dataset num_batches: 128
num_activations: 73323776
val loss 5.251911
step 0: train loss 5.356084 (took 6769.919000 ms)
step 1: train loss 4.300642 (took 6584.231000 ms)
step 2: train loss 4.623087 (took 6278.325000 ms)
step 3: train loss 4.599361 (took 6234.404000 ms)
step 4: train loss 4.616666 (took 6789.834000 ms)
step 5: train loss 4.231430 (took 7304.708000 ms)
step 6: train loss 3.753160 (took 6959.676000 ms)
step 7: train loss 3.650456 (took 6762.026000 ms)
step 8: train loss 4.182242 (took 7499.524000 ms)
step 9: train loss 4.199580 (took 7104.202000 ms)
val loss 4.426363

Some stats to better evaluate the consistency of these measurements (a rough way to extract them from the log is sketched after the list):

  • fast-math: mean := 3010.956 ms and std-dev := 138.254 ms
  • no-fast-math: mean := 6828.685 ms and std-dev := 386.642 ms
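
For reference, the per-step times can be pulled out of the training log with something like this (a rough sketch, not how the numbers above were produced, assuming the "(took ... ms)" format shown):

OMP_NUM_THREADS=8 ./train_gpt2 | tee train.log
# extract the per-step times and compute mean / std-dev with awk
grep -o 'took [0-9.]* ms' train.log | \
  awk '{ s += $2; ss += $2*$2; n++ } END { m = s/n; printf "mean=%.3f ms  std-dev=%.3f ms\n", m, sqrt(ss/n - m*m) }'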

mc-cat-tty avatar Apr 11 '24 16:04 mc-cat-tty

yes, -Ofast with -ffast-math enabled seems to be the best option, but it looks like it only works on macOS; -Ofast -fno-fast-math is a compromise, but it should work for everyone.

ent0n29 avatar Apr 11 '24 16:04 ent0n29

What about enabling fast-math conditionally, depending on the host OS?

uname -s could provide the means for detecting the OS, e.g.:
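Something along these lines (just a sketch of the idea, untested against the actual Makefile):

UNAME_S := $(shell uname -s)
ifeq ($(UNAME_S),Darwin)
  # fast-math has been fine on macOS so far, keep plain -Ofast
  CFLAGS = -Ofast -Wno-unused-result
else
  # elsewhere -ffast-math triggers the error from #19, so opt out of it
  CFLAGS = -Ofast -fno-fast-math -Wno-unused-result
endif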

mc-cat-tty avatar Apr 11 '24 16:04 mc-cat-tty

After some investigation, it seems the error is caused by the gcc options "-fno-math-errno -funsafe-math-optimizations -ffinite-math-only" introduced by "-Ofast" / "-ffast-math", and it can be solved by adding "-fno-finite-math-only" instead of "-fno-fast-math", for better performance.

Btw, if I change cc to clang on my Ubuntu 22.04, it works well without modifying any other options, and achieves the best running speed.
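In other words, the two working setups are roughly these (illustrative compile lines only, assuming the train_gpt2.c source and -lm; the real build goes through the Makefile):

# 1) gcc/cc: keep fast-math but re-enable NaN/Inf handling
cc -Ofast -fno-finite-math-only -Wno-unused-result -o train_gpt2 train_gpt2.c -lm
# 2) clang: reportedly works with the stock flags unchanged
clang -Ofast -Wno-unused-result -o train_gpt2 train_gpt2.c -lm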

DongbinNie avatar Apr 12 '24 10:04 DongbinNie

@DongbinNie wow, it's amazing: on my ThinkPad, using -Ofast with -fno-finite-math-only (and still using cc) the performance is almost 2x!!! (from 9309 ms to 4966 ms per step)

bexcite avatar Apr 12 '24 16:04 bexcite

@DongbinNie I can confirm that performance is 2x with -fno-finite-math-only instead of -fno-fast-math, thanks for the help! I was doing much the same thing, testing the optimization flags enabled by the -Ofast / -ffast-math combo.

ent0n29 avatar Apr 12 '24 17:04 ent0n29

@karpathy have you tried compiling with -fno-finite-math-only instead of -fno-fast-math?

ent0n29 avatar Apr 15 '24 17:04 ent0n29

Sorry @ent0n29 what is the final recommendation here right now? Is it

CFLAGS = -Ofast -fno-finite-math-only -Wno-unused-result -march=native

karpathy avatar Apr 15 '24 22:04 karpathy

yes @karpathy, -fno-finite-math-only instead of -fno-fast-math for an almost 2x improvement

ent0n29 avatar Apr 16 '24 08:04 ent0n29

I went from ~17 seconds per step with -fno-fast-math to this with -fno-finite-math-only & -march=native:

step 0: train loss 5.356086 (took 9611.016869 ms)
step 1: train loss 4.300644 (took 8780.770364 ms)
step 2: train loss 4.623083 (took 7137.313333 ms)
step 3: train loss 4.599365 (took 6426.557283 ms)
step 4: train loss 4.616659 (took 6864.495874 ms)
step 5: train loss 4.231428 (took 6635.326672 ms)
step 6: train loss 3.753164 (took 6516.244371 ms)
step 7: train loss 3.650456 (took 6362.145571 ms)
step 8: train loss 4.182243 (took 6333.873539 ms)
step 9: train loss 4.199581 (took 6407.929416 ms)

ent0n29 avatar Apr 16 '24 08:04 ent0n29

I'm closing this to open a new, synced one.

ent0n29 avatar Apr 16 '24 10:04 ent0n29