Use proper GeLU on CPU
This change removes the tanh GeLU approximation and uses the exact erf-based GeLU instead. This gives us better accuracy, roughly equal performance, and strict standard conformance, since we no longer need any compiler-specific tricks.
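For context, the two formulations differ only in how the Gaussian CDF term is computed. Below is a minimal, self-contained sketch contrasting the exact erf-based GeLU with the tanh approximation being removed, plus the exact backward pass; the function names and loop shape are illustrative, not necessarily those used in train_gpt2.c.

```c
#include <math.h>
#include <stdio.h>

// Exact GeLU: GELU(x) = x * Phi(x) = 0.5 * x * (1 + erf(x / sqrt(2)))
void gelu_exact_forward(float* out, const float* inp, int n) {
    for (int i = 0; i < n; i++) {
        float x = inp[i];
        out[i] = 0.5f * x * (1.0f + erff(x / sqrtf(2.0f)));
    }
}

// The tanh approximation this change removes:
// GELU(x) ~= 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
void gelu_tanh_forward(float* out, const float* inp, int n) {
    const float k = 0.7978845608028654f; // sqrt(2/pi)
    for (int i = 0; i < n; i++) {
        float x = inp[i];
        out[i] = 0.5f * x * (1.0f + tanhf(k * (x + 0.044715f * x * x * x)));
    }
}

// Backward of the exact GeLU: d/dx [x*Phi(x)] = Phi(x) + x*phi(x),
// where phi(x) = exp(-x^2/2) / sqrt(2*pi). Gradients accumulate into dinp.
void gelu_exact_backward(float* dinp, const float* inp, const float* dout, int n) {
    for (int i = 0; i < n; i++) {
        float x = inp[i];
        float cdf = 0.5f * (1.0f + erff(x / sqrtf(2.0f)));
        float pdf = 0.3989422804014327f * expf(-0.5f * x * x); // 1/sqrt(2*pi)
        dinp[i] += (cdf + x * pdf) * dout[i];
    }
}

// Quick check: largest gap between the two forward formulations on [-4, 4].
int main(void) {
    float max_diff = 0.0f;
    for (int i = 0; i <= 800; i++) {
        float x = -4.0f + 0.01f * i;
        float a, b;
        gelu_exact_forward(&a, &x, 1);
        gelu_tanh_forward(&b, &x, 1);
        float d = fabsf(a - b);
        if (d > max_diff) max_diff = d;
    }
    printf("max |exact - tanh| on [-4, 4]: %g\n", max_diff);
    return 0;
}
```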
Here are the last lines of train_gpt2 output before this change:
step 37: train loss 3.739647 (took 598.548076 ms)
step 38: train loss 4.611735 (took 596.626145 ms)
step 39: train loss 3.970751 (took 598.439552 ms)
val loss 4.016658
generating:
---
Come Running Away,
Greater conquer
With the Imperial blood
the heaviest host of the gods
into this wondrous world beyond.
I will not back thee, for how sweet after birth
Netflix against repounder,
will not
flourish against the earlocks of
Allay
---
step 40: train loss 4.377756 (took 592.704936 ms)
Here are the last lines of train_gpt2 output after this change:
step 37: train loss 3.731596 (took 594.893995 ms)
step 38: train loss 4.561646 (took 600.064035 ms)
step 39: train loss 3.933512 (took 599.666173 ms)
val loss 4.014135
generating:
---
Whether Hipocrates,
Bigon Nicinius, or rep'd
With Thy fair winter-tail your outraged hand,
The richness of the good smour
Nine years by turns covered my Member. Thou art
Nay, I fear be; but
Lets o' thee know, if it
---
step 40: train loss 4.358461 (took 597.594065 ms)
This change has the disadvantage of diverging from PyTorch. I view this as justified and worthwhile for numerous reasons, e.g.:
"I used the tanh approximation simply because the error function erf was slow in tensorflow some years ago. If the exact version is fast enough now and does not have numerical issues, I do not see a reason to use an inexact version." ──Quoth Dan Hendrycks
See https://github.com/pytorch/pytorch/issues/39853