Refactoring & Improvements to reduce LOC
Refactoring and removing unused functions to reduce the number of lines of code and make everything slightly more consistent (while still leaving the code room to breathe).
Also updates encoder_backward with my version from the more_stochastic branch, so that atomicAddX() can be deleted from the codebase. This also improves accuracy very slightly by using stochastic rounding; it was literally the only place in the entire codebase where we were not accumulating in FP32!
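The stochastic-rounding trick can be sketched roughly as below. This is a hypothetical illustration, not the PR's exact code (the helper name stochastic_round_bf16 is made up): adding random bits below the BF16 cutoff before truncating makes the rounded value unbiased in expectation, unlike plain truncation, so small FP32 gradient contributions are not systematically lost.

```cuda
#include <cuda_bf16.h>

// Hypothetical sketch: stochastically round an FP32 value to BF16.
// BF16 keeps only the top 16 bits of the FP32 bit pattern; adding
// random bits into the low 16 before truncating turns truncation
// into unbiased stochastic rounding.
__device__ __nv_bfloat16 stochastic_round_bf16(float v, unsigned int rnd) {
    unsigned int bits = __float_as_uint(v);
    bits += rnd & 0xFFFFu;  // random carry into the kept bits
    return __ushort_as_bfloat16((unsigned short)(bits >> 16));
}
```

Edge cases (NaN/Inf, and carries into the exponent field) would need care in a real implementation; this only shows the core idea.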
And changes the Makefile so that, on Linux with nvidia-smi available, it compiles for the user's specific GPU (generating both PTX and SASS for the binary, so that cuobjdump can be used to check the assembly).
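The Makefile change is roughly along these lines (a hypothetical sketch, not the PR's actual Makefile; variable names are illustrative, and nvidia-smi's compute_cap query field requires a reasonably recent driver):

```makefile
# Sketch: detect the local GPU's compute capability and target it
# directly. The -gencode ... code=[sm_XX,compute_XX] form embeds both
# SASS (sm_XX) and PTX (compute_XX), so cuobjdump can show the assembly
# while the binary still JIT-compiles on other GPUs.
ifneq ($(shell which nvidia-smi),)
  GPU_CC := $(shell nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n1 | tr -d '.')
  NVCC_ARCH := -gencode arch=compute_$(GPU_CC),code=[sm_$(GPU_CC),compute_$(GPU_CC)]
endif

train_gpt2cu: train_gpt2.cu
	nvcc -O3 $(NVCC_ARCH) $< -o $@
```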
All looks good, happy to merge. A few minor notes:
- A few stray cudaCheck(cudaGetLastError()); calls
- The print of enable_tf32 seems to have been dropped (?)
- CI failed for fp16; is this expected?
> CI failed for fp16; is this expected?
Nope, that was a mistake: I hadn't included an FP16 version of the new atomicStochasticAdd. I've fixed it to use templates, which also let me get rid of "__bfloat1622float2", which AFAIK was the only functionality missing from older CUDA versions, so train_gpt2.cu should (hopefully) now compile on CUDA 10/11 as well!
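A templated version might look roughly like the sketch below. This is an illustration only, not the PR's actual code (the names atomicStochasticAdd, load_as_float, and round_stochastic are assumptions here): the template specialisations cover both __half and __nv_bfloat16 using only widely available single-element intrinsics, avoiding bf16-pair helpers like __bfloat1622float2 that older CUDA toolkits lack.

```cuda
#include <cuda_fp16.h>
#include <cuda_bf16.h>

// Read a 16-bit element as FP32, specialised per storage type.
template<typename T> __device__ float load_as_float(T v);
template<> __device__ float load_as_float(__half v)        { return __half2float(v); }
template<> __device__ float load_as_float(__nv_bfloat16 v) { return __bfloat162float(v); }

// Stochastically round an FP32 value to the storage type: add random
// bits below the kept mantissa, then truncate (round-toward-zero).
template<typename T> __device__ T round_stochastic(float v, unsigned int rnd);
template<> __device__ __half round_stochastic(float v, unsigned int rnd) {
    // FP16 has a 10-bit mantissa vs FP32's 23: 13 bits are discarded.
    unsigned int bits = __float_as_uint(v) + (rnd & 0x1FFFu);
    return __float2half_rz(__uint_as_float(bits & ~0x1FFFu));
}
template<> __device__ __nv_bfloat16 round_stochastic(float v, unsigned int rnd) {
    // BF16 discards the low 16 bits of the FP32 pattern.
    unsigned int bits = __float_as_uint(v) + (rnd & 0xFFFFu);
    return __float2bfloat16_rz(__uint_as_float(bits & ~0xFFFFu));
}

// Accumulate in FP32, stochastically round back to 16 bits, and commit
// with a CAS loop. Note: 16-bit atomicCAS needs sm_70+; a real
// implementation might instead CAS a 32-bit word covering two elements.
template<typename T>
__device__ void atomicStochasticAdd(T* addr, float val, unsigned int rnd) {
    unsigned short* p = reinterpret_cast<unsigned short*>(addr);
    unsigned short old = *p, assumed;
    do {
        assumed = old;
        T cur; memcpy(&cur, &assumed, sizeof(T));
        float sum = load_as_float(cur) + val;      // FP32 accumulation
        T next = round_stochastic<T>(sum, rnd);
        unsigned short nb; memcpy(&nb, &next, sizeof(T));
        old = atomicCAS(p, assumed, nb);
    } while (old != assumed);
}
```

With this shape, the same call site works for both precisions, so an FP16 build no longer needs a separate code path.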