Results 33 issues of Shao Tang

- Introduce `const int bt = b * T + t` to avoid duplicated computation. [Across 5+ trials on a 3070, there was no obvious performance improvement.]
- Include the...
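The index hoisting described in the first item can be sketched as follows. This is only an illustration, assuming a contiguous `(B, T, C)` layout; the function `scale_row` and its parameters `out`, `inp`, and `scale` are hypothetical names, not code from the PR.

```c
#include <stddef.h>

/* Hypothetical example: scale one (b, t) row of a (B, T, C) tensor.
 * The flattened position bt is computed once and reused for every
 * index expression instead of repeating b * T + t. */
static void scale_row(float *out, const float *inp, const float *scale,
                      int b, int t, int T, int C) {
    const int bt = b * T + t;      /* flattened (b, t) position */
    const float s = scale[bt];     /* one scale value per (b, t) */
    for (int c = 0; c < C; c++) {
        out[bt * C + c] = inp[bt * C + c] * s;
    }
}
```

An optimizing compiler will often perform this common-subexpression elimination on its own, which would be consistent with the observation that timings did not change.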

A larger `thread_reuse_factor` reduces the number of threads launched while increasing the per-thread load. Depending on the value of `B * T * OC` and the GPU card, it is...
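The trade-off can be made concrete with a little launch arithmetic. The sketch below is host-side C, not the kernel from the PR; the helpers `ceil_div`, `threads_needed`, and `blocks_needed` are hypothetical names, and it assumes each thread handles `thread_reuse_factor` elements out of `B * T * OC` in total.

```c
#include <assert.h>
#include <stddef.h>

/* Ceiling division: smallest q such that q * b >= a. Requires b > 0. */
static size_t ceil_div(size_t a, size_t b) {
    assert(b > 0);
    return (a + b - 1) / b;
}

/* Hypothetical helper: threads required when each thread processes
 * thread_reuse_factor elements out of n_elements total. */
static size_t threads_needed(size_t n_elements, size_t thread_reuse_factor) {
    return ceil_div(n_elements, thread_reuse_factor);
}

/* Hypothetical helper: blocks required for a given block size. */
static size_t blocks_needed(size_t n_elements, size_t thread_reuse_factor,
                            size_t block_size) {
    return ceil_div(threads_needed(n_elements, thread_reuse_factor), block_size);
}
```

For example, with an assumed `B * T * OC` of `8 * 1024 * 768` elements and a block size of 256, a factor of 1 needs 24576 blocks while a factor of 4 needs 6144, each thread doing four times the work.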

With master branch:

```
❯ nvcc -O3 --use_fast_math crossentropy_softmax_backward.cu -o crossentropy_softmax_backward
❯ ./crossentropy_softmax_backward 1
Using kernel 1
Checking block size 32.
-0.438539 -0.438539
0.136152 0.136152
-0.364946 -0.364946
-0.384722 -0.384722
-0.530659...
```

https://github.com/karpathy/llm.c/issues/147

https://github.com/karpathy/llm.c/issues/147 The script executes normally after the fix.

Makefile typo fix: move gelu_backward to the backward block

No noticeable change in performance after switching from float4 to pack128.

On 3070:

- Kernel 2: time gpu 0.0799 ms, time cpu 0.0168 ms
- Kernel 3: time gpu 0.0780 ms, time cpu 0.0166 ms