Shao Tang
Reduce duplicate computation in crossentropy and include V padding in the crossentropy kernel
- Introduce `const int bt = b * T + t` to avoid recomputing the same index. [Over 5+ trials on a 3070, there was no obvious performance improvement.] - Include the...
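The index hoisting described above can be sketched as follows. This is a minimal illustration, not the kernel from the pull request: the kernel name, the launcher, and the padded row stride `Vp` (assumed to satisfy `Vp >= V`) are hypothetical, and the buffer layout `(B, T, Vp)` is an assumption.

```cuda
#include <cuda_runtime.h>

// Round-up integer division, used to size the grid.
inline int ceil_div(int a, int b) { return (a + b - 1) / b; }

// One thread per (b, t, v) element of a (B, T, Vp) buffer.
// Vp is a hypothetical padded row stride (Vp >= V); it is an assumption here.
__global__ void crossentropy_softmax_backward_sketch(
        float* dlogits, const float* dlosses, const float* probs,
        const int* targets, int B, int T, int V, int Vp) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= B * T * Vp) { return; }
    int b = i / (T * Vp);
    int t = (i / Vp) % T;
    int v = i % Vp;
    if (v >= V) { return; }        // padding column, left untouched
    const int bt = b * T + t;      // computed once, reused for every lookup below
    float dloss = dlosses[bt];
    int ix = targets[bt];
    float p = probs[bt * Vp + v];
    float indicator = (v == ix) ? 1.0f : 0.0f;
    dlogits[bt * Vp + v] += (p - indicator) * dloss;
}

// Hypothetical launcher: all pointers must be device pointers.
void launch_crossentropy_softmax_backward_sketch(
        float* dlogits, const float* dlosses, const float* probs,
        const int* targets, int B, int T, int V, int Vp, int block_size) {
    int grid_size = ceil_div(B * T * Vp, block_size);
    crossentropy_softmax_backward_sketch<<<grid_size, block_size>>>(
        dlogits, dlosses, probs, targets, B, T, V, Vp);
}
```

The compiler will often eliminate the repeated `b * T + t` on its own, which would be consistent with the lack of a measurable speedup reported above.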
A larger `thread_reuse_factor` reduces the number of threads launched while increasing the per-thread load. Depending on the value of `B * T * OC` and the GPU card, it is...
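As a rough sketch of the trade-off, the kernel below lets each thread process several elements through a grid-stride loop, and the launcher shrinks the grid by `thread_reuse_factor`. The bias-add kernel, the helper `grid_size_for`, and the launcher are hypothetical stand-ins chosen for illustration; only the `thread_reuse_factor` name and the `B * T * OC` element count come from the description.

```cuda
#include <cuda_runtime.h>

// Number of blocks needed when each thread handles thread_reuse_factor elements.
inline int grid_size_for(int n, int block_size, int thread_reuse_factor) {
    int per_block = block_size * thread_reuse_factor;
    return (n + per_block - 1) / per_block;
}

// Hypothetical elementwise kernel: adds bias[col] to every element of out,
// where out has shape (B, T, OC). The grid-stride loop lets a smaller grid
// still cover all B * T * OC elements.
__global__ void add_bias_sketch(float* out, const float* bias, int B, int T, int OC) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = idx; i < B * T * OC; i += stride) {
        out[i] += bias[i % OC];
    }
}

// Hypothetical launcher: out and bias must be device pointers.
void launch_add_bias_sketch(float* out, const float* bias, int B, int T, int OC,
                            int block_size, int thread_reuse_factor) {
    int grid_size = grid_size_for(B * T * OC, block_size, thread_reuse_factor);
    add_bias_sketch<<<grid_size, block_size>>>(out, bias, B, T, OC);
}
```

With `block_size = 256`, going from `thread_reuse_factor = 1` to `4` cuts the grid for 1024 elements from 4 blocks to 1, so each thread runs four loop iterations instead of one.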
With master branch:
```
❯ nvcc -O3 --use_fast_math crossentropy_softmax_backward.cu -o crossentropy_softmax_backward
❯ ./crossentropy_softmax_backward 1
Using kernel 1
Checking block size 32.
-0.438539 -0.438539
0.136152 0.136152
-0.364946 -0.364946
-0.384722 -0.384722
-0.530659...
```
https://github.com/karpathy/llm.c/issues/147
https://github.com/karpathy/llm.c/issues/147: the script executes normally after the fix
Makefile typo fix: move gelu_backward to the backward block
No noticeable change in performance after switching from float4 to pack128.
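For context, a 128-bit pack can be sketched as below. The struct `Pack128f`, its helpers, and the scaling kernel are hypothetical and are not the repository's actual definitions. Both this pack and `float4` occupy 16 bytes with 16-byte alignment, so the memory traffic is the same either way, which fits the unchanged timings.

```cuda
#include <cuda_runtime.h>
#include <cstring>

// Hypothetical 128-bit pack of four floats (same size and alignment as float4).
struct alignas(16) Pack128f {
    float payload[4];
    __host__ __device__ float& operator[](int i) { return payload[i]; }
    __host__ __device__ const float& operator[](int i) const { return payload[i]; }
};

// Load 16 bytes in one access; addr must be 16-byte aligned.
__device__ inline Pack128f load128(const float* addr) {
    int4 bits = *reinterpret_cast<const int4*>(addr);
    Pack128f v;
    memcpy(&v, &bits, sizeof(v));
    return v;
}

// Store 16 bytes in one access; addr must be 16-byte aligned.
__device__ inline void store128(float* addr, const Pack128f& v) {
    int4 bits;
    memcpy(&bits, &v, sizeof(bits));
    *reinterpret_cast<int4*>(addr) = bits;
}

// Hypothetical elementwise kernel: out = scale * inp, four floats per thread.
// Assumes n is a multiple of 4 and both pointers are 16-byte aligned.
__global__ void scale_pack128_sketch(float* out, const float* inp, float scale, int n) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    if (i >= n) { return; }
    Pack128f x = load128(inp + i);
    Pack128f y;
    for (int k = 0; k < 4; ++k) { y[k] = scale * x[k]; }
    store128(out + i, y);
}
```

One practical reason to prefer a pack type over `float4` is that the same pattern extends to other element types, whereas `float4` is fixed to four floats.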
On 3070:
Kernel 2: time gpu 0.0799 ms, time cpu 0.0168 ms
Kernel 3: time gpu 0.0780 ms, time cpu 0.0166 ms