llm.c
Reduce the duplicate computation in crossentropy and include the V padding for crossentropy kernel
- Introduce `const int bt = b * T + t` to reduce the duplicated computation. (Running 5+ trials on a 3070 showed no obvious improvement in performance.)
- Include the V padding for the crossentropy kernel (kernel 2).
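As a rough illustration of the two changes, the sketch below is a plain C loop rather than the CUDA kernel from this PR. It assumes `probs` is laid out as (B, T, Vp) with rows padded to `Vp` entries; the function name `gather_target_probs` and the output buffer are hypothetical. It computes `bt` once per position and uses the padded stride to find the target's probability (the cross-entropy loss would then be the negative log of that value).

```c
#include <stddef.h>

// Hypothetical helper, not taken from the PR.
// probs:   (B, T, Vp) row-major; only the first V entries of each row are real
// targets: (B, T), each value assumed to lie in [0, V)
// out:     (B, T), receives probs[b, t, targets[b, t]]
void gather_target_probs(float *out, const float *probs, const int *targets,
                         int B, int T, int Vp) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            const int bt = b * T + t;                    // computed once, reused below
            const float *row = probs + (size_t)bt * Vp;  // row stride is Vp, not V
            out[bt] = row[targets[bt]];
        }
    }
}
```

Indexing with `V` instead of `Vp` would read from the wrong row for every position after the first, which is why the padded width has to be passed in.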
With V_pad = 50304 and V = 50257, on a 3070:

Kernel 1
block_size 32 | time 0.0073 ms | per token 0.90 ns
block_size 64 | time 0.0069 ms | per token 0.84 ns
block_size 128 | time 0.0080 ms | per token 0.97 ns
block_size 256 | time 0.0069 ms | per token 0.84 ns
block_size 512 | time 0.0067 ms | per token 0.82 ns
block_size 1024 | time 0.0079 ms | per token 0.96 ns

Kernel 2
block_size 32 | time 0.0062 ms | per token 0.75 ns
block_size 64 | time 0.0075 ms | per token 0.92 ns
block_size 128 | time 0.0070 ms | per token 0.86 ns
block_size 256 | time 0.0075 ms | per token 0.92 ns
block_size 512 | time 0.0069 ms | per token 0.84 ns
block_size 1024 | time 0.0069 ms | per token 0.84 ns
Overall, the timings of kernels 1 and 2 are noisy, and the difference between them is within the run-to-run fluctuation.
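For reading the per-token column, a minimal sketch of the conversion from kernel time to nanoseconds per token follows. The helper name `per_token_ns` is hypothetical, and the token count of 8192 used in the note after it is only an inference from the reported figures, not something stated in this thread.

```c
// Convert a measured kernel time in milliseconds to nanoseconds per token.
// num_tokens is the number of (b, t) positions processed per launch (B * T).
double per_token_ns(double time_ms, int num_tokens) {
    return time_ms * 1e6 / num_tokens;  // 1 ms = 1e6 ns
}
```

Assuming 8192 tokens per launch, `per_token_ns(0.0069, 8192)` gives roughly 0.84, which lines up with the reported value. The spread between block sizes is a few hundredths of a nanosecond per token.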
Isn't the `if` necessary for safety?
Correct, I was too focused on thread divergence and forgot about the basics :) Updated the PR.
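To make the safety point concrete, here is a small C emulation of a one-thread-per-position launch. It is a sketch under stated assumptions, not the kernel from this PR: the names `thread_body` and `emulate_launch` are hypothetical, and `i` stands in for `blockIdx.x * blockDim.x + threadIdx.x`. The grid is rounded up to whole blocks, so the last block can contain threads whose index is past the end of the data.

```c
#include <stddef.h>

// Emulated per-thread body. `i` plays the role of the global thread index.
static void thread_body(float *out, const float *probs, const int *targets,
                        int N, int Vp, int i) {
    if (i < N) {  // guard: surplus threads in the last block do nothing
        out[i] = probs[(size_t)i * Vp + targets[i]];
    }
}

// Emulated launch over N = B * T positions. Returns the number of threads
// "launched", which is N rounded up to a multiple of block_size.
int emulate_launch(float *out, const float *probs, const int *targets,
                   int N, int Vp, int block_size) {
    const int grid_size = (N + block_size - 1) / block_size;  // ceiling division
    for (int block = 0; block < grid_size; block++) {
        for (int thread = 0; thread < block_size; thread++) {
            thread_body(out, probs, targets, N, Vp, block * block_size + thread);
        }
    }
    return grid_size * block_size;
}
```

With N = 5 and a block size of 4, eight threads run but only five have valid work. Without the `if`, the last three would index past the end of `targets` and `probs`. The branch only diverges inside the final block, so its cost is small compared with the out-of-bounds risk.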
Include a kernel taking padded probs.
This code is no longer used because of the fused classifier PR.