llm.c

Reduce the duplicate computation in crossentropy and include the V padding for crossentropy kernel

Open lancerts opened this issue 10 months ago • 3 comments

  • Introduce const int bt = b * T + t so the flattened index is computed once instead of being repeated. [Across 5+ trials on a 3070, no obvious performance improvement.]
  • Include the V padding in the crossentropy kernel (kernel 2).

With V_pad = 50304 and V = 50257, on 3070

Kernel 1
block_size 32 | time 0.0073 ms | per token 0.90 ns
block_size 64 | time 0.0069 ms | per token 0.84 ns
block_size 128 | time 0.0080 ms | per token 0.97 ns
block_size 256 | time 0.0069 ms | per token 0.84 ns
block_size 512 | time 0.0067 ms | per token 0.82 ns
block_size 1024 | time 0.0079 ms | per token 0.96 ns

Kernel 2
block_size 32 | time 0.0062 ms | per token 0.75 ns
block_size 64 | time 0.0075 ms | per token 0.92 ns
block_size 128 | time 0.0070 ms | per token 0.86 ns
block_size 256 | time 0.0075 ms | per token 0.92 ns
block_size 512 | time 0.0069 ms | per token 0.84 ns
block_size 1024 | time 0.0069 ms | per token 0.84 ns

Overall, the timings for kernels 1 and 2 are noisy, and the differences between them are on par with run-to-run fluctuations.

lancerts avatar Apr 15 '24 17:04 lancerts

Isn't the if necessary for safety?

karpathy avatar Apr 15 '24 19:04 karpathy

> Isn't the if necessary for safety?

Correct. I was too focused on thread divergence and forgot about the basics :) Updated the PR.

lancerts avatar Apr 15 '24 19:04 lancerts

Include a kernel taking padded probs.

lancerts avatar Apr 15 '24 21:04 lancerts

This code is no longer used because of the fused classifier PR.

karpathy avatar Apr 20 '24 05:04 karpathy