llm.c Reduce the duplicate computation in crossentropy and include the V padding for crossentropy kernel


Open lancerts opened this issue 10 months ago • 3 comments

Introduce const int bt = b * T + t so the flat (b, t) index is computed once instead of being repeated. [Across 5+ trials on a 3070, there was no obvious improvement in performance.]
Include the V padding in the crossentropy kernel (kernel 2).

With V_pad = 50304 and V = 50257, on 3070

Overall, the timings of kernels 1 and 2 are noisy, and any difference between them is on par with the run-to-run fluctuation in runtime.

Apr 15 '24 17:04 lancerts

Isn't the if necessary for safety?

Apr 15 '24 19:04 karpathy

Isn't the if necessary for safety?

Correct, I was too focused on the thread divergence and forgot about the basics :) Updated the PR.

Apr 15 '24 19:04 lancerts

Include a kernel taking padded probs.

Apr 15 '24 21:04 lancerts

This code is no longer used because of the fused classifier PR.

Apr 20 '24 05:04 karpathy