
Include the online softmax CPU code and a fully parallelized GPU kernel

Open lancerts opened this issue 10 months ago • 4 comments

  1. Include the online softmax CPU code (from the paper "Online normalizer calculation for softmax").
  2. Include its native port to a GPU kernel (kernel 5), for educational comparison.
  3. Include the fully parallel kernel (kernel 6), where the reduction op uses the associative (max, sum) merge from the online softmax paper, `m = max(m_a, m_b)`, `d = d_a * exp(m_a - m) + d_b * exp(m_b - m)`, together with cooperative groups (from @ngc92); see the sketch below.
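
For reference, the combine operator in plain C would look roughly like this (a sketch following the paper's formulation; `MaxSum` and `maxsum_combine` are illustrative names, not the actual kernel code):

```c
#include <math.h>

// Partial result of the online softmax reduction over a chunk of the row:
// m is the running max, d is the running sum of exp(x_i - m).
typedef struct { float m; float d; } MaxSum;

// Associative combine of two partial results; this is the operator the
// parallel reduction applies when merging per-thread / per-warp partials.
MaxSum maxsum_combine(MaxSum a, MaxSum b) {
    MaxSum r;
    r.m = fmaxf(a.m, b.m);
    r.d = a.d * expf(a.m - r.m) + b.d * expf(b.m - r.m);
    return r;
}
```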

RTX 3070:

B=8, T=1024, kernel 4: (timing screenshot)

Kernel 6: (timing screenshot)

B=64, T=1024, kernel 4: (timing screenshot)

Kernel 6: (timing screenshot)

lancerts · Apr 11 '24 16:04

What is the benefit of the online softmax for us?

karpathy · Apr 11 '24 17:04

> What is the benefit of the online softmax for us?

  • It reduces the three for loops (1: compute the max, 2: compute the sum of exponentials, 3: compute the output) to two for loops (1: compute the max and the sum in a single pass, 2: compute the output). Therefore, it is an algorithmic improvement for faster computation; see the sketch below.
  • It is used in FlashAttention 1 and 2 as the building block of the derivation.

As follow-ups, we still need to apply the corresponding optimizations from kernels 2-4 to the online softmax...
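
For concreteness, here is a minimal CPU sketch of the two-pass idea (illustrative only; the function name and signature are just for this example, not the exact code in the PR):

```c
#include <math.h>
#include <float.h>

// Online softmax on the CPU: one pass computes the max and the normalizer
// together, one pass writes the output (vs. three passes for the naive version).
void softmax_online(float* out, const float* inp, int N) {
    float maxval = -FLT_MAX;
    float sum = 0.0f;
    for (int i = 0; i < N; i++) {
        float x = inp[i];
        if (x > maxval) {
            // the running max grew: rescale the normalizer accumulated so far
            sum *= expf(maxval - x);
            maxval = x;
        }
        sum += expf(x - maxval);
    }
    for (int i = 0; i < N; i++) {
        out[i] = expf(inp[i] - maxval) / sum;
    }
}
```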

lancerts · Apr 11 '24 17:04

Updated with a fully parallel kernel.

lancerts · Apr 11 '24 22:04

On my A100 I am seeing:

kernel 4:

block_size   32 | time 0.221143 ms
block_size   64 | time 0.096894 ms
block_size  128 | time 0.069505 ms
block_size  256 | time 0.066208 ms
block_size  512 | time 0.076968 ms
block_size 1024 | time 0.116055 ms

kernel 6:

block_size   32 | time 0.101643 ms
block_size   64 | time 0.102056 ms
block_size  128 | time 0.085717 ms
block_size  256 | time 0.084556 ms
block_size  512 | time 0.085275 ms
block_size 1024 | time 0.085103 ms

So yes, as you mentioned on Discord, the kernel is more consistent across block sizes, but maybe not as fast, provided you choose the right block size for kernel 4.

karpathy · Apr 13 '24 01:04