llm.c
Include the online softmax CPU code and a fully parallelized GPU kernel
- Include the online softmax CPU code (from the paper "Online normalizer calculation for softmax").
- Include its native port to a GPU kernel, kernel 5 (for educational comparison).
- Include the fully parallel kernel 6, where the reduction op uses cooperative groups (from @ngc92).
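For illustration, here is a small CUDA sketch of what an online-softmax reduction with cooperative groups can look like. This is not the actual kernel 5/6 code in this PR: the one-warp-per-row layout, the `SoftmaxState` struct, and the kernel name are assumptions made for the example. Each lane accumulates a partial (max, sum) state over a strided slice of the row, and the partials are merged with warp shuffles.

```cuda
// Illustrative sketch only -- not the PR's kernel 6. Assumes input of shape
// (N, C), one warp per row, and made-up names (SoftmaxState, combine, ...).
#include <float.h>
#include <math.h>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

struct SoftmaxState { float maxval; float sum; };

// merge two partial (running max, sum of exponentials) states
__device__ SoftmaxState combine(SoftmaxState a, SoftmaxState b) {
    SoftmaxState r;
    r.maxval = fmaxf(a.maxval, b.maxval);
    r.sum = a.sum * expf(a.maxval - r.maxval) + b.sum * expf(b.maxval - r.maxval);
    return r;
}

__global__ void online_softmax_warp_kernel(float* out, const float* inp, int N, int C) {
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);
    int row = blockIdx.x * warp.meta_group_size() + warp.meta_group_rank();
    if (row >= N) return;
    const float* x = inp + (size_t)row * C;
    float* y = out + (size_t)row * C;

    // each lane accumulates a partial state over a strided slice of the row
    SoftmaxState s = { -FLT_MAX, 0.0f };
    for (int i = warp.thread_rank(); i < C; i += warp.size()) {
        SoftmaxState cur = { x[i], 1.0f };
        s = combine(s, cur);
    }
    // tree-reduce the 32 partial states with warp shuffles
    for (int offset = warp.size() / 2; offset > 0; offset /= 2) {
        SoftmaxState other;
        other.maxval = warp.shfl_down(s.maxval, offset);
        other.sum    = warp.shfl_down(s.sum, offset);
        s = combine(s, other);
    }
    // lane 0 now holds the full (max, sum); broadcast to the whole warp
    s.maxval = warp.shfl(s.maxval, 0);
    s.sum    = warp.shfl(s.sum, 0);

    // second pass: write the normalized probabilities
    float inv = 1.0f / s.sum;
    for (int i = warp.thread_rank(); i < C; i += warp.size()) {
        y[i] = expf(x[i] - s.maxval) * inv;
    }
}
```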
RTX 3070:
B=8, T=1024
Kernel 4:
Kernel 6:
B=64, T=1024
Kernel 4:
Kernel 6:
What is the benefit of the online softmax for us?
- It reduces the 3 for loops (1: compute max, 2: compute sum, 3: compute output) to 2 for loops (1: compute max and sum in one loop, 2: compute output), as illustrated in the sketch after this list. Therefore, it is an algorithmic improvement for faster computation.
- It is used in FlashAttention 1 & 2 as a building block in the derivation.
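To make the two-loop structure concrete, here is a minimal single-row CPU sketch of the online normalizer calculation (assuming a row of C logits; the function name and signature are illustrative, not the exact CPU reference in this PR):

```c
// Minimal sketch of the online (one-pass max + sum) softmax for a single row.
// Not the PR's exact CPU reference; names and signature are illustrative.
#include <math.h>
#include <float.h>

void softmax_online_cpu(float* y, const float* x, int C) {
    // loop 1: running max, and running sum of exponentials relative to that max;
    // whenever the max grows, rescale the accumulated sum by exp(old_max - new_max)
    float maxval = -FLT_MAX;
    float sum = 0.0f;
    for (int i = 0; i < C; i++) {
        if (x[i] > maxval) {
            sum *= expf(maxval - x[i]);  // correct the existing sum for the new max
            maxval = x[i];
        }
        sum += expf(x[i] - maxval);
    }
    // loop 2: normalize
    for (int i = 0; i < C; i++) {
        y[i] = expf(x[i] - maxval) / sum;
    }
}
```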
The corresponding optimizations used in kernels 2-4 still need to be implemented for the online softmax as follow-ups...
Updated with a fully parallel kernel.
On my A100 I am seeing:
kernel 4:
block_size 32 | time 0.221143 ms
block_size 64 | time 0.096894 ms
block_size 128 | time 0.069505 ms
block_size 256 | time 0.066208 ms
block_size 512 | time 0.076968 ms
block_size 1024 | time 0.116055 ms
kernel 6:
block_size 32 | time 0.101643 ms
block_size 64 | time 0.102056 ms
block_size 128 | time 0.085717 ms
block_size 256 | time 0.084556 ms
block_size 512 | time 0.085275 ms
block_size 1024 | time 0.085103 ms
So yes, as you mentioned on Discord, the kernel is more consistent across block sizes, but maybe not as fast, provided you choose the right block size for kernel 4.