llm.c
Implementation of online softmax forward kernel without cooperative groups.
Following the instructions in #292, I implemented an online softmax forward kernel that does not use cooperative groups (in contrast to softmax_forward_online_kernel2()) and uses warp-level sum and max reductions instead.
In terms of performance, this implementation is close to the online softmax implementation with cooperative groups (cgs), softmax_forward_online_kernel2(), and faster than both softmax_forward_online_kernel1() and softmax_forward_kernel7(). The former is online softmax without optimizations; the latter is regular softmax with optimizations.
My GPU: 3070 Laptop. My OS: Ubuntu 22.04.
softmax_forward_online_kernel1():
softmax_forward_kernel7():
softmax_forward_online_kernel2() (with cgs):
This implementation: