llm.c Implementation of online softmax forward kernel without cooperative groups.

Open KarhouTam opened this issue 1 year ago • 0 comments

Following the instructions in #292, I implemented an online softmax forward kernel that does not use cooperative groups (in contrast to softmax_forward_online_kernel2()) and relies on warp-level sum and max reductions instead.

In terms of performance, this implementation is close to the online softmax implementation with cooperative groups (cgs) and faster than both softmax_forward_online_kernel1() and softmax_forward_kernel7(). The former is online softmax without optimizations; the latter is regular softmax with optimizations.

My GPU: 3070 Laptop
My OS: Ubuntu 22.04

softmax_forward_online_kernel1():

softmax_forward_kernel7():

softmax_forward_online_kernel2() (with cgs):

This implementation:

May 06 '24 07:05 KarhouTam
