Implementation of online softmax forward kernel without cooperative groups.

Open KarhouTam opened this issue 1 year ago • 0 comments

Following the instructions in #292, I implemented an online softmax forward kernel that does not use cooperative groups (in contrast to softmax_forward_online_kernel2()) and uses warp-level sum and max reductions instead.

In terms of performance, this implementation is close to the online softmax implementation with cooperative groups (softmax_forward_online_kernel2()) and faster than both softmax_forward_online_kernel1() and softmax_forward_kernel7(). The former is online softmax without optimizations; the latter is regular softmax with optimizations.

My GPU: 3070 Laptop

My OS: Ubuntu 22.04

softmax_forward_online_kernel1(): image

softmax_forward_kernel7(): image

softmax_forward_online_kernel2() (with cooperative groups): image

This implementation: image

KarhouTam, May 06 '24 07:05