
Initial implementation of block sparse FMHA

Open • cameronshinn opened this issue 11 months ago • 1 comment

I have an example implementation of a block-sparse attention kernel that builds on top of the existing fmha_fwd example. Current benchmarks show roughly a 2.22x speedup at a 4K sequence length and a 15.54x speedup at a 32K sequence length (using a 128x128 tile size). The attention mask is the same as the one described in the BigBird paper.
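For context, the BigBird mask combines three block-level patterns: a sliding window around the diagonal, a few global blocks that attend (and are attended to) everywhere, and a few randomly sampled blocks per row. Since the number of active blocks per row stays roughly constant as the sequence grows, the work scales linearly in the number of blocks rather than quadratically, which is consistent with the larger speedup at 32K. Below is a minimal host-side sketch of building such a mask at block granularity; it is not the PR's actual data structure, and the function name and parameters (`build_bigbird_block_mask`, `window_blocks`, `global_blocks`, `random_blocks`) are hypothetical.

```cpp
// Minimal sketch (not the PR's actual data structures): a BigBird-style
// block mask at tile granularity (e.g. 128x128 tiles). For each row
// block, returns the sorted list of active column blocks; a kernel can
// then visit only these tiles and skip the rest of the score matrix.
#include <algorithm>
#include <cstdint>
#include <random>
#include <set>
#include <vector>

std::vector<std::vector<int>> build_bigbird_block_mask(int seq_len,
                                                       int block_size,
                                                       int window_blocks,
                                                       int global_blocks,
                                                       int random_blocks,
                                                       uint32_t seed = 0)
{
    const int num_blocks = (seq_len + block_size - 1) / block_size;
    std::mt19937 rng(seed);
    std::vector<std::vector<int>> mask(num_blocks);

    for(int row = 0; row < num_blocks; ++row)
    {
        std::set<int> cols;
        if(row < global_blocks)
        {
            // Global rows attend to every column block.
            for(int c = 0; c < num_blocks; ++c)
                cols.insert(c);
        }
        // Global columns: every row attends to the first few blocks.
        for(int c = 0; c < std::min(global_blocks, num_blocks); ++c)
            cols.insert(c);
        // Sliding window: blocks within window_blocks of the diagonal.
        for(int c = std::max(0, row - window_blocks);
            c <= std::min(num_blocks - 1, row + window_blocks); ++c)
            cols.insert(c);
        // Random blocks: a few extra columns sampled per row.
        std::uniform_int_distribution<int> dist(0, num_blocks - 1);
        for(int i = 0; i < random_blocks; ++i)
            cols.insert(dist(rng));
        mask[row].assign(cols.begin(), cols.end());
    }
    return mask;
}
```

A forward kernel would consume this as a per-row list of column tiles to iterate over; since each row's list is short and roughly constant-length, total work grows linearly with sequence length instead of quadratically.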

Currently marking this as a draft because I need to integrate some of my supporting data structures into what CK already has.

cameronshinn • Mar 22 '24 20:03

Very cool work @cameronshinn, I'm sure people in the community will be able to use it!

If you haven't seen it already, we have a Discord community studying all things ML libraries and kernels: https://github.com/cuda-mode (P.S. we do all hardware).

Iron-Bound • Apr 01 '24 10:04

Outdated. New PR here: https://github.com/ROCm/composable_kernel/pull/1340.

cameronshinn • Jun 14 '24 07:06