Initial implementation of block-sparse FMHA
I have an example implementation of a block-sparse attention kernel that builds on top of the existing fmha_fwd example. Current benchmarks show roughly a 2.22x speedup at a 4K sequence length and a 15.54x speedup at a 32K sequence length (using a 128x128 tile size). The attention mask is the same as the one described in the BigBird paper; see the sketch below.
Currently marking as a draft because I need to integrate some of my supporting data structures into what CK already has.
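To make the mask pattern concrete, here is a minimal standalone sketch of a BigBird-style block mask (global + sliding-window + random blocks) over 128x128 tiles. The block counts and names here are illustrative assumptions, not the kernel's actual configuration or data structures:

```cpp
// Hypothetical sketch of a BigBird-style block mask over num_blocks x num_blocks
// tiles; not this PR's actual data structure.
#include <cstdio>
#include <random>
#include <vector>

int main()
{
    const int num_blocks = 32; // e.g. 4K sequence length / 128 tile size (assumed)
    const int window     = 1;  // blocks attended on each side of the diagonal (assumed)
    const int num_global = 1;  // leading blocks that attend to / are attended by all (assumed)
    const int num_random = 2;  // random blocks per block-row (assumed)

    std::vector<char> mask(num_blocks * num_blocks, 0);
    std::mt19937 rng(0);
    std::uniform_int_distribution<int> dist(0, num_blocks - 1);

    for(int i = 0; i < num_blocks; ++i)
    {
        for(int j = 0; j < num_blocks; ++j)
        {
            const bool in_window = (j >= i - window && j <= i + window);
            const bool is_global = (i < num_global || j < num_global);
            if(in_window || is_global)
                mask[i * num_blocks + j] = 1;
        }
        for(int r = 0; r < num_random; ++r)
            mask[i * num_blocks + dist(rng)] = 1; // random blocks
    }

    // Print the block mask: '#' = computed block, '.' = skipped block
    for(int i = 0; i < num_blocks; ++i)
    {
        for(int j = 0; j < num_blocks; ++j)
            std::putchar(mask[i * num_blocks + j] ? '#' : '.');
        std::putchar('\n');
    }
    return 0;
}
```

Only the '#' blocks need to be computed; skipping the '.' blocks is where the speedup comes from, which is why the gain grows with sequence length (the dense cost is quadratic while the sparse cost is roughly linear in the number of blocks).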
Very cool work @cameronshinn! I'm sure people in the community will be able to use it!
If you haven't seen it already, we have a Discord community studying all things ML libraries and kernels: https://github.com/cuda-mode (P.S. we do all hardware)
Outdated. New PR here https://github.com/ROCm/composable_kernel/pull/1340.