composable_kernel
Hstu attention n0loop fused unroll pr
This PR adds an implementation of HSTU attention to ck_tile. HSTU attention differs substantially from the FMHA already implemented in ck_tile; for details, please refer to the HSTU paper.
The implementation has been verified on MI300 for both functional correctness and the targeted performance, but it includes no MI350-specific optimizations.
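For reviewers unfamiliar with HSTU, here is a rough single-head reference sketch of the main difference from softmax-based attention, as I understand it from the HSTU paper: scores go through a pointwise SiLU instead of a row-wise softmax. The function name `hstu_attention_ref`, the normalization by sequence length, and the plain causal mask are assumptions made for illustration. Relative attention bias and the masking variants handled by the actual kernel are omitted. This is not the ck_tile code in this PR.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// SiLU activation: x * sigmoid(x).
inline float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Naive single-head reference (hypothetical helper, not part of ck_tile).
// q, k, v are row-major [seqlen x dim]; the result has the same layout.
// Assumption: scores are normalized by seqlen rather than by a softmax sum.
std::vector<float> hstu_attention_ref(const std::vector<float>& q,
                                      const std::vector<float>& k,
                                      const std::vector<float>& v,
                                      std::size_t seqlen,
                                      std::size_t dim,
                                      float scale,
                                      bool causal)
{
    assert(q.size() == seqlen * dim);
    assert(k.size() == seqlen * dim);
    assert(v.size() == seqlen * dim);

    std::vector<float> out(seqlen * dim, 0.0f);
    for(std::size_t i = 0; i < seqlen; ++i)
    {
        // With a causal mask, query i only sees keys 0..i.
        const std::size_t j_end = causal ? i + 1 : seqlen;
        for(std::size_t j = 0; j < j_end; ++j)
        {
            // Dot product of query row i and key row j.
            float s = 0.0f;
            for(std::size_t d = 0; d < dim; ++d)
                s += q[i * dim + d] * k[j * dim + d];

            // Pointwise SiLU replaces the row-wise softmax of standard attention.
            const float p = silu(scale * s) / static_cast<float>(seqlen);

            for(std::size_t d = 0; d < dim; ++d)
                out[i * dim + d] += p * v[j * dim + d];
        }
    }
    return out;
}
```

Because each weight depends only on its own score, no running row maximum or row sum is needed, unlike in a softmax-based FMHA pipeline.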
To build
#> cd build; ../scripts/cmake-ck-dev.sh .. gfx942; make -j 128 tile_example_hstu_attention
To verify
#> . examples/ck_tile/23_hstu_attention/scripts/test_hstu_attention.sh
All HSTU code is located under examples/ck_tile/23_hstu_attention, but this PR also makes a few small changes to the core ck_tile code under include/ck_tile/core/tensor.