composable_kernel
Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators
### Problem Description To debug `02_gemm_add_add_fastgelu` with the client API, I tried to enable `arg.Print()` under `Invoker::Run()` as follows: ```c++ // Invoker struct Invoker : public BaseInvoker { using Argument =...
Add a new `fmha_fwd_appendkv()` API that runs ahead of the `fmha_fwd()`/`fmha_fwd_splitkv()` API. The `fmha_fwd_appendkv()` + `fmha_fwd()`/`fmha_fwd_splitkv()` combination implements the functionality of `mha_fwd_kvcache()` in FA 2.5 (without the paged-kvcache part).
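The two-phase idea (append new keys/values to the cache first, then run the forward attention over the whole cache) can be sketched on the CPU as below. This is only an illustration under assumptions: `KvCache`, `append_kv`, and `attend` are hypothetical names invented for this sketch, it handles a single head and a single query in `float`, and it does not reflect the actual signatures or kernels of `fmha_fwd_appendkv()` or `fmha_fwd()`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical single-head KV cache (not a composable_kernel type).
// K and V are stored row-major as seqlen x head_dim.
struct KvCache
{
    std::size_t head_dim;
    std::size_t seqlen = 0; // number of tokens currently cached
    std::vector<float> k;
    std::vector<float> v;
};

// Phase 1: stands in for the "append KV" step. Copies the new K/V rows
// to the tail of the cache and advances the cached sequence length.
void append_kv(KvCache& cache,
               const std::vector<float>& k_new,
               const std::vector<float>& v_new)
{
    assert(k_new.size() == v_new.size());
    assert(k_new.size() % cache.head_dim == 0);
    cache.k.insert(cache.k.end(), k_new.begin(), k_new.end());
    cache.v.insert(cache.v.end(), v_new.begin(), v_new.end());
    cache.seqlen += k_new.size() / cache.head_dim;
}

// Phase 2: stands in for the forward step. Computes
// softmax(q . K^T / sqrt(head_dim)) . V for one query vector.
std::vector<float> attend(const KvCache& cache, const std::vector<float>& q)
{
    assert(q.size() == cache.head_dim && cache.seqlen > 0);
    const float scale = 1.0f / std::sqrt(static_cast<float>(cache.head_dim));

    // Scaled dot-product scores, tracking the max for a stable softmax.
    std::vector<float> score(cache.seqlen);
    float max_score = -INFINITY;
    for(std::size_t t = 0; t < cache.seqlen; ++t)
    {
        float dot = 0.0f;
        for(std::size_t d = 0; d < cache.head_dim; ++d)
            dot += q[d] * cache.k[t * cache.head_dim + d];
        score[t]  = dot * scale;
        max_score = std::max(max_score, score[t]);
    }

    // Exponentiate and accumulate the softmax denominator.
    float denom = 0.0f;
    for(std::size_t t = 0; t < cache.seqlen; ++t)
    {
        score[t] = std::exp(score[t] - max_score);
        denom += score[t];
    }

    // Weighted sum of the cached V rows.
    std::vector<float> out(cache.head_dim, 0.0f);
    for(std::size_t t = 0; t < cache.seqlen; ++t)
    {
        const float w = score[t] / denom;
        for(std::size_t d = 0; d < cache.head_dim; ++d)
            out[d] += w * cache.v[t * cache.head_dim + d];
    }
    return out;
}
```

In a decoding loop, a caller would invoke `append_kv` once per new token and then `attend` with that token's query, which mirrors the append-then-forward ordering described above.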
This will reduce the size of binaries built with ROCm 6.2+ compilers by at least 50%.
This will help prevent CI pipeline crashes caused by nodes running out of disk space.
Added structural sparsity blockwise gemm
Enabled bf16 atomic_add on MI300
* This PR generates the mha static lib from `generate.py`.