composable_kernel
Extend support for ak1 / bk1 WMMA
Proposed changes
Summary:
- Add support for AK1 != BK1
- Add support for AK1, BK1 > 8
- Introduce KInner template parameter for pipelines when loading multiple tiles with one instruction
Scenarios now supported:
| A Type | B Type | AK1 | BK1 | A Load Vmem | A Store LDS | A Load LDS | B Load Vmem | B Store LDS | B Load LDS | KInner |
|---|---|---|---|---|---|---|---|---|---|---|
| FP16 | FP16 | 2 | 8 | 32 bit | 32 bit | 32 bit | 128 bit | 128 bit | 128 bit | 1 |
| FP8 | FP8 | 16 | 16 | 128 bit | 128 bit | 128 bit | 128 bit | 128 bit | 128 bit | 2 |
| FP8 | FP16 | 16 | 8 | 128 bit | 128 bit | 64 bit | 128 bit | 128 bit | 128 bit | 1 |
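The global-memory (Vmem) load widths in the table are consistent with a simple relation: one vectorized access moves K1 contiguous elements, so its width is K1 times the element size. The sketch below checks that relation against the three rows. The helper `vector_access_bits` and the constants are hypothetical names used only for illustration; they are not part of composable_kernel. The relation is checked only for the Vmem load columns, because the A Load LDS entry of the FP8/FP16 row (64 bit) is narrower than AK1 times the element size.

```cpp
#include <cstdint>

// Hypothetical helper (not a composable_kernel API): width in bits of one
// vectorized access that moves k1 contiguous elements along the K dimension.
constexpr std::uint32_t vector_access_bits(std::uint32_t element_bits, std::uint32_t k1)
{
    return element_bits * k1;
}

// Element sizes for the types listed in the table.
constexpr std::uint32_t kFp16Bits = 16;
constexpr std::uint32_t kFp8Bits  = 8;

// Row 1: FP16 x FP16, AK1 = 2, BK1 = 8 (AK1 != BK1).
static_assert(vector_access_bits(kFp16Bits, 2) == 32, "A Load Vmem, row 1");
static_assert(vector_access_bits(kFp16Bits, 8) == 128, "B Load Vmem, row 1");

// Row 2: FP8 x FP8, AK1 = BK1 = 16 (K1 > 8).
static_assert(vector_access_bits(kFp8Bits, 16) == 128, "A Load Vmem, row 2");
static_assert(vector_access_bits(kFp8Bits, 16) == 128, "B Load Vmem, row 2");

// Row 3: FP8 x FP16, AK1 = 16, BK1 = 8 (mixed types, AK1 != BK1).
static_assert(vector_access_bits(kFp8Bits, 16) == 128, "A Load Vmem, row 3");
static_assert(vector_access_bits(kFp16Bits, 8) == 128, "B Load Vmem, row 3");
```

All checks are `static_assert`s, so the file only needs to compile; nothing runs at execution time.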
Checklist
Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.
- [x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally
- [x] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
- [x] I have added inline documentation which helps the maintainers understand the motivation
- [x] I have removed the stale documentation which is no longer relevant after this pull request
- [x] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
- [x] I have run clang-format on all changed files
- [x] Any dependent changes have been merged
Discussion
If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered