Ilya V
- Add WMMA layout to TritonGPU dialect
- Support required methods for it

Please note, lowering to WMMA instructions is not supported yet.
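As an illustration only, the parameters such a layout encoding might carry can be sketched in Python; the field names here are assumptions, not the dialect's actual attribute definition:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WmmaLayout:
    # Hypothetical mirror of a TritonGPU WMMA encoding attribute.
    version: int                  # WMMA generation (e.g. 1 for gfx11)
    warps_per_cta: tuple          # warp arrangement over the output tile

    def instr_shape(self):
        # gfx11 WMMA instructions operate on fixed 16x16x16 tiles.
        return (16, 16, 16)

layout = WmmaLayout(version=1, warps_per_cta=(2, 2))
print(layout.instr_shape())  # (16, 16, 16)
```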
This commit fixes a failure in python/tutorials/03-matrix-multiplication.py for FMA cases.
Convert operands to fp16 and apply the fp16 WMMA instruction. Add a lit test.
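The mixed-precision semantics this implies (operands truncated to fp16, accumulation kept in higher precision) can be modeled in plain Python via a half-precision round-trip; this is a sketch of the numeric behavior, not the lowering itself:

```python
import struct

def to_fp16(x: float) -> float:
    # Round-trip through IEEE half precision ('e' format) to model
    # the fp16 conversion applied to each WMMA operand.
    return struct.unpack('e', struct.pack('e', x))[0]

def dot_fp16_acc_fp32(a, b):
    # Operands are truncated to fp16; the accumulation stays in a
    # wider type, matching WMMA's mixed-precision dot product.
    return sum(to_fp16(x) * to_fp16(y) for x, y in zip(a, b))

print(dot_fp16_acc_fp32([1.0, 2.0], [3.0, 4.0]))  # 11.0
```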
- Generated intrinsics according to operand types, and cached them to avoid repetitive calculations
- Fixed version-dependent parameters in the main logic of the WMMA operation generator...
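A minimal sketch of type-driven intrinsic selection with caching, assuming a lookup table keyed by operand element types (the table below lists a few known gfx11 WMMA intrinsic names, but the real generator's selection logic is more involved):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def wmma_intrinsic(a_elem: str, b_elem: str, c_elem: str) -> str:
    # Cache the result so repeated lowering of the same operand
    # types does not redo the selection work.
    table = {
        ("f16", "f16", "f32"): "llvm.amdgcn.wmma.f32.16x16x16.f16",
        ("bf16", "bf16", "f32"): "llvm.amdgcn.wmma.f32.16x16x16.bf16",
        ("f16", "f16", "f16"): "llvm.amdgcn.wmma.f16.16x16x16.f16",
    }
    return table[(a_elem, b_elem, c_elem)]

print(wmma_intrinsic("f16", "f16", "f32"))
```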
- Provided required arguments to the store operation
- Added a testcase to test_core.py::test_store_cache_modifier
- Skip gfx11 arch in cache modifier load/store tests

The current mapping is as follows: Loads: ca (default) - cache at...
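For illustration, the load-side mapping can be sketched as a small table; the descriptions follow the well-known PTX-style semantics of `.ca` and `.cg`, while the actual per-architecture bits chosen by the backend are not reproduced here:

```python
# Illustrative sketch: Triton exposes PTX-style cache modifiers, and
# the AMD backend must remap them to its own cache-control bits.
LOAD_CACHE_MODIFIERS = {
    "":    "cache at all levels",               # default behaves like .ca
    ".ca": "cache at all levels",
    ".cg": "cache at global level, bypass L1",
}

def describe_load(modifier: str = "") -> str:
    # Unknown modifiers raise KeyError, mirroring a strict mapping.
    return LOAD_CACHE_MODIFIERS[modifier]

print(describe_load(".cg"))
```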
- Generated intrinsics for WMMA calculations
- Generate tied instructions along the M axis if possible

Results for the FA benchmark (from [here](https://github.com/jfactory07/flash-attention-gfx11.git)) for the gfx11 (W7900) target:

Thanks @jfactory07 for the...
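The "if possible" pairing of adjacent tiles along M can be sketched abstractly; this is a simplified illustration of grouping, not the backend's actual tied-operand emission:

```python
def tie_along_m(m_tiles):
    # Illustrative pairing: adjacent tiles along the M axis are fused
    # into one tied pair when two are available; a leftover tile at
    # the end is emitted on its own.
    pairs, i = [], 0
    while i + 1 < len(m_tiles):
        pairs.append((m_tiles[i], m_tiles[i + 1]))
        i += 2
    if i < len(m_tiles):
        pairs.append((m_tiles[i],))
    return pairs

print(tie_along_m([0, 1, 2, 3, 4]))  # [(0, 1), (2, 3), (4,)]
```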