Ilya V

Results 6 issues of Ilya V

- Add WMMA layout to TritonGPU dialect
- Support required methods for it

Please note, lowering to WMMA instructions is not supported yet.

This commit fixes a failure in python/tutorials/03-matrix-multiplication.py for FMA cases.

Convert operands to fp16 and apply the fp16 WMMA instruction. Add a lit test.

- Added intrinsic generation according to the operand types, caching the intrinsics to avoid repeated computation
- Fixed version-dependent parameters in the main logic of the WMMA operation generator...
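The caching idea in the item above (derive something from the operand types once, then reuse it) can be sketched in a few lines. This is only an illustration, not the PR's code: the function `wmma_intrinsic_name`, its parameters, and the name format it produces are all hypothetical, and the actual change lives in the compiler's lowering code rather than in Python.

```python
from functools import lru_cache


# Hypothetical helper: the name format below is made up for illustration
# and is not the intrinsic naming used by the compiler.
@lru_cache(maxsize=None)
def wmma_intrinsic_name(result_type: str, operand_type: str, version: int) -> str:
    """Build an intrinsic name for a (result type, operand type, version) key.

    lru_cache memoizes the result, so the body runs once per distinct key.
    """
    return f"wmma.v{version}.{result_type}.16x16x16.{operand_type}"


first = wmma_intrinsic_name("f32", "f16", 1)    # computed
second = wmma_intrinsic_name("f32", "f16", 1)   # same key, served from the cache
other = wmma_intrinsic_name("f32", "bf16", 1)   # new key, computed once
info = wmma_intrinsic_name.cache_info()
```

Keying the cache on every input that affects the result (here both types and the version) is what keeps a memoized lookup like this safe when parameters differ between hardware versions.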

- Provided required arguments to store operation
- Added testcase to test_core.py::test_store_cache_modifier
- Skip gfx11 arch in cache modifiers load/store tests

Current mapping is as follows: Loads: ca(default) - cache at...

- Generated intrinsic for WMMA calculations
- Generate tied instructions along M axis if possible.

Results for FA benchmark (from [here](https://github.com/jfactory07/flash-attention-gfx11.git)) for gfx11 (W7900) target:

![image](https://github.com/user-attachments/assets/71de4f25-9a05-48ea-8a37-2d8f69d5e48a)

Thanks @jfactory07 for the...