triton [WIP] [AMD] Emit AMD specific intrinsics for dot

Open binarman opened this issue 5 months ago • 1 comments

This PR:

Makes the AccelerateAMDMatmul pass emit FMA dot for the i8xi8->i32 and fp16xfp16->fp32 cases
Extends AMD FMA dot code generation with new v_dot instructions for the fp16xfp16 and int8 dtypes
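
For readers unfamiliar with packed dot instructions, here is a minimal Python sketch of the arithmetic an i8xi8->i32 dot step performs. It only models the math and is not the code this PR generates. The helper names (`dot4_i8_acc_i32`, `fma_dot_i8`) and the group width of 4 are illustrative assumptions, not taken from the PR.

```python
def dot4_i8_acc_i32(a, b, acc):
    """Model one packed dot step: four int8 products summed into an accumulator.

    Hypothetical helper: the group width of 4 is an assumption made for
    illustration. Python ints are unbounded, so 32-bit overflow behaviour of
    a real accumulator is not modelled here.
    """
    assert len(a) == len(b) == 4
    for x in (*a, *b):
        assert -128 <= x <= 127, "operands must fit in int8"
    return acc + sum(x * y for x, y in zip(a, b))


def fma_dot_i8(a_row, b_col, acc=0):
    """Dot product along K, consumed in groups of 4 elements.

    A scalar FMA path would issue one multiply-add per element; the packed
    form issues one step per group of 4.
    """
    assert len(a_row) == len(b_col) and len(a_row) % 4 == 0
    for k in range(0, len(a_row), 4):
        acc = dot4_i8_acc_i32(a_row[k:k + 4], b_col[k:k + 4], acc)
    return acc


if __name__ == "__main__":
    print(fma_dot_i8([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8))
```

The fp16xfp16->fp32 case follows the same pattern with narrower groups of fp16 operands and a float accumulator.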

This PR is part of a PR series. The final goal is to improve the efficiency of small dot operations and bypass as many shared memory accesses as possible.

Rough list of PRs:

[ ] Basic FMA dot fixes, dot 3d support and relaxing small dimensions for dot #4516
[ ] Blocked->dotOp shared memory bypassing #4538
[ ] Accelerate AMD Matmul + emit dot operations (this PR) #4594
[ ] Layout optimization, so that operand B is loaded in the proper mfma layout and does not need to go through LDS #4581
[ ] Vectorization optimization of dot operands/results (in case LLVM cannot do this internally)
[ ] Reduction operation hoisting out of the K loop (the reduction operation is a byproduct of the layout optimization step) #4559
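
To illustrate the last item, the sketch below shows the general idea of hoisting a reduction out of a loop over K. It is an assumption about what the hoisting amounts to, not the transformation implemented in #4559. The function names and the "lanes" terminology are hypothetical, and the tile lists are assumed to be non-empty and of equal shape.

```python
def reduce_inside_loop(a_tiles, b_tiles):
    """Reduce across lanes on every K iteration (the un-hoisted form)."""
    acc = 0
    for a, b in zip(a_tiles, b_tiles):
        partial = [x * y for x, y in zip(a, b)]  # one product per lane
        acc += sum(partial)                      # cross-lane reduction each step
    return acc


def reduce_hoisted(a_tiles, b_tiles):
    """Keep per-lane accumulators in the loop and reduce once afterwards."""
    lanes = len(a_tiles[0])
    lane_acc = [0] * lanes
    for a, b in zip(a_tiles, b_tiles):
        for i in range(lanes):
            lane_acc[i] += a[i] * b[i]           # no cross-lane traffic here
    return sum(lane_acc)                         # single reduction after the loop


if __name__ == "__main__":
    a_tiles = [[1, 2, 3, 4], [5, 6, 7, 8]]
    b_tiles = [[1, 1, 1, 1], [2, 2, 2, 2]]
    print(reduce_inside_loop(a_tiles, b_tiles), reduce_hoisted(a_tiles, b_tiles))
```

With integer inputs both forms give identical results. With floating-point inputs the changed summation order can alter rounding, so results may differ slightly.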

Aug 28 '24 19:08 binarman