Hoist reduction outside a loop

Open binarman opened this issue 6 months ago • 6 comments

This PR introduces an optimization that hoists the reduction of the dot accumulator out of the loop over the K dimension:

    %acc = for k tiles:
      %acc3d_input = reshape %acc
      %acc3d_out = dot3d(%x, %y, %acc3d_input)
      %acc = reduction batch %acc3d_out

transforms to:

    %acc3d = for k tiles:
      %acc3d = dot3d(%x, %y, %acc3d)
    %acc = reduction batch %acc3d
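To illustrate why the hoist preserves the result, here is a minimal pure-Python sketch, not the actual Triton pass. It models tensors as nested lists and uses integer data so both variants compare exactly. All function names and sizes are hypothetical, and `to_3d` is only one assumed interpretation of the `reshape` step (the 2D accumulator goes into batch slice 0, with zeros elsewhere).

```python
# Illustrative model only: nested lists stand in for tensors.
B, M, N, K = 2, 2, 3, 2  # batch, rows, cols, K per tile (hypothetical sizes)

def dot3d(x, y, acc):
    """Batched matmul with accumulator: out[b] = x[b] @ y[b] + acc[b]."""
    return [[[acc[b][i][j] + sum(x[b][i][k] * y[b][k][j] for k in range(K))
              for j in range(N)] for i in range(M)] for b in range(B)]

def reduce_batch(t):
    """Sum a [B][M][N] tensor over the batch dimension, giving [M][N]."""
    return [[sum(t[b][i][j] for b in range(B)) for j in range(N)]
            for i in range(M)]

def to_3d(acc2d):
    """Assumed 'reshape': acc in batch slice 0, zeros in the other slices."""
    zeros = [[0] * N for _ in range(M)]
    return [[row[:] for row in acc2d]] + [[r[:] for r in zeros] for _ in range(B - 1)]

def make_tiles(num_tiles):
    """Deterministic integer (x, y) tiles, one pair per K-loop iteration."""
    tiles = []
    for t in range(num_tiles):
        x = [[[(t + b + i + k) % 5 for k in range(K)] for i in range(M)] for b in range(B)]
        y = [[[(2 * t + b + k + 3 * j) % 7 for j in range(N)] for k in range(K)] for b in range(B)]
        tiles.append((x, y))
    return tiles

def reduce_inside_loop(tiles):
    """Original form: reduce over batch on every K iteration."""
    acc = [[0] * N for _ in range(M)]
    for x, y in tiles:
        acc3d_out = dot3d(x, y, to_3d(acc))
        acc = reduce_batch(acc3d_out)
    return acc

def reduce_hoisted(tiles):
    """Transformed form: keep a 3D accumulator, reduce once after the loop."""
    acc3d = [[[0] * N for _ in range(M)] for _ in range(B)]
    for x, y in tiles:
        acc3d = dot3d(x, y, acc3d)
    return reduce_batch(acc3d)

tiles = make_tiles(4)
assert reduce_inside_loop(tiles) == reduce_hoisted(tiles)
```

The two forms agree because summation over the batch dimension is linear, so it commutes with accumulation across K tiles. With floating-point data the summation order changes, so results may differ by rounding.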

This PR is part of a series of PRs. The final goal is to improve the efficiency of small dot operations and bypass as many shared memory accesses as possible.

Rough list of PRs:

  • [ ] Basic FMA dot fixes, dot 3d support and relaxing small dimensions for dot #4516
  • [ ] Blocked->dotOp shared memory bypassing #4538
  • [ ] Accelerate AMD Matmul + emit dot operations #4594
  • [ ] Layout optimization, so that operand B is loaded in the proper mfma layout and does not need to go through LDS #4581
  • [ ] Vectorization optimization of dot operands/results (in case LLVM cannot do this internally)
  • [ ] Reduction operation hoisting out of the K loop (the reduction operation is a byproduct of the layout optimization step) (this PR) #4559

binarman, Aug 22 '24 20:08