[WIP] Optimize fma dot
The core Triton team is small, and we receive many PRs (thank you!). To help us review your code more quickly, if you are a new contributor (fewer than 3 PRs merged), we ask that you complete the following tasks and include the filled-out checklist in your PR description.
This PR is part of a series of PRs. The final goal is to improve the efficiency of small dot operations and bypass as many shared memory accesses as possible.
Rough list of PRs:
- [ ] Basic FMA dot fixes, 3D dot support, and relaxing small dimensions for dot #4516
- [ ] Blocked->dotOp shared memory bypassing #4538
- [ ] Accelerate AMD Matmul + emit dot operations #4594
- [ ] Layout optimization, so that operand B is loaded in the proper mfma layout and does not need to go through LDS (this PR) #4581
- [ ] Vectorization optimization of dot operands/results (in case LLVM cannot do this internally)
- [ ] Hoisting the reduction operation out of the K loop (the reduction operation is a byproduct of the layout optimization step) #4559
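
To illustrate the idea behind the last item, here is a minimal pure-Python sketch of hoisting a reduction out of the K loop. It is not the code from any PR in this series: the function names and the `lanes` parameter are hypothetical, and a plain list stands in for per-thread registers. The first version reduces across lanes on every K iteration; the second keeps one accumulator per lane, performs only multiply-adds inside the loop, and reduces once at the end.

```python
def dot_reduce_in_loop(a, b, lanes):
    """Dot product that reduces across lanes on every K-loop iteration."""
    acc = 0
    for k0 in range(0, len(a), lanes):
        # Each "lane" computes one product for this K step.
        partial = [a[k] * b[k] for k in range(k0, min(k0 + lanes, len(a)))]
        # Cross-lane reduction performed inside the loop.
        acc += sum(partial)
    return acc


def dot_reduce_hoisted(a, b, lanes):
    """Dot product with the cross-lane reduction hoisted out of the K loop."""
    lane_acc = [0] * lanes  # one accumulator per lane (hypothetical layout)
    for k0 in range(0, len(a), lanes):
        for lane in range(min(lanes, len(a) - k0)):
            # Only a per-lane multiply-add inside the loop.
            lane_acc[lane] += a[k0 + lane] * b[k0 + lane]
    # Single cross-lane reduction after the loop.
    return sum(lane_acc)


a = [1, 2, 3, 4, 5, 6, 7]
b = [7, 6, 5, 4, 3, 2, 1]
print(dot_reduce_in_loop(a, b, lanes=4), dot_reduce_hoisted(a, b, lanes=4))
```

With integer inputs both versions return the same value. With floating-point inputs the results may differ slightly, because the summation order changes.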