[ROADMAP] Triton-MLIR initial release roadmap
This issue summarizes what needs to be done on the triton-mlir
branch before it can be merged into the main branch. The list is not exhaustive and has room to grow.
Frontend:
- [x] Define the specs of Triton-IR, our frontend-facing MLIR dialect
- [x] Change Triton's ASTVisitor so that it produces Triton-IR code
- [x] Ahead-of-time / Kernel launch refactor
Optimizer:
- [x] Define the specs of TritonGPU-IR, our optimizer-facing MLIR dialect
- [x] Improve layout abstractions to better accommodate BroadcastOp
- [x] Implement rewrite patterns for Triton/TritonGPU-IR
- [x] Implement the Triton-IR => TritonGPU-IR conversion pass
- [x] Implement asynchronous loop prefetching pass
- [x] Implement a pass that determines contiguity/constancy/divisibility info about tensor elements
- [x] Implement memory coalescing pass
- [x] Implement layout conversion simplification pass
- [ ] Implement matmul slicing optimization
- [ ] Implement a re-association pass for add/getelementptr to better leverage immediate offsets on NVIDIA GPUs
Backend:
- [x] Shared memory allocation
- [x] Shared memory barrier placement
LLVM code-gen:
- [x] Index calculation for blocked_layout
- [ ] More unit tests for corner-case verification: higher ranks, reversed order, etc.
- [x] Basic op support: Load/Store, GEP, Splat, Constant, Elementwise, Broadcast
- [x] VecAdd correctness verified in the Python end-to-end flow (see the sketch below)
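For reference, a minimal sketch of what the vector-add end-to-end check looks like from the Python side, assuming the public Triton API (`triton.jit`, `tl.load`, `tl.store`) and a CUDA build of PyTorch; the kernel name, block size, and tensor size below are illustrative, not the actual test code:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard the tail of the tensor
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def vec_add(x, y):
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta['BLOCK_SIZE']),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.rand(98432, device='cuda')
y = torch.rand(98432, device='cuda')
assert torch.allclose(vec_add(x, y), x + y)   # end-to-end correctness check
```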
Remaining TODOs for Load/StoreOp:
- [x] Refactoring of LoadOp with PtxInstr abstraction
- [x] Vectorization support with AxisInfo
- [ ] Mask support in load/store vectorization (ongoing)
- [x] GEP + load/store fold optimization
- [ ] Verification of the L1 eviction policy for load/store (lower)
Shared_layout related:
- [x] Shared memory initialization in TritonGPUToLLVM from the results of Allocation/Alias
- [ ] ConvertLayoutOp support (higher priority)
  - [x] blocked -> blocked
  - [ ] blocked -> shared / shared -> blocked (high)
  - [ ] blocked -> mma / mma -> blocked (high)
- [ ] sliced_layout & transpose kernel (higher priority) (ongoing, almost done)
- [ ] alloc_tensor, update_slice, extract_slice support, double_buffer + N_buffer (lower) (high)
- [ ] swizzle (lower)
mma_layout related:
- [ ] Codegen for dot (high)
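For context, "dot" here is `tl.dot`, the block-level matrix-multiply primitive that lowers through mma_layout. A rough sketch of the kind of kernel that depends on this codegen path, assuming the usual Triton matmul structure; the kernel name, stride arguments, and block sizes are illustrative, and tail masking is omitted for brevity:

```python
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # One program computes one BLOCK_M x BLOCK_N tile of C = A @ B.
    # Assumes M, N, K are divisible by the block sizes (no boundary masks).
    pid_m = tl.program_id(axis=0)
    pid_n = tl.program_id(axis=1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs)
        b = tl.load(b_ptrs)
        acc += tl.dot(a, b)              # the op that needs mma_layout codegen
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)                # float32 accumulator written to C
```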
Completeness of op coverage:
- [ ] Elementwise Ops
- [ ] Reduce Ops (ongoing)
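"Reduce Ops" covers block-level reductions such as `tl.sum` and `tl.max`. A minimal, illustrative sketch of a kernel that exercises a reduction (the name `row_sum_kernel` and the row-per-program layout are assumptions, not the actual tests):

```python
import triton
import triton.language as tl

@triton.jit
def row_sum_kernel(x_ptr, out_ptr, n_cols, BLOCK_SIZE: tl.constexpr):
    # Each program reduces one row of a contiguous 2D tensor to a scalar.
    # BLOCK_SIZE must be a power of two >= n_cols.
    row = tl.program_id(axis=0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0)
    tl.store(out_ptr + row, tl.sum(x, axis=0))   # block-level ReduceOp
```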
Excited to see the new MLIR backend. Does the TODO item "Codegen for dot (high)" imply that matmul is not working yet?
Yes, that's correct. It will take some time, but we wanted to open-source what we have so far so that people interested in non-NVIDIA backends could start looking at the Triton dialects.
I think the MLIR rewrite is officially complete 🥳 Closing this