[ROADMAP] Triton-MLIR initial release roadmap
This issue summarizes what needs to be done on the triton-mlir
branch before it can be merged into the main branch. The list is not exhaustive and has room to grow.
Frontend:
- [x] Define the specs of Triton-IR, our frontend-facing MLIR dialect
- [x] Change Triton's ASTVisitor so that it produces Triton-IR code
- [x] Ahead-of-time / Kernel launch refactor
Optimizer:
- [x] Define the specs of TritonGPU-IR, our optimizer-facing MLIR dialect
- [x] Improve layout abstractions to better accommodate BroadcastOp
- [x] Implement rewrite patterns for Triton/TritonGPU-IR
- [x] Implement the Triton-IR => TritonGPU-IR conversion pass
- [x] Implement asynchronous loop prefetching pass
- [x] Implement a pass that determines contiguity/constancy/divisibility info about tensor elements
- [x] Implement memory coalescing pass
- [x] Implement layout conversion simplification pass
- [ ] Implement matmul slicing optimization
- [ ] Implement a re-association pass for add/getelementptr to better leverage immediate offsets on NVIDIA GPUs
Backend:
- [x] Shared memory allocation
- [x] Shared memory barrier placement
LLVM code-gen:
- [x] Index calculation for blocked_layout
- [ ] More unit tests for corner-case verification: higher ranks, reversed order, etc.
- [x] Basic op support: Load/Store, GEP, Splat, Constant, Elementwise, Broadcast
- [x] VecAdd correctness verified in the Python end-to-end flow (see the sketch below)
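For reference, a minimal sketch of what the vector-add end-to-end check looks like from the Python side, assuming the public Triton API (`triton.jit`, `tl.load`, `tl.store`) and a CUDA build of PyTorch; the kernel name, block size, and tensor size below are illustrative, not the actual test code:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard the tail of the tensor
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def vec_add(x, y):
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta['BLOCK_SIZE']),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.rand(98432, device='cuda')
y = torch.rand(98432, device='cuda')
assert torch.allclose(vec_add(x, y), x + y)   # end-to-end correctness check
```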
Remaining TODOs for Load/StoreOp:
- [x] Refactoring of LoadOp with PtxInstr abstraction
- [x] Vectorization support with AxisInfo
- [ ] Mask support in load/store vectorization (ongoing)
- [x] GEP + load/store fold optimization
- [ ] Verification of the L1 eviction policy for load/store (lower)
Shared_layout related:
- [x] Shared memory initialization in TritonGPUToLLVM from the results of Allocation/Alias
- [ ] ConvertLayoutOp support (higher priority)
  - [x] blocked -> blocked
  - [ ] blocked -> shared / shared -> blocked (high)
  - [ ] blocked -> mma / mma -> blocked (high)
- [ ] sliced_layout & transpose kernel (higher priority) (ongoing, almost done)
- [ ] alloc_tensor, update_slice, extract_slice support, double_buffer + N_buffer (lower) (high)
- [ ] swizzle (lower)
mma_layout related:
- [ ] Codegen for dot (high)
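For context, "dot" here is `tl.dot`, the block-level matrix-multiply primitive that lowers through mma_layout. A rough sketch of the kind of kernel that depends on this codegen path, assuming the usual Triton matmul structure; the kernel name, stride arguments, and block sizes are illustrative, and tail masking is omitted for brevity:

```python
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # One program computes one BLOCK_M x BLOCK_N tile of C = A @ B.
    # Assumes M, N, K are divisible by the block sizes (no boundary masks).
    pid_m = tl.program_id(axis=0)
    pid_n = tl.program_id(axis=1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs)
        b = tl.load(b_ptrs)
        acc += tl.dot(a, b)              # the op that needs mma_layout codegen
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)                # float32 accumulator written to C
```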
Completeness of op coverage:
- [ ] Elementwise Ops
- [ ] Reduce Ops (ongoing)
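"Reduce Ops" covers block-level reductions such as `tl.sum` and `tl.max`. A minimal, illustrative sketch of a kernel that exercises a reduction (the name `row_sum_kernel` and the row-per-program layout are assumptions, not the actual tests):

```python
import triton
import triton.language as tl

@triton.jit
def row_sum_kernel(x_ptr, out_ptr, n_cols, BLOCK_SIZE: tl.constexpr):
    # Each program reduces one row of a contiguous 2D tensor to a scalar.
    # BLOCK_SIZE must be a power of two >= n_cols.
    row = tl.program_id(axis=0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0)
    tl.store(out_ptr + row, tl.sum(x, axis=0))   # block-level ReduceOp
```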
Excited to see the new MLIR backend. Does the TODO item "Codegen for dot (high)" imply that matmul is not working yet?
Yes, that's correct. It will take some time, but we wanted to open-source what we have so far so that people interested in non-NVIDIA backends could start looking at the Triton dialects.
I think the MLIR rewrite is officially complete 🥳 Closing this