
[ROADMAP] Triton-MLIR initial release roadmap

ptillet opened this issue · 2 comments

This issue summarizes what needs to be done on the triton-mlir branch before it can be merged into the main branch. It is not exhaustive and may grow over time.

Frontend:

  • [x] Define the specs of Triton-IR, our frontend-facing MLIR dialect
  • [x] Change Triton's ASTVisitor so that it produces Triton-IR code
  • [x] Ahead-of-time / Kernel launch refactor
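As a sketch of the ASTVisitor item above: Triton's frontend walks the Python AST of a kernel and emits MLIR operations. The toy visitor below (hypothetical class and op names, not Triton's actual implementation) shows the shape of that translation using only the stdlib `ast` module:

```python
import ast

# Hypothetical sketch: walk a Python kernel's AST and emit
# Triton-IR-like text for binary ops (not Triton's real visitor).
class ToyIRVisitor(ast.NodeVisitor):
    def __init__(self):
        self.lines = []
        self.counter = 0

    def fresh(self):
        self.counter += 1
        return f"%{self.counter}"

    def visit_BinOp(self, node):
        lhs = self.visit(node.left)
        rhs = self.visit(node.right)
        op = {ast.Add: "arith.addf", ast.Mult: "arith.mulf"}[type(node.op)]
        res = self.fresh()
        self.lines.append(f"{res} = {op} {lhs}, {rhs}")
        return res

    def visit_Name(self, node):
        return f"%{node.id}"

    def visit_Return(self, node):
        val = self.visit(node.value)
        self.lines.append(f"tt.return {val}")

src = "def kernel(x, y):\n    return x * y + x"
visitor = ToyIRVisitor()
visitor.visit(ast.parse(src))
print("\n".join(visitor.lines))
```

The real frontend of course handles control flow, tensor types, and builtins; this only illustrates the SSA-value-per-expression pattern.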

Optimizer:

  • [x] Define the specs of TritonGPU-IR, our optimizer-facing MLIR dialect
  • [x] Improve layout abstractions to better accommodate BroadcastOp
  • [x] Implement rewrite patterns for Triton/TritonGPU-IR
  • [x] Implement the Triton-IR => TritonGPU-IR conversion pass
  • [x] Implement asynchronous loop prefetching pass
  • [x] Implement a pass that determines contiguity/constancy/divisibility info about tensor elements
  • [x] Implement memory coalescing pass
  • [x] Implement layout conversion simplification pass
  • [ ] Implement matmul slicing optimization
  • [ ] Implement re-association pass for add/getelementptr to better leverage immediate offsets on NVIDIA GPUs
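The contiguity/constancy/divisibility pass above can be sketched as a simple lattice analysis. The `AxisInfo` name mirrors the helper on the branch, but the fields and the transfer function below are a simplified, hypothetical one-dimensional version:

```python
from dataclasses import dataclass
from math import gcd

# Hypothetical 1-D sketch of the contiguity/constancy/divisibility
# analysis: per value, track run lengths and a known divisor, and
# propagate them through ops such as add.
@dataclass
class AxisInfo:
    contiguity: int    # length of runs where consecutive elements differ by 1
    constancy: int     # length of runs where elements are equal
    divisibility: int  # a divisor of every element

def add(a: AxisInfo, b: AxisInfo) -> AxisInfo:
    return AxisInfo(
        # adding a constant run to a contiguous run preserves contiguity;
        # gcd keeps the result conservative
        contiguity=max(gcd(a.contiguity, b.constancy),
                       gcd(a.constancy, b.contiguity)),
        constancy=gcd(a.constancy, b.constancy),
        divisibility=gcd(a.divisibility, b.divisibility),
    )

# e.g. ptr = base + arange(0, 128): base is uniform across the tensor and
# 16-divisible; arange is fully contiguous.
base = AxisInfo(contiguity=1, constancy=128, divisibility=16)
arange = AxisInfo(contiguity=128, constancy=1, divisibility=128)
print(add(base, arange))
```

The resulting info (128-contiguous, 16-divisible pointers) is exactly what the memory coalescing and vectorization passes consume.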

Backend:

  • [x] Shared memory allocation
  • [x] Shared memory barrier placement
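Shared memory allocation can be sketched as liveness-interval packing: buffers whose lifetimes do not overlap may share the same offset. The greedy first-fit allocator below is a hypothetical simplification, not the branch's actual algorithm:

```python
# Hypothetical sketch of liveness-based shared memory allocation: each
# buffer lives over an inclusive interval of program points; buffers with
# disjoint lifetimes may reuse the same offset.
def allocate(buffers):
    """buffers: list of (name, start, end, size). Returns {name: offset}."""
    placed = []   # (start, end, offset, size)
    offsets = {}
    for name, start, end, size in sorted(buffers, key=lambda b: b[1]):
        # only buffers whose lifetime overlaps this one constrain placement
        live = [p for p in placed if p[0] <= end and start <= p[1]]
        offset = 0
        for _, _, o, s in sorted(live, key=lambda p: p[2]):
            if offset + size <= o:
                break                     # fits in the gap before this buffer
            offset = max(offset, o + s)   # otherwise skip past it
        placed.append((start, end, offset, size))
        offsets[name] = offset
    return offsets

bufs = [("a", 0, 2, 512), ("b", 1, 3, 256), ("c", 3, 4, 512)]
print(allocate(bufs))
```

Here `c` starts after `a` dies, so it reuses offset 0 instead of growing the shared memory footprint.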

LLVM code-gen:

  • [x] Index calculation for blocked_layout
    • [ ] more unit tests for corner cases: higher ranks, reversed order, etc.
  • [x] Basic op support: Load/Store, GEP, Splat, Constant, Elementwise, Broadcast
  • [x] VecAdd correctness verified in the Python end-to-end flow
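The blocked_layout index calculation can be sketched in one dimension. The parameter names echo TritonGPU's sizePerThread / threadsPerWarp / warpsPerCTA attributes, but the function below is a simplified, hypothetical model (real blocked layouts are multi-dimensional and carry an order attribute):

```python
# Hypothetical 1-D sketch of index calculation for a blocked layout:
# given its thread id, which tensor elements does a thread own?
def thread_indices(tid, size_per_thread, threads_per_warp, warps_per_cta,
                   tensor_len):
    warp = tid // threads_per_warp
    lane = tid % threads_per_warp
    # one "super-block" covers every thread's contiguous chunk once
    block = size_per_thread * threads_per_warp * warps_per_cta
    out = []
    # the layout tiles the tensor with block-sized super-blocks
    for start in range(0, tensor_len, block):
        base = start + (warp * threads_per_warp + lane) * size_per_thread
        out.extend(base + i for i in range(size_per_thread)
                   if base + i < tensor_len)
    return out

# thread 0 of one 32-thread warp, 2 elements per thread, tensor of 128
print(thread_indices(0, 2, 32, 1, 128))
```

The "higher ranks, reversed order" unit tests mentioned above would exercise the multi-dimensional generalization of this mapping.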

Remaining TODOs for Load/StoreOp:

  • [x] Refactoring of LoadOp with PtxInstr abstraction
  • [x] vectorization support with AxisInfo
    • [ ] mask support in load/store vectorization (ongoing)
  • [x] gep + load/store fold optimization
  • [ ] verification of L1 eviction policy for load/store (lower)
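The interplay between AxisInfo and load/store vectorization can be sketched as follows: the vector width is bounded by pointer contiguity, alignment (divisibility), and how many consecutive lanes share the same mask value. This is a hypothetical model, not the branch's actual heuristic:

```python
from math import gcd

# Hypothetical sketch: pick a vector width for a masked load from axis
# analysis. All three constraints must hold at once; gcd of powers of
# two stays a power of two, matching hardware vector sizes.
def vector_width(contiguity, divisibility, mask_constancy, max_vec=4):
    return gcd(gcd(contiguity, divisibility), gcd(mask_constancy, max_vec))

# contiguous, 16-element-aligned pointers, mask constant over 8 lanes,
# hardware cap of 4 elements (e.g. 128-bit loads of fp32)
print(vector_width(contiguity=128, divisibility=16, mask_constancy=8))
```

This is why mask support is called out separately above: without knowing the mask's constancy, a masked load cannot safely be widened.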

Shared_layout related:

  • [x] Shared memory initialization in TritonGPUToLLVM from the results of Allocation/Alias
  • [ ] ConvertLayoutOp support (higher priority)
    • [x] blocked -> blocked
    • [ ] blocked -> shared / shared -> blocked (high)
    • [ ] blocked -> mma / mma -> blocked (high)
  • [ ] sliced_layout & transpose kernel (higher priority) (ongoing, almost done)
  • [ ] alloc_tensor, update_slice, extract_slice support, double_buffer + N_buffer (lower) (high)
  • [ ] swizzle (lower)
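The swizzle item refers to permuting shared-memory addresses so that strided accesses (e.g. reading a column during a transpose or a shared→mma conversion) do not collide in the same bank. A minimal XOR-swizzle sketch, assuming 32 banks and a row stride of 32 words:

```python
# Hypothetical sketch of XOR swizzling for shared memory: when 32
# threads each read column c of their own row, the unswizzled layout
# maps every access to the same bank; XOR-ing the column index with the
# row index spreads the accesses across all 32 banks.
NUM_BANKS = 32

def bank(row, col, swizzle):
    c = col ^ row if swizzle else col
    return (row * NUM_BANKS + c) % NUM_BANKS  # row stride = 32 words

def max_conflicts(col, swizzle):
    banks = [bank(row, col, swizzle) for row in range(32)]
    return max(banks.count(b) for b in set(banks))

print(max_conflicts(col=0, swizzle=False))  # all 32 rows hit one bank
print(max_conflicts(col=0, swizzle=True))   # conflict-free
```

Because `row ^ col` is a bijection in `col` for each fixed `row`, row accesses stay conflict-free too; the swizzle trades nothing away on the fast path.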

mma_layout related:

  • [ ] Codegen for dot (high)
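Semantically, `tt.dot` is a tile-level multiply-accumulate; codegen for it means lowering this loop nest onto mma instructions over the mma layout. A plain-Python reference of the computation itself (not of the codegen):

```python
# Reference semantics of a dot tile: acc += a @ b, element by element.
# Codegen replaces this scalar loop nest with hardware mma fragments.
def dot_tile(a, b, acc):
    M, K = len(a), len(a[0])
    N = len(b[0])
    for m in range(M):
        for n in range(N):
            for k in range(K):
                acc[m][n] += a[m][k] * b[k][n]
    return acc

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(dot_tile(a, b, [[0, 0], [0, 0]]))
```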

Completeness of op coverage:

  • [ ] Elementwise Ops
  • [ ] Reduce Ops (ongoing)

ptillet avatar Sep 12 '22 19:09 ptillet

Excited to see the new MLIR backend. Does the TODO item "Codegen for dot (high)" imply that matmul is not working yet?

yuguo68 avatar Sep 15 '22 00:09 yuguo68

Yes, that's correct. It will take some time, but we wanted to open-source what we have so far so that people interested in non-NVIDIA backends could start looking at the Triton dialects.

ptillet avatar Sep 15 '22 00:09 ptillet

I think the MLIR rewrite is officially complete 🥳 Closing this

ptillet avatar Feb 23 '23 04:02 ptillet