
Performance and Prospects: MLIR-Generated CUDA vs CUTLASS/TensorRT?

Open shingjan opened this issue 2 years ago • 5 comments

Hi team,

I formulated this question while reviewing the latest advancements in OpenAI's Triton, particularly the rewrite on top of MLIR (replacing the previous LLVM/PTX codegen path) and the addition of support for Nvidia Hopper. I'm curious about your opinions on the decision to build on MLIR rather than on other CUDA libraries, e.g. CUTLASS/TensorRT, or handwritten CUDA kernels.

I'm interested in understanding the performance implications of MLIR-generated CUDA code compared to alternatives like CUTLASS, TensorRT, or even solutions such as TVM's custom CUDA code generation. From what I've gathered, preliminary results from projects like AITemplate highlight CUTLASS as exceptionally efficient. Additionally, there's the well-established TensorRT, which has historically been a top-performing option in the CUDA ecosystem.

Given my lack of familiarity with MLIR and the CUDA code it's capable of producing, I'm curious about the motivation behind the decision to rewrite with MLIR. In the pull request for the rewrite, @ptillet mentioned the potential for "ultimate performance" with MLIR. Does this suggest that MLIR-generated CUDA code could potentially surpass CUTLASS and TensorRT in specific workloads? It would be incredibly helpful if you could provide some context and delve into this topic further. Thank you in advance for your insights!

shingjan avatar Aug 18 '23 00:08 shingjan

Does this suggest that MLIR-generated CUDA code could potentially surpass CUTLASS and TensorRT in specific workloads

Yes, at least when the frontend is Triton

Jokeren avatar Aug 18 '23 01:08 Jokeren

@Jokeren Thanks for the reply! Does that mean you tried different CUDA codegen paths for Triton, like MLIR/CUTLASS/TensorRT, and the CUDA kernels generated via MLIR yielded the best results? Would it be possible to share some insights on this, e.g. which workloads were tested, what differences in the generated CUDA code led to the perf gap, and what the perf numbers look like? Would really appreciate that!

shingjan avatar Aug 18 '23 20:08 shingjan

@shingjan did you find an answer to your question and/or data affirmatively supporting @Jokeren's statement? If so, it would be great if you could share.

MLIR is just a tool to build a compiler; with enough effort, whether you're using Cutlass or anything else does not matter: all of these paths end up generating NVVM/PTX.

The question is how much effort you need to put in to get the right PTX generated for all the use-cases you care about. Cutlass has a very strong foundation for scaling kernel generation across a wide range of needs, and it provides an excellent reference.

However, while it is a great library, the needs of a compiler powering a DSL like Triton are a bit different, and Cutlass isn't necessarily always the most practical option for building such a compiler. MLIR, on the other hand, provides an excellent foundation for building such a tool, and you can then take inspiration from the Cutlass recipes as a blueprint for your MLIR compiler.

joker-eph avatar May 07 '24 06:05 joker-eph

Indeed, yet Effort * Experience is not free. And I can't agree more with "take inspiration from the Cutlass recipes as a blueprint to build your MLIR compiler". So... the proof is in the pudding:

(i) What inspiration does the Triton MLIR compiler actually take from CUTLASS?

(ii) Are there any interesting benchmarks over MLP dimensions showing relative performance, say on Ampere, where both are stable?

(iii) BTW, is there any compare-and-contrast between the Triton IR approach and Nvidia's Graphene IR, which does seem to take some inspiration (at least) from CuTe? The Graphene IR talk: https://www.youtube.com/watch?v=wtpxe5RREjQ and the paper: https://dl.acm.org/doi/pdf/10.1145/3582016.3582018