[DOC]: Could you add a speed comparison with Triton/ThunderKittens?
How would you describe the priority of this documentation request?
Critical (currently preventing usage)
Please provide a link or source to the relevant docs
https://docs.nvidia.com/cuda/cutile-python/performance.html
Describe the problems in the documentation
Thanks for your awesome work.
I believe this official repo would bring significant efficiency gains over a naive PyTorch implementation. However, I recommend adding a speed benchmark against alternatives such as Triton and ThunderKittens. This would be incredibly helpful for users.
(Optional) Propose a correction
No response
Contributing Guidelines
- [x] I agree to follow cuTile Python's contributing guidelines
- [x] I have searched the open documentation and have found no duplicates for this documentation request
@EricLina thank you for your interest. We don't have plans to publish any benchmarks, but we have some samples to help you get started on running your own comparison. The focus of this project is building a good programming language for targeting CUDA TileIR, with a few kernel examples as reference. The TileGym project provides more kernel and e2e model examples: https://github.com/nvidia/tilegym.
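
For anyone who wants to run their own comparison, here is a minimal sketch of a timing harness. It is not part of cuTile Python or TileGym: the names `benchmark` and `compare` are hypothetical, and the workloads in the usage section are CPU placeholders rather than real kernels. For GPU kernels you would wrap each implementation (cuTile, Triton, ThunderKittens binding, PyTorch) in a zero-argument callable and pass a device synchronization function as `sync`, for example `torch.cuda.synchronize` if you use PyTorch tensors.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(fn: Callable[[], object], warmup: int = 3, repeats: int = 10,
              sync: Callable[[], None] = lambda: None) -> float:
    """Return the median wall-clock time of fn() in milliseconds.

    `sync` is called after each run so that asynchronous work (e.g. a GPU
    kernel launch) has finished before the timer stops. The default does
    nothing, which is only appropriate for synchronous CPU code.
    """
    # Warm-up runs absorb one-time costs such as JIT compilation or caching.
    for _ in range(warmup):
        fn()
    sync()

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        sync()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def compare(candidates: Dict[str, Callable[[], object]], **kwargs) -> Dict[str, float]:
    """Benchmark every named callable with the same settings."""
    return {name: benchmark(fn, **kwargs) for name, fn in candidates.items()}


if __name__ == "__main__":
    # Placeholder workloads; replace with calls into the kernels under test.
    results = compare({
        "impl_a": lambda: sum(range(10_000)),
        "impl_b": lambda: sum(list(range(10_000))),
    })
    for name, ms in sorted(results.items(), key=lambda kv: kv[1]):
        print(f"{name}: {ms:.3f} ms")
```

The median is used instead of the mean so that a single slow run does not skew the result. Make sure every implementation sees the same input shapes and dtypes, and that outputs are checked for numerical agreement before comparing timings.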