[DOC]: Could you add a speed comparison with Triton/ThunderKittens?

Open EricLina opened this issue 2 weeks ago • 1 comment

How would you describe the priority of this documentation request?

Critical (currently preventing usage)

Please provide a link or source to the relevant docs

https://docs.nvidia.com/cuda/cutile-python/performance.html

Describe the problems in the documentation

Thanks for your awesome work.

I believe this official repo would bring significant efficiency gains over a naive PyTorch implementation, but I recommend adding a speed benchmark against other languages such as Triton and ThunderKittens. This would be very helpful for users.

(Optional) Propose a correction

No response

Contributing Guidelines

[x] I agree to follow cuTile Python's contributing guidelines
[x] I have searched the open documentation and have found no duplicates for this documentation request

Dec 06 '25 13:12 EricLina

@EricLina thank you for your interest. We don't have plans to publish any benchmarks, but we have some samples to help you get started on running your own comparison. The focus of this project is building a good programming language for targeting CUDA TileIR, with a few kernel examples as reference. The TileGym project provides more kernel and end-to-end model examples: https://github.com/nvidia/tilegym.

Dec 06 '25 16:12 haijieg
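
For anyone who wants to run their own comparison as suggested above, here is a minimal timing-harness sketch. It is not taken from cuTile Python or TileGym; the names `benchmark`, `compare`, and the `sync` parameter are hypothetical. The only assumption is that each implementation (cuTile, Triton, ThunderKittens, PyTorch) can be wrapped in a zero-argument callable. GPU kernels usually launch asynchronously, so you would pass a device synchronization function as `sync` (for PyTorch-based code, `torch.cuda.synchronize` is the usual choice). The demo at the bottom uses plain CPU functions only so that it runs anywhere.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(fn: Callable[[], object], warmup: int = 3, iters: int = 20,
              sync: Callable[[], None] = lambda: None) -> float:
    """Return the median wall-clock time of fn() in milliseconds.

    `sync` is a hypothetical hook: pass a device-synchronization callable
    when fn launches asynchronous GPU work; the default does nothing.
    """
    # Warm-up runs absorb one-time costs such as JIT compilation and caching.
    for _ in range(warmup):
        fn()
    sync()

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        sync()  # wait for pending work before stopping the clock
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def compare(candidates: Dict[str, Callable[[], object]],
            **kwargs) -> Dict[str, float]:
    """Benchmark every named candidate with the same settings."""
    return {name: benchmark(fn, **kwargs) for name, fn in candidates.items()}


if __name__ == "__main__":
    # Stand-in CPU workloads; replace with your own kernel launchers.
    data = list(range(100_000))
    results = compare({
        "builtin_sum": lambda: sum(data),
        "generator_sum": lambda: sum(x for x in data),
    })
    for name, ms in sorted(results.items(), key=lambda kv: kv[1]):
        print(f"{name}: {ms:.3f} ms")
```

Reporting the median rather than the mean makes the result less sensitive to occasional slow iterations. For a fair comparison, every candidate should run on the same input shapes and data types, and its output should be checked against a reference before timing.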