tilelang Feedback Thread

Please leave comments here about your usage of TileLang. Do you like it? Do you dislike it? Which feature do you need in order to adopt it?

For in-depth discussion, please feel free to join our Discord or WeChat.

Feb 25 '25 07:02 xiayuqing0622

Any new WeChat QR code?

Mar 05 '25 02:03 SingularityKChen

CC @SingularityKChen

Mar 05 '25 03:03 LeiWang1999

@LeiWang1999 When I scan the QR code on my end, it shows as expired.

Mar 05 '25 04:03 LegendBC

@LegendBC CC the latest QR code; it's expected to expire before Mar 12th.

Mar 05 '25 12:03 LeiWang1999

Thanks a lot!

Mar 05 '25 13:03 LegendBC

@LeiWang1999

Thanks for the great library!

Looking to contribute and would like to understand the library at a deeper level:

Do you have any examples of how an optimization pass can be implemented?
- E.g., async pipelining of MMA on Hopper or ping-pong kernel design as is implemented in Cutlass or FA3?
- At what level of IR would one do this and what APIs are exposed for such purposes?
- Are there minimal examples that I could study?
How much does TileLang leverage TVM for optimization passes? Do you have any recommendations for tutorials or resources for getting up to speed on TVM? (I've seen the MLC machine learning compilation course.)

Thanks!

Mar 06 '25 00:03 jeromeku

Thanks for your attention, @jeromeku

TileLang Pass shares the same infrastructure and AST as Apache TVM, so you can write a pass just as you would in TVM.
You can find all TileLang-specific pass implementations here. This includes passes you might be interested in, such as lower_hopper_intrin.cc and inject_pipeline.cc.
Passes are applied using Tensor IR, just like in TVM.
If you’re looking to write a pass, we have test cases for all TileLang passes available here.
TileLang utilizes TVM’s core AST and pass infrastructure, along with its powerful affine and arithmetic analysis tools. However, it does not use TVM’s end-to-end graph optimizations or auto-tuning; only a small subset of TVM is utilized.
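
For readers new to compiler passes, here is a rough sketch of the general idea in plain Python. It is not the TVM or TileLang API. A pass is a function that takes an IR tree and returns a rewritten tree, typically by visiting children first (post-order). The node classes `Const`, `Var`, `Mul` and the function `simplify_mul` are hypothetical stand-ins invented for this illustration; real TVM passes operate on Tensor IR nodes through TVM's own visitor and mutator utilities.

```python
# Toy IR and pass for illustration only; TVM's real node classes differ.
from dataclasses import dataclass


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Mul:
    lhs: object
    rhs: object


def simplify_mul(node):
    """Post-order rewrite: fold const * const and drop multiplications by 1."""
    if not isinstance(node, Mul):
        return node  # leaves are returned unchanged
    # Rewrite children first, then decide what to do with this node.
    lhs, rhs = simplify_mul(node.lhs), simplify_mul(node.rhs)
    if isinstance(lhs, Const) and isinstance(rhs, Const):
        return Const(lhs.value * rhs.value)
    if rhs == Const(1):
        return lhs
    if lhs == Const(1):
        return rhs
    return Mul(lhs, rhs)


# (i * 1) * (2 * 4) is rewritten to i * 8; the input tree is left untouched.
before = Mul(Mul(Var("i"), Const(1)), Mul(Const(2), Const(4)))
after = simplify_mul(before)
```

The nodes are immutable, so the pass builds a new tree rather than editing in place. This matches the functional style that TVM passes generally follow.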

Mar 06 '25 05:03 LeiWang1999

Thanks for the great project! As a beginner in kernel programming, I chose TileLang over Triton as my first kernel language (apart from CUDA C++) and found it much more intuitive and straightforward than Triton. From a user's perspective, however, I found some parts of the documentation a bit unclear and would appreciate clarification on the following points:

Could you clarify the storage location (register / local / shared / global) for alloc_var? Is this a low-level concept adapted from TVM? (I barely know anything about TVM, so… ;) )
What is the difference between alloc_fragment and alloc_reducer? Is the latter just syntactic sugar for the former?
How are the stages in T.Pipelined automatically partitioned and how could I decide the optimal value of num_stages?
- (For now I've been just repeatedly increasing num_stages by one until it fails to compile 😅)
Is T.Pipelined(num_stages=0) equivalent to T.serial?
Are there explicit rules for nesting different for-in loops?
- For example, it's clear that T.gemm cannot be used inside T.Parallel and T.vectorized.
It took me a while to realize that expressions in TileLang are name-bound rather than immediately evaluated. Might it be worth clarifying this in the documentation?
- For example, cond = array[0, 0] > 0 doesn't immediately store a boolean value in the cond variable. Instead, it binds the expression to the name cond. If the value of array[0, 0] is changed later, the value of cond changes accordingly.
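
The behavior described above can be mimicked with a small plain-Python model. This is only a sketch of the general "build an expression tree, evaluate later" technique, not TileLang's implementation. The classes `Expr`, `Load`, and `Greater` are hypothetical names made up for the illustration.

```python
# Toy model of name-bound (deferred) expressions; not TileLang internals.


class Expr:
    """Base class: a comparison builds a new node instead of returning a bool."""

    def __gt__(self, other):
        return Greater(self, other)


class Load(Expr):
    """Deferred read of buffer[index]; the value is fetched only in evaluate()."""

    def __init__(self, buffer, index):
        self.buffer, self.index = buffer, index

    def evaluate(self):
        return self.buffer[self.index]


class Greater(Expr):
    """Deferred comparison lhs > rhs, where rhs may be a plain Python value."""

    def __init__(self, lhs, rhs):
        self.lhs, self.rhs = lhs, rhs

    def evaluate(self):
        rhs = self.rhs.evaluate() if isinstance(self.rhs, Expr) else self.rhs
        return self.lhs.evaluate() > rhs


array = {(0, 0): 5}
cond = Load(array, (0, 0)) > 0  # binds an expression tree; no bool is stored
first = cond.evaluate()         # reads array[0, 0] now: 5 > 0
array[(0, 0)] = -1
second = cond.evaluate()        # re-reads the buffer: -1 > 0
```

In this model `first` is `True` and `second` is `False`, because `cond` names the expression rather than a snapshot of its value. To freeze the value at a given point, the loaded element would presumably have to be copied into its own storage before the buffer is modified.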

Oct 03 '25 09:10 w568w
