tilelang
Feedback Thread
Please leave comments here about your usage of TileLang. Do you like it? Do you dislike it? Which feature do you need in order to adopt it?
For in-depth discussion, please feel free to join our Discord or WeChat.
Any new WeChat QR code?
CC @SingularityKChen
@LeiWang1999 When I scan the QR code on my end, it shows as expired.
@LegendBC CC the latest QR code, it's expected to expire before Mat 12th.
thanks a lot!
@LeiWang1999
Thanks for the great library!
Looking to contribute and would like to understand the library at a deeper level:
- Do you have any examples of how an optimization pass can be implemented?
- E.g., async pipelining of MMA on Hopper or a ping-pong kernel design as implemented in CUTLASS or FA3?
- At what level of IR would one do this and what APIs are exposed for such purposes?
- Are there minimal examples that I could study?
- How much does `tilelang` leverage TVM for optimization passes? Do you have any recommendations on tutorials or resources for getting up to speed on TVM? (I've seen the MLC machine learning compilation course.)
Thanks!
Thanks for your attention, @jeromeku
- TileLang passes share the same infrastructure and AST as Apache TVM, so you can write a pass just as you would in TVM.
- You can find all TileLang-specific pass implementations here. This includes passes you might be interested in, such as lower_hopper_intrin.cc and inject_pipeline.cc.
- Passes operate on TensorIR, just like in TVM.
- If you’re looking to write a pass, we have test cases for all TileLang passes available here.
- TileLang uses TVM's core AST and pass infrastructure, along with its powerful affine and arithmetic analysis tools. However, it does not use TVM's end-to-end graph optimizations or auto-tuning; only a small subset of TVM is used.
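To make the idea of a pass concrete, here is a minimal sketch in plain Python. It does not use TVM or TileLang: the `Const`, `Var`, `Add`, and `fold_constants` names are hypothetical stand-ins for an AST and a rewrite over it. A real pass would work on TVM's TensorIR nodes instead, but the overall shape is similar: a function that takes IR and returns rewritten IR.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical toy IR for illustration only; these are not TVM's node classes.
@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Add:
    lhs: "Expr"
    rhs: "Expr"


Expr = Union[Const, Var, Add]


def fold_constants(expr: Expr) -> Expr:
    """Toy pass: rewrite Add(Const, Const) bottom-up into a single Const."""
    if isinstance(expr, Add):
        # Rewrite the children first, then try to simplify this node.
        lhs = fold_constants(expr.lhs)
        rhs = fold_constants(expr.rhs)
        if isinstance(lhs, Const) and isinstance(rhs, Const):
            return Const(lhs.value + rhs.value)
        return Add(lhs, rhs)
    return expr  # leaves (Const, Var) are returned unchanged


# i + (2 + 3) is rewritten to i + 5; the variable is left alone.
folded = fold_constants(Add(Var("i"), Add(Const(2), Const(3))))
print(folded)
```

The test cases mentioned above follow a similar pattern in spirit: build a small input, run one pass, and compare the result against the expected IR.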
Thanks for the great project! As a beginner in kernel programming, I've chosen TileLang over Triton as my first kernel language (apart from CUDA C++) to learn, and I found it much more intuitive and straightforward than Triton. From a user's perspective, however, I found some parts of the documentation a bit unclear and would appreciate clarification on the following points:
- Could you clarify the storage location (register / local / shared / global) for `alloc_var`? Is this a low-level concept adapted from TVM? (I barely know anything about TVM, so… ;) )
- What is the difference between `alloc_fragment` and `alloc_reducer`? Is the latter just syntactic sugar for the former?
- How are the stages in `T.Pipelined` automatically partitioned and how could I decide the optimal value of `num_stages`?
  - (For now I've been just repeatedly increasing `num_stages` by one until it fails to compile 😅)
- Is `T.Pipelined(num_stages=0)` equivalent to `T.serial`?
- Are there explicit rules for nesting different `for in` loops?
  - For example, it's clear that `T.gemm` cannot be used inside `T.Parallel` and `T.vectorized`.
- It took me a while to realize that expressions in TileLang are name-bound rather than immediately evaluated. Might it be worth clarifying in the documentation?
  - For example, `cond = array[0, 0] > 0` doesn't immediately store a boolean value in the `cond` variable. Instead, it binds the expression to the name `cond`. If the value of `array[0, 0]` is changed later, the value of `cond` changes accordingly.
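The name-binding behaviour described in the last point can be mimicked in plain Python. This is a toy model only, not how TileLang is implemented: the `Load` and `Greater` classes are hypothetical, and evaluation is triggered here by an explicit `evaluate()` call, which is an assumption made for the sake of the illustration.

```python
class Load:
    """Hypothetical deferred read of buffer[index] (not a TileLang class)."""

    def __init__(self, buffer, index):
        self.buffer = buffer
        self.index = index

    def evaluate(self):
        # The buffer is read only when the expression is evaluated.
        return self.buffer[self.index]

    def __gt__(self, other):
        # Comparing builds an expression node instead of a boolean.
        return Greater(self, other)


class Greater:
    """Hypothetical expression node for `lhs > rhs`."""

    def __init__(self, lhs, rhs):
        self.lhs = lhs
        self.rhs = rhs

    def evaluate(self):
        return self.lhs.evaluate() > self.rhs


array = {(0, 0): 5}
cond = Load(array, (0, 0)) > 0  # binds an expression; no boolean is stored

before = cond.evaluate()  # array[0, 0] is 5 at this point
array[(0, 0)] = -1        # mutate the underlying buffer
after = cond.evaluate()   # the same name now yields a different result

print(before, after)
```

In this model, getting a snapshot of the value would require storing the evaluated result somewhere explicitly rather than keeping the expression under a name.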