
[GPU][DT] Scheduling inner-loop ukernels

jtuyls opened this issue on Dec 08 '25 • 0 comments

Creating an issue to discuss and track the scheduling of 'inner-loop ukernels' to replace the existing MLIR ukernels in most workloads.

From @MaheshRavishankar on discord (https://discord.com/channels/689900678990135345/1254843174111678555/1443009208181063805):

> I wanted to round out the discussion and record the state here, to be picked up after the break (or at least my break 😛). We have been using the existing MLIR ukernels and modifying them for use with data tiling. That's a good approach for making something like llama 405b work well (and we initially did that for llama 8b as well, which was good for evaluation). It is probably not worth adding more ukernels to make llama 8b faster on different architectures, at least not the kind of ukernels we are using currently, which implement the ping-pong schedule.
>
> What we need to think about is a ukernel that is closer to the CPU ukernel usage, where the ukernel's role is to inject the instruction schedule of the "innermost loop". Effectively, data tiling gets us to a point where we have a fixed inner block, and we just schedule that block explicitly. This means the ukernels can pretty much be used by default, irrespective of shape etc. (it could even work for dynamic shapes). There are probably a few pieces to connect to make that work, but it may be better to do that than to keep taking the existing ping-pong ukernels for things like llama 8b and porting them over. We might still use that path for 405b (or similarly performance-sensitive workloads), but for the others we need to think of solutions that are more broadly applicable.
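To make the idea concrete, here is a minimal C sketch of the "inner-loop ukernel" shape of things, in the spirit of the existing CPU mmt4d ukernels. All names (`inner_tile_kernel`, `tiled_matmul`), the 8x8x4 tile shape, and the packed layouts are hypothetical and chosen only for illustration; they are not IREE APIs, and a real GPU ukernel would replace the inner loop body with an explicit instruction schedule (e.g., MFMA intrinsics) rather than plain C.

```c
// Hypothetical sketch: data tiling fixes the inner block shape
// (TILE_M x TILE_N x TILE_K), so only that fixed innermost loop is
// hand-scheduled. The outer loops stay generic, which is why the same
// ukernel applies regardless of the overall (even dynamic) problem shape.

#include <stddef.h>

#define TILE_M 8
#define TILE_N 8
#define TILE_K 4

// Fixed-shape inner tile: C_tile += A_tile * B_tile. In a real ukernel
// this body would be an explicit instruction schedule; plain C here.
static void inner_tile_kernel(float *restrict c,        // [TILE_M][TILE_N]
                              const float *restrict a,  // [TILE_M][TILE_K]
                              const float *restrict b)  // [TILE_K][TILE_N]
{
  for (int m = 0; m < TILE_M; ++m) {
    for (int n = 0; n < TILE_N; ++n) {
      float acc = c[m * TILE_N + n];
      for (int k = 0; k < TILE_K; ++k)
        acc += a[m * TILE_K + k] * b[k * TILE_N + n];
      c[m * TILE_N + n] = acc;
    }
  }
}

// Generic outer loops over a data-tiled matmul with m0 x n0 x k0 tiles,
// using packed layouts A:[m0][k0][TILE_M][TILE_K], B:[k0][n0][TILE_K][TILE_N],
// C:[m0][n0][TILE_M][TILE_N]. m0/n0/k0 may be runtime (dynamic) values;
// only the inner block is fixed, so one ukernel serves every problem size.
void tiled_matmul(float *c, const float *a, const float *b,
                  size_t m0, size_t n0, size_t k0) {
  for (size_t i = 0; i < m0; ++i)
    for (size_t j = 0; j < n0; ++j)
      for (size_t k = 0; k < k0; ++k)
        inner_tile_kernel(c + (i * n0 + j) * TILE_M * TILE_N,
                          a + (i * k0 + k) * TILE_M * TILE_K,
                          b + (k * n0 + j) * TILE_K * TILE_N);
}
```

The point of the sketch is that only `inner_tile_kernel` is shape-specialized: data tiling guarantees the packed inner-block layout, so the outer loops, and therefore the overall shape, static or dynamic, never affect whether the ukernel can be used.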

cc @Yu-Zhewen @Abhishek-Varma @hanhanW @Max191 @bjacob
