Hongtao Yu
Thanks for working on this. Can I ask where you got the new configs and how they move the perf numbers? BTW, there are already-tuned numbers in the Triton...
> > Thanks for working on this.
> >
> > Can I ask where you got the new configs and how they move the perf numbers?
> >
> > BTW, there...
> Less than double what? Can we get a perf run of this before landing? The mm template is only used in a few OSS models, so I'd expect...
I'm sending out this diff to get early feedback. Regarding the performance testing, I'm still looking for memory-bound kernels with heavy computations. Please share if you have such kernels. The...
> `if loop_annotation || (matmul_loop && global_num_stage > 1)`

Sounds good to check against the annotation. How do you think we should handle matmul loops with extra loads? E.g., one...
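To restate the quoted condition as a small Python-style sketch (the names just mirror the pseudocode above, not the actual pass variables):

```python
# A minimal sketch of the quoted condition; names mirror the pseudocode above
# (loop_annotation, matmul_loop, global_num_stage), not the actual pass code.
def should_pipeline(loop_annotation: bool, matmul_loop: bool,
                    global_num_stage: int) -> bool:
    # An explicit per-loop annotation opts the loop in unconditionally;
    # otherwise fall back to the existing matmul heuristic gated on the
    # global num-stages setting.
    return loop_annotation or (matmul_loop and global_num_stage > 1)
```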
> > A load of an indexing tensor which is in turn used to load one dot operand is pipelined.
> >
> > Hey @htyu, I am finishing up PR...
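For context, here is a rough Triton sketch of the quoted pattern (kernel name, pointers, and strides are made up, and masking is omitted by assuming block-divisible shapes): the K-loop first loads an index tensor and then gathers one dot operand through it, so the index load is the extra load that would need to be pipelined along with the operand load.

```python
import triton
import triton.language as tl

@triton.jit
def gather_matmul_kernel(a_ptr, b_ptr, idx_ptr, c_ptr,
                         M, N, K,
                         stride_am, stride_ak,
                         stride_bk, stride_bn,
                         BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                         BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        offs_k = k + tl.arange(0, BLOCK_K)
        # Extra load: an indexing tensor loaded inside the matmul loop.
        rows = tl.load(idx_ptr + offs_k)
        # The A operand is a plain strided load.
        a = tl.load(a_ptr + offs_m[:, None] * stride_am
                    + offs_k[None, :] * stride_ak)
        # The B operand is gathered through the just-loaded indices, so
        # pipelining B also requires pipelining the index load above.
        b = tl.load(b_ptr + rows[:, None] * stride_bk
                    + offs_n[None, :] * stride_bn)
        acc += tl.dot(a, b)
    c_ptrs = c_ptr + offs_m[:, None] * N + offs_n[None, :]
    tl.store(c_ptrs, acc)
```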
> > I think we can support this feature; I was asking as the PR is out of date right now, but it is fine to me if we want...
I will rebase. The test case (in test/TritonGPU/loop-pipeline.mlir) has been updated with that loop annotation. Let me know if that looks good. Thanks.
Rebasing done.
> One final comment is that maybe we want to lift the logic out of `MatmulLoopPipeline` later since it's not "matmul" anymore?

It's a good point. Or maybe we could...