Hongtao Yu

Results 53 comments of Hongtao Yu

> > > @ThomasRaoux It looks like the test failure is due to an unhandled case in wgmma to LLVM lowering, where the accumulator was zero. The zero accumulator was...

Thanks a lot for measuring perf for this change. At this point it appears that the heuristic needs more tweaking.

> So one example should be easy to reproduce, for...

> > What is the rationale behind this? Is that because less-than-4-byte load per thread would result in a waste of a 128-byte memory transaction per warp? > > yes...
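The transaction math behind that answer can be sketched numerically. The snippet below is illustrative only (plain Python, not Triton code), and assumes the common NVIDIA configuration of 32 threads per warp and a 128-byte coalesced memory transaction; exact behavior varies by architecture.

```python
# Why per-thread loads narrower than 4 bytes waste memory bandwidth:
# a warp of 32 threads each loading 4 contiguous bytes exactly fills
# one 128-byte transaction; narrower loads still pay for the full
# transaction but consume only part of it.

WARP_SIZE = 32           # threads per warp (assumed)
TRANSACTION_BYTES = 128  # bytes per coalesced memory transaction (assumed)

def transaction_utilization(bytes_per_thread: int) -> float:
    """Fraction of one 128-byte transaction actually consumed when
    every thread in a warp loads `bytes_per_thread` contiguous bytes."""
    useful = WARP_SIZE * bytes_per_thread
    # A single transaction can deliver at most TRANSACTION_BYTES.
    return min(useful, TRANSACTION_BYTES) / TRANSACTION_BYTES

# 4-byte loads fill the transaction exactly: 32 * 4 = 128 bytes.
assert transaction_utilization(4) == 1.0
# 1-byte loads use only 32 of the 128 bytes fetched: 25% utilization.
assert transaction_utilization(1) == 0.25
```

So with 2-byte or 1-byte loads per thread, half or three quarters of every fetched transaction is discarded, which is the waste the comment refers to.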

> I scanned quickly through the code; could you add a high-level comment explaining the heuristic, as I'm not sure I fully understand it from the code.

Thanks for taking...

> @htyu @ThomasRaoux is the optimization still on the menu? I am just learning how to auto-coalesce global accesses to SMEM (to make sure data is loaded and stored contiguously). >...

> This tanks some of our kernel's perf somehow :(. Did you run it on pytorch benchmarks? Are there no regressions on the workloads you have? If not I'll have...

> > > This tanks some of our kernel's perf somehow :(. Did you run it on pytorch benchmarks? Are there no regressions on the workloads you have? If not...

@ThomasRaoux is there a way to share the benchmarks regressed by this patch so I can investigate? We believe this could be a general fix and we would like to...

> I do think it is a good patch in general, but it would be great if you could fix the regression in the torch benchmarks; then I can check if it...