Hongtao Yu
> > > @ThomasRaoux It looks like the test failure is due to an unhandled case in wgmma to LLVM lowering, where the accumulator was zero. The zero accumulator was...
Thanks a lot for measuring perf for this change. At this point it appears that the heuristic needs more tweaks. > So one example should be easy to reproduce, for...
> > What is the rationale behind this? Is that because less-than-4-byte load per thread would result in a waste of a 128-byte memory transaction per warp? > > yes...
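The rationale above can be sketched with some back-of-the-envelope arithmetic. This is an illustrative model, not code from the patch: it assumes fully coalesced, aligned accesses, a 32-thread warp, and a 128-byte transaction granularity, and shows why per-thread loads narrower than 4 bytes leave part of each 128-byte transaction unused.

```python
# Sketch: fraction of a 128-byte memory transaction actually consumed
# by one warp, for different per-thread load widths. Assumes fully
# coalesced, aligned accesses; numbers are illustrative, not measured.

WARP_SIZE = 32           # threads per warp
TRANSACTION_BYTES = 128  # typical global-memory transaction granularity

def transaction_utilization(bytes_per_thread: int) -> float:
    """Useful bytes divided by total bytes moved for one warp-wide load."""
    useful = WARP_SIZE * bytes_per_thread
    # A warp triggers at least one full transaction even if it only
    # touches part of it; wider loads may need several transactions.
    transactions = max(1, useful // TRANSACTION_BYTES)
    return useful / (transactions * TRANSACTION_BYTES)

for width in (1, 2, 4, 8):
    print(f"{width} B/thread -> {transaction_utilization(width):.0%} utilized")
```

With 1- or 2-byte loads per thread the warp uses only 25% or 50% of the 128-byte transaction, while 4 bytes and above fully utilize every transaction, which is consistent with the heuristic's preference for at least 4-byte loads per thread.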
> I scan quickly through the code, could you add a high level comment explaining the heuristic as I'm not sure I fully understand from the code. Thanks for taking...
Summary updated to reflect the new heuristic.
> @htyu @ThomasRaoux is the optimization still on the menu? I am just learning how to auto-coalesce global accesses to SMEM (to make sure data is loaded and stored contiguously). >...
> This tanks some of our kernel's perf somehow :(. Did you run it on the pytorch benchmarks? Are there no regressions on the workloads you have? If not I'll have...
> > > This tanks some of our kernel's perf somehow :(. Did you run it on pytorch benchmarks? Is there no regressions on the workloads you have? If not...
@ThomasRaoux is there a way to share the benchmarks that regressed with this patch so I can investigate? We believe this could be a general fix and we would like to...
> I do think it is good patch in general but it would be great if you could fix the regression in torch benchmarks, then I can check if it...