Nicolas Macchioni
Differential Revision: D57371634. We can save a significant amount of benchmarking time in max-autotune-gemm mode if we group the benchmarking of Triton templates and back out early from templates that don't...
Add an option to switch the Triton hash key to a more verbose output that can help with performance debugging; the hash key now includes Triton template configs like BLOCK_M, BLOCK_N,...
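To illustrate the idea of a verbose key, here is a minimal sketch, not Inductor's actual implementation. The function `template_key`, its `verbose` flag, and the key layout are all hypothetical; only the config names `BLOCK_M` and `BLOCK_N` come from the description above.

```python
import hashlib


def template_key(name, config, verbose=False):
    """Build a cache key for a template config (hypothetical helper).

    With verbose=False the key is a short opaque digest; with verbose=True
    the config values are spelled out so they can be read in logs.
    """
    # Sort so the key does not depend on dict insertion order.
    items = sorted(config.items())
    if verbose:
        return name + "(" + ", ".join(f"{k}={v}" for k, v in items) + ")"
    digest = hashlib.sha256(repr((name, items)).encode()).hexdigest()
    return f"{name}_{digest[:16]}"


print(template_key("mm", {"BLOCK_M": 128, "BLOCK_N": 64}, verbose=True))
```

The verbose form trades a longer key for the ability to see which block sizes a given benchmark entry refers to without a separate lookup.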
I'm currently working on reducing Inductor's compile time overhead in max-autotune-gemm mode. As part of this effort, I profiled some individual matmul autotunings and noticed that `do_bench` was particularly expensive....
Copy/pasted the estimation loop from `do_bench` into `do_bench_cudagraph` in place of the original create graph -> measure replay methodology. Creating a graph is expensive (~300ms on A100 for me), even...
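For readers unfamiliar with the estimation loop mentioned above, the following is a rough sketch of the general pattern: run the function a few times, estimate the per-call cost, and derive warmup and repeat counts from time budgets. It uses a wall-clock timer instead of CUDA events, and the function name `estimate_iterations`, its parameters, and the default budgets are assumptions for illustration rather than the code from this change.

```python
import time


def estimate_iterations(fn, warmup_ms=25.0, rep_ms=100.0, n_estimate=5,
                        timer=time.perf_counter):
    """Estimate per-call runtime, then size the warmup and repeat loops.

    `timer` is injectable (an assumption of this sketch) so the logic can be
    exercised without real timing noise.
    """
    start = timer()
    for _ in range(n_estimate):
        fn()
    elapsed_ms = (timer() - start) * 1000.0
    # Guard against a zero estimate for extremely cheap functions.
    estimate_ms = max(elapsed_ms / n_estimate, 1e-6)
    # Always run at least once, even if one call exceeds the budget.
    n_warmup = max(1, int(warmup_ms / estimate_ms))
    n_repeat = max(1, int(rep_ms / estimate_ms))
    return estimate_ms, n_warmup, n_repeat


est, n_warmup, n_repeat = estimate_iterations(lambda: sum(range(1000)))
```

The point of the pattern is that the cheap estimate decides how many iterations fit in the budget before any expensive setup is performed.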
Summary: the `should_pad_common` and `should_pad_bench` logic was semi-intertwined, which made working with the padding logic difficult. Previously, there was no clear delineation as to what logic belonged in which of...