Nicolas Macchioni issues

Results: 6 issues of Nicolas Macchioni

[pt2] grouped triton benchmarking in max-autotune-gemm mode

Differential Revision: D57371634. We can save a significant amount of benchmarking time in max-autotune-gemm mode if we group the benchmarking of Triton templates and back out early from templates that don't...

fb-exported

module: inductor

ciflow/inductor

verbose cache entries for gemm tunings

Add an option to switch the Triton hash key to a more verbose output that can help with performance debugging; the hash key now includes Triton template configs like BLOCK_M, BLOCK_N,...

module: inductor

ciflow/inductor
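
To illustrate the difference between an opaque hash key and a verbose one as described in the entry above, here is a minimal sketch. It is not Inductor's actual cache code: `make_cache_key`, the `verbose` flag, the template name `"mm"`, and the `num_warps` field with its value are hypothetical choices for illustration; only `BLOCK_M` and `BLOCK_N` come from the description.

```python
import hashlib
import json


def make_cache_key(template_name, config, verbose=False):
    """Build a cache key for a tuning entry (hypothetical helper).

    verbose=False: opaque SHA-256 digest of the serialized config.
    verbose=True: human-readable key that spells out each config field.
    """
    if verbose:
        # Sort field names so the key does not depend on dict ordering.
        fields = ",".join(f"{k}={config[k]}" for k in sorted(config))
        return f"{template_name}[{fields}]"
    # Sort keys so the same config always serializes identically.
    payload = json.dumps({"template": template_name, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


config = {"BLOCK_M": 64, "BLOCK_N": 128, "num_warps": 4}  # illustrative values
print(make_cache_key("mm", config, verbose=True))
print(make_cache_key("mm", config))
```

The verbose form trades compactness for readability: a cache entry can be matched to its tile sizes by eye, which is the debugging benefit the description points at.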

Flush Exact L2 Cache Size in Benchmarking

I'm currently working on reducing Inductor's compile time overhead in max-autotune-gemm mode. As part of this effort, I profiled some individual matmul autotunings and noticed that `do_bench` was particularly expensive....

smart flushing

Fixes #ISSUE_NUMBER cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang

module: inductor

ciflow/inductor

[do_bench_cudagraph] estimate runtime without creating a graph

Copy/pasted the estimation loop from `do_bench` into `do_bench_cudagraph`, replacing the original create graph -> measure replay methodology. Creating a graph is expensive (~300ms on A100 for me), even...
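
As a rough illustration of what an estimation loop of this kind does, the sketch below times a few direct calls and derives a repeat count from a time budget. It is a device-free stand-in, not the code from `do_bench` or `do_bench_cudagraph`: it uses a wall-clock timer where a GPU benchmark would use device-side timing with synchronization, and `estimate_repeats`, `rep_ms`, and `n_estimate` are hypothetical names.

```python
import time


def estimate_repeats(fn, rep_ms=20.0, n_estimate=5):
    """Estimate fn's runtime from a few direct calls, then derive a repeat count.

    Hypothetical helper: uses a wall clock, so it only approximates what a
    GPU benchmark would measure with device-side events.
    """
    start = time.perf_counter()
    for _ in range(n_estimate):
        fn()
    elapsed_ms = (time.perf_counter() - start) * 1e3
    estimate_ms = elapsed_ms / n_estimate
    # Guard against a zero estimate for extremely fast functions.
    n_repeat = max(1, int(rep_ms / max(estimate_ms, 1e-6)))
    return estimate_ms, n_repeat


estimate_ms, n_repeat = estimate_repeats(lambda: sum(range(10_000)))
print(f"~{estimate_ms:.4f} ms per call, {n_repeat} repeats fit in the time budget")
```

The point of estimating this way is that the measurement needs only a handful of plain calls, so the cost of building a graph is not paid just to size the benchmark.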

[inductor][fx] clarify padding logic

Summary: the `should_pad_common` and `should_pad_bench` logic were semi-intertwined, which can make working with the padding logic difficult. Previously there was no clear delineation as to what logic belonged in which of...

fb-exported

ciflow/trunk

topic: not user facing

module: inductor

ciflow/inductor

meta-exported