AMDMIGraphX Improvements to Quick Tuning

When benchmarking kernels during quick tuning (and exhaustive tuning as well), the algorithm takes the average of 10 runs for each candidate kernel config and compares it against the other configs. The winning kernel config is the one with the lowest average time.

The 10 individual runs are not recorded. The goal here is to capture all the timed runs and print the min, max, and median.
Lastly, add the ability to switch the selection criterion from average to min or median.
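
To make the request concrete, here is a minimal sketch of a selectable criterion. It is not MIGraphX code: `REDUCERS`, `pick_winner`, the config names, and the sample values are all hypothetical, and it assumes the per-run times are available as a list for each config.

```python
import statistics

# Hypothetical reducers; "mean" mirrors the current behaviour described above.
REDUCERS = {
    "mean": statistics.mean,
    "min": min,
    "median": statistics.median,
}

def pick_winner(timings_ms, criterion="mean"):
    """Return the config whose samples score lowest under `criterion`.

    timings_ms maps a config name to its list of per-run times in ms.
    """
    reduce = REDUCERS[criterion]
    return min(timings_ms, key=lambda cfg: reduce(timings_ms[cfg]))

# Made-up samples (not measured data): config "A" has one slow outlier run.
samples = {
    "A": [2.0, 2.1, 2.0, 2.2, 2.1, 2.0, 2.1, 2.0, 2.1, 23.9],
    "B": [2.6, 2.7, 2.6, 2.7, 2.6, 2.7, 2.6, 2.7, 2.6, 2.7],
}
print(pick_winner(samples, "mean"))    # B (the outlier drags A's mean to 4.25)
print(pick_winner(samples, "median"))  # A (median 2.1 vs 2.65)
```

With these made-up numbers a single outlier is enough to flip the winner under the mean, which is the kind of inconsistency the min and median options are meant to expose.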

As an example, here is what we see today with the following settings:

MIGRAPHX_TRACE_BENCHMARKING=2 MIGRAPHX_TRACE_MLIR=2

Problem: gfx1150 12 -t f16 -out_datatype f16 -transA false -transB true -g 1 -m 1 -n 4096 -k 4096
Benchmarking solution: v2:16,256,4,16,64,4,1,1,1 => ((16*256) / (16*64)) * 32 = 128 2.6971ms

What we would like to see:

Problem: gfx1150 12 -t f16 -out_datatype f16 -transA false -transB true -g 1 -m 1 -n 4096 -k 4096
Benchmarking solution: v2:16,256,4,16,64,4,1,1,1 => ((16*256) / (16*64)) * 32 = 128 2.6971ms, 2.0 min, 23.9 max, 2.50 med
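
The extra fields in the requested output could be produced along these lines. This is only a sketch: `format_timing` is a hypothetical helper, the field widths simply copy the example above, and it assumes the per-run times in milliseconds have been collected into a list.

```python
import statistics

def format_timing(samples_ms):
    """Build the trailing timing fields for one benchmarked solution."""
    avg = statistics.mean(samples_ms)
    med = statistics.median(samples_ms)
    return (f"{avg:.4f}ms, {min(samples_ms):.1f} min, "
            f"{max(samples_ms):.1f} max, {med:.2f} med")

# Made-up per-run times in milliseconds, not measured data.
print(format_timing([2.0, 2.5, 2.5, 3.0]))
# 2.5000ms, 2.0 min, 3.0 max, 2.50 med
```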

Dec 10 '24 18:12 causten

> The complete 10 runs are not recorded. The goal here is to capture all the times runs and print the min, max, and median. Lastly add a capability to change the picking algorithm from Average to .... Min, or Median

This is not possible. We deliberately don't time each run individually, because we want to minimize launch overhead while benchmarking so that the measurement gets closer to the actual device time.
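
The trade-off described in this comment can be illustrated with a host-side analogy. The sketch below is not how MIGraphX times kernels: `time_batch`, `time_each`, and `work` are hypothetical, and `work` is a CPU stand-in for a kernel launch. It only shows that one timer around the whole batch yields a single average, while per-run timers yield a distribution at the cost of extra measurement work around every call.

```python
import time

def time_batch(fn, n=10):
    """One timer around all n calls: lowest overhead, but only the average survives."""
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - start) * 1e3 / n  # average ms per call

def time_each(fn, n=10):
    """One timer per call: n samples, each including its own timer overhead."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)  # ms
    return samples

def work():
    """Stand-in workload; a real benchmark would launch a kernel here."""
    sum(range(10_000))

avg_ms = time_batch(work)
per_run_ms = time_each(work)
print(f"batch avg: {avg_ms:.4f} ms")
print(f"per-run min/max: {min(per_run_ms):.4f} / {max(per_run_ms):.4f} ms")
```

Only the second form can report min, max, and median, which is why the request conflicts with the current low-overhead approach.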

Dec 10 '24 20:12 pfultz2

Time to start. We are still seeing too many inconsistencies in the driver results, with no answer as to why. This will help.

Dec 11 '24 01:12 causten