Improvements to Quick Tuning
When benchmarking kernels during quick tuning (and exhaustive tuning as well), the algorithm takes the average of 10 runs for each kernel config tried and compares it against the other configs. The winning kernel config is the one with the lowest average time.
The individual times of the 10 runs are not recorded. The goal here is to capture all the timed runs and print the min, max, and median.
Lastly, add the ability to change the picking algorithm from average to min or median.
As an example, this is what we see today:
MIGRAPHX_TRACE_BENCHMARKING=2 MIGRAPHX_TRACE_MLIR=2
Problem: gfx1150 12 -t f16 -out_datatype f16 -transA false -transB true -g 1 -m 1 -n 4096 -k 4096
Benchmarking solution: v2:16,256,4,16,64,4,1,1,1 => ((16*256) / (16*64)) * 32 = 128 2.6971ms
What we would like to see:
Problem: gfx1150 12 -t f16 -out_datatype f16 -transA false -transB true -g 1 -m 1 -n 4096 -k 4096
Benchmarking solution: v2:16,256,4,16,64,4,1,1,1 => ((16*256) / (16*64)) * 32 = 128 2.6971ms, 2.0 min, 23.9 max, 2.50 med
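For illustration only, the requested summary and a selectable picking rule could be sketched roughly as below. This is a Python sketch, not MIGraphX code: it assumes per-run times are available as a list of milliseconds, and the names `summarize`, `pick_winner`, `format_line`, and the rule strings are hypothetical.

```python
from statistics import mean, median

# Hypothetical picking rules; these names are illustrative, not MIGraphX options.
REDUCERS = {"average": mean, "min": min, "median": median}


def summarize(times_ms):
    """Return average, min, max, and median of the per-run times (milliseconds)."""
    return {
        "average": mean(times_ms),
        "min": min(times_ms),
        "max": max(times_ms),
        "median": median(times_ms),
    }


def pick_winner(results, rule="average"):
    """Pick the config with the lowest score under the chosen rule.

    `results` maps a config string to its list of per-run times.
    """
    reduce = REDUCERS[rule]
    return min(results, key=lambda cfg: reduce(results[cfg]))


def format_line(config, times_ms):
    """Format one line in the shape requested above: average, then min, max, med."""
    s = summarize(times_ms)
    return (
        f"{config} {s['average']:.4f}ms, {s['min']:.1f} min, "
        f"{s['max']:.1f} max, {s['median']:.2f} med"
    )
```

With a single outlier among the runs, the average and the median can rank two configs differently, which is the case a selectable rule is meant to cover.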
The individual times of the 10 runs are not recorded. The goal here is to capture all the timed runs and print the min, max, and median. Lastly, add the ability to change the picking algorithm from average to min or median.
This is not possible. We deliberately don't time each run, because we want to minimize the launch overhead while benchmarking so that the measurement gets closer to the actual device time.
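One way to picture the trade-off described here is the sketch below. It is not MIGraphX's benchmarking code. It assumes a hypothetical `launch` callable that enqueues one kernel run and a hypothetical `sync` callable that waits for the device to finish.

```python
import time


def time_batch(launch, sync, runs=10):
    """Time the whole batch with one pair of timestamps; return the average per run."""
    start = time.perf_counter()
    for _ in range(runs):
        launch()
    sync()  # wait once, after all launches have been queued
    return (time.perf_counter() - start) / runs


def time_each(launch, sync, runs=10):
    """Time every run separately; each sample also includes its own sync overhead."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        launch()
        sync()  # wait after every launch so the sample covers exactly one run
        samples.append(time.perf_counter() - start)
    return samples
```

The batch version gives only an average, which is why min, max, and median are not available from it. The per-run version yields those statistics but adds a synchronization to every sample.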
Time to start. We are still seeing too many inconsistencies in driver results, with no answer as to why. This will help.