VMVX+ukernel+mmt4d+arm64 benchmarks
This set of benchmarks corresponds to the workloads that I'm tracking in ongoing work.
This replaces the existing VMVX benchmarks. The comment there suggested that the point was just to have some VMVX coverage at all, without caring much about the particulars.
Note to reviewers: GROUP_NAME and TARGET_ARCHITECTURE are kept unchanged because the benchmarking scripts and CI build scripts depend on them. It could make sense to change them at some point, but those other places would have to be adjusted simultaneously. In fact, it arguably makes more sense to have "arm64,i8mm" be part of the "benchmarking mode" rather than the target architecture: VMVX is the actual target architecture, and "arm64,i8mm" is merely a tuning hint.
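To illustrate that distinction, here is a minimal sketch (plain Python, not the actual benchmark-suite definition code, and all names are hypothetical) of how a benchmark entry could keep VMVX as the target architecture while carrying "arm64,i8mm" only as a tuning hint:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical model of one benchmark configuration, for illustration only.
# In the real suite these values live in the CMake/CI benchmark definitions.
@dataclass
class BenchmarkConfig:
    group_name: str               # kept stable so CI scripts keep matching
    target_architecture: str      # what the module is actually compiled for
    benchmark_modes: List[str]    # e.g. full-inference, default-flags
    tuning_hints: List[str] = field(default_factory=list)  # not an ABI requirement

# VMVX is the real target; "arm64,i8mm" only hints at which ukernel tilings to pick.
vmvx_mmt4d = BenchmarkConfig(
    group_name="vmvx-ukernel-mmt4d",   # hypothetical name
    target_architecture="vmvx",
    benchmark_modes=["full-inference", "experimental-flags"],
    tuning_hints=["arm64", "i8mm"],
)

print(vmvx_mmt4d)
```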
Abbreviated Benchmark Summary
@ commit 780d182c31e833f868617a3dbc27e4266529da84 (no previous benchmark results to compare against since eae331bc9e72915a3490a964b52b59b1898fa6ec)
Raw Latencies
| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
|---|---|---|---|
| DeepLabV3 [fp32] (TFLite) full-inference,default-flags with IREE-Vulkan @ XT2201-2 (GPU-Adreno-730) | 41.217 | 41.749 | 1.317 |
| DeepLabV3 [fp32] (TFLite) full-inference,experimental-flags with IREE-Vulkan @ XT2201-2 (GPU-Adreno-730) | 40.280 | 40.598 | 1.169 |
| MobileBertSquad [fp32] (TFLite) full-inference,default-flags with IREE-Vulkan @ XT2201-2 (GPU-Adreno-730) | 288.811 | 294.673 | 16.533 |
[Top 3 of 183 results shown]
No improved or regressed compilation metrics 🏖️
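For context on how numbers like those in the table are typically derived, here is a small sketch (not the benchmark bot's actual code) that computes the average, median, and standard deviation in milliseconds from raw per-iteration latency samples; the sample values below are made up:

```python
import statistics

# Made-up raw per-iteration latencies, in milliseconds, for one benchmark.
samples_ms = [41.9, 40.2, 42.5, 39.8, 41.7, 40.6, 42.1, 41.3]

average = statistics.mean(samples_ms)
median = statistics.median(samples_ms)
stddev = statistics.stdev(samples_ms)  # sample standard deviation

print(f"Average Latency (ms): {average:.3f}")
print(f"Median Latency (ms):  {median:.3f}")
print(f"Latency Stddev (ms):  {stddev:.3f}")
```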
PTAL! Just realized this wasn't merged.
OK, let's revisit when vmvx latencies come down; they are currently >100x what they should be, so it's reasonable to expect that to change as we start looking into vmvx performance.
If it would be useful to turn on mmt4d but leave it at 4 cores, I think that might be fine. That should at least give you some signal about whether it's useful :-)
Thanks for the suggestion, but it's more useful to me to have a local 1-thread benchmark for now.
Also, when something is 100x to 1000x slower than it should be, we should fix that rather than throw more cores at it :-)
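For reference, a local 1-thread run like the one mentioned above could look roughly like the sketch below. The binary and flag names are assumptions on my part (I'm guessing at iree-benchmark-module with --task_topology_group_count, and the module/function flags are placeholders that differ across IREE versions), so check --help before relying on them:

```python
import subprocess

# Rough sketch of pinning the local-task executor to a single worker thread.
# Flag names are assumptions and may differ between IREE versions.
cmd = [
    "iree-benchmark-module",
    "--device=local-task",
    "--task_topology_group_count=1",            # single worker thread
    "--module=/path/to/compiled_module.vmfb",   # placeholder path
    "--function=main",                          # placeholder entry point
]
subprocess.run(cmd, check=True)
```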