VMVX+ukernel+mmt4d+arm64 benchmarks
This set of benchmarks corresponds to the workloads that I'm tracking in ongoing work.
This replaces the existing VMVX benchmarks. The comment there suggested that the point was just to have some VMVX coverage at all, without caring much about the particulars.
Note to reviewers: GROUP_NAME and TARGET_ARCHITECTURE are kept unchanged because the benchmarking scripts and CI build scripts depend on them. It could make sense to change them at some point, but those other places would have to be adjusted simultaneously. In fact, it arguably makes more sense to have "arm64,i8mm" be part of the "benchmarking mode" rather than the target architecture: VMVX is the actual target architecture, and "arm64,i8mm" is merely a tuning hint.
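To illustrate that distinction, here is a minimal sketch (plain Python, not the actual benchmark-suite definition code, and all names are hypothetical) of how a benchmark entry could keep VMVX as the target architecture while carrying "arm64,i8mm" only as a tuning hint:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical model of one benchmark configuration, for illustration only.
# In the real suite these values live in the CMake/CI benchmark definitions.
@dataclass
class BenchmarkConfig:
    group_name: str               # kept stable so CI scripts keep matching
    target_architecture: str      # what the module is actually compiled for
    benchmark_modes: List[str]    # e.g. full-inference, default-flags
    tuning_hints: List[str] = field(default_factory=list)  # not an ABI requirement

# VMVX is the real target; "arm64,i8mm" only hints at which ukernel tilings to pick.
vmvx_mmt4d = BenchmarkConfig(
    group_name="vmvx-ukernel-mmt4d",   # hypothetical name
    target_architecture="vmvx",
    benchmark_modes=["full-inference", "experimental-flags"],
    tuning_hints=["arm64", "i8mm"],
)

print(vmvx_mmt4d)
```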
Abbreviated Benchmark Summary
@ commit 780d182c31e833f868617a3dbc27e4266529da84 (no previous benchmark results to compare against since eae331bc9e72915a3490a964b52b59b1898fa6ec)
Raw Latencies
| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
|---|---|---|---|
| DeepLabV3 [fp32] (TFLite) full-inference,default-flags with IREE-Vulkan @ XT2201-2 (GPU-Adreno-730) | 41.217 | 41.749 | 1.317 |
| DeepLabV3 [fp32] (TFLite) full-inference,experimental-flags with IREE-Vulkan @ XT2201-2 (GPU-Adreno-730) | 40.280 | 40.598 | 1.169 |
| MobileBertSquad [fp32] (TFLite) full-inference,default-flags with IREE-Vulkan @ XT2201-2 (GPU-Adreno-730) | 288.811 | 294.673 | 16.533 |
[Top 3 of 183 results shown]
No improved or regressed compilation metrics 🏖️
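For context on how numbers like those in the table are typically derived, here is a small sketch (not the benchmark bot's actual code) that computes the average, median, and standard deviation in milliseconds from raw per-iteration latency samples; the sample values below are made up:

```python
import statistics

# Made-up raw per-iteration latencies, in milliseconds, for one benchmark.
samples_ms = [41.9, 40.2, 42.5, 39.8, 41.7, 40.6, 42.1, 41.3]

average = statistics.mean(samples_ms)
median = statistics.median(samples_ms)
stddev = statistics.stdev(samples_ms)  # sample standard deviation

print(f"Average Latency (ms): {average:.3f}")
print(f"Median Latency (ms):  {median:.3f}")
print(f"Latency Stddev (ms):  {stddev:.3f}")
```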
PTAL! Just realized this wasn't merged.
OK, let's revisit when vmvx latencies come down; they are currently >100x what they should be, so it's reasonable to expect that to change as we start looking into vmvx performance.
If it would be useful to turn on mmt4d but leave it at 4 cores, I think that might be fine. That should at least give you some signal about whether it's useful :-)
Thanks for the suggestion, but it's more useful to me to have a local 1-thread benchmark for now.
Also, when something is 100x to 1000x slower than it should be, we should fix that rather than throw more cores at it :-)
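For reference, a local 1-thread run like the one mentioned above could look roughly like the sketch below. The binary and flag names are assumptions on my part (I'm guessing at iree-benchmark-module with --task_topology_group_count, and the module/function flags are placeholders that differ across IREE versions), so check --help before relying on them:

```python
import subprocess

# Rough sketch of pinning the local-task executor to a single worker thread.
# Flag names are assumptions and may differ between IREE versions.
cmd = [
    "iree-benchmark-module",
    "--device=local-task",
    "--task_topology_group_count=1",            # single worker thread
    "--module=/path/to/compiled_module.vmfb",   # placeholder path
    "--function=main",                          # placeholder entry point
]
subprocess.run(cmd, check=True)
```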