[WIP] Benchmark pad + conv on CPU
Abbreviated Android Benchmark Summary
@ commit dfe79ebb73d0348e3bbcd581b02624212b508ac7 (vs. base 8d1e6abc3a7e0e9734cd28128905b64af582ec55)
Regressed Latencies 🚩
| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
|---|---|---|---|
| MobileNetV3Small [fp32,imagenet] (TFLite) little-core,full-inference,experimental-flags with IREE-LLVM-CPU-Sync @ Pixel-4 (CPU-ARMv8.2-A) | 66.507 (vs. 59.634, 11.53%↑) | 66.403 | 0.338 |
| MobileNetV3Small [fp32,imagenet] (TFLite) little-core,full-inference,default-flags with IREE-LLVM-CPU-Sync @ Pixel-4 (CPU-ARMv8.2-A) | 77.357 (vs. 71.954, 7.51%↑) | 77.477 | 0.398 |
| MobileNetV3Small [fp32,imagenet] (TFLite) big-core,full-inference,experimental-flags with IREE-LLVM-CPU-Sync @ Pixel-4 (CPU-ARMv8.2-A) | 14.436 (vs. 13.454, 7.30%↑) | 14.440 | 0.021 |
[Top 3 out of 4 results shown]
Improved Latencies 🎉
| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
|---|---|---|---|
| MobileSSD [fp32] (TFLite) 4-thread,big-core,full-inference,experimental-flags with IREE-LLVM-CPU @ Pixel-4 (CPU-ARMv8.2-A) | 40.209 (vs. 49.689, 19.08%↓) | 40.189 | 0.123 |
| MobileSSD [fp32] (TFLite) 4-thread,big-core,full-inference,default-flags with IREE-LLVM-CPU @ Pixel-6-Pro (CPU-ARMv8.2-A) | 39.199 (vs. 47.462, 17.41%↓) | 39.426 | 0.788 |
| MobileNetV2 [fp32,imagenet] (TFLite) 4-thread,big-core,full-inference,default-flags with IREE-LLVM-CPU @ Pixel-6-Pro (CPU-ARMv8.2-A) | 14.900 (vs. 17.947, 16.98%↓) | 14.893 | 0.257 |
[Top 3 out of 30 results shown]
No improved or regressed compilation metrics 🏖️
Abbreviated Linux Benchmark Summary
@ commit dfe79ebb73d0348e3bbcd581b02624212b508ac7 (vs. base 8d1e6abc3a7e0e9734cd28128905b64af582ec55)
Regressed Latencies 🚩
| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
|---|---|---|---|
| MobileNetV3Small [fp32,imagenet] (TFLite) full-inference,default-flags with IREE-LLVM-CPU-Sync @ GCP-c2-standard-16 (CPU-x86_64-CascadeLake) | 4.829 (vs. 4.350, 11.01%↑) | 4.827 | 0.031 |
| MobileNetV3Small [fp32,imagenet] (TFLite) 1-thread,full-inference,default-flags with IREE-LLVM-CPU @ GCP-c2-standard-16 (CPU-x86_64-CascadeLake) | 4.916 (vs. 4.525, 8.65%↑) | 4.920 | 0.023 |
Improved Latencies 🎉
| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
|---|---|---|---|
| MobileNetV2 [fp32,imagenet] (TFLite) 8-thread,full-inference,default-flags with IREE-LLVM-CPU @ GCP-c2-standard-16 (CPU-x86_64-CascadeLake) | 4.234 (vs. 5.982, 29.22%↓) | 4.242 | 0.018 |
| MobileNetV2 [fp32,imagenet] (TFLite) 4-thread,full-inference,default-flags with IREE-LLVM-CPU @ GCP-c2-standard-16 (CPU-x86_64-CascadeLake) | 4.932 (vs. 6.930, 28.84%↓) | 4.933 | 0.027 |
| MobileSSD [fp32] (TFLite) 8-thread,full-inference,default-flags with IREE-LLVM-CPU @ GCP-c2-standard-16 (CPU-x86_64-CascadeLake) | 9.043 (vs. 12.648, 28.50%↓) | 8.963 | 0.144 |
[Top 3 out of 17 results shown]
Regressed Compilation Times 🚩
| Benchmark Name | Compilation Time (ms) |
|---|---|
| DeepLabV3 [fp32] (TFLite) CPU-x86_64-CascadeLake 8-thread,full-inference,default-flags | 13212 (vs. 11837, 11.62%↑) |
| DeepLabV3 [fp32] (TFLite) CPU-x86_64-CascadeLake full-inference,default-flags | 13212 (vs. 11837, 11.62%↑) |
| DeepLabV3 [fp32] (TFLite) CPU-x86_64-CascadeLake 4-thread,full-inference,default-flags | 13212 (vs. 11837, 11.62%↑) |
[Top 3 out of 17 results shown]
Improved Compilation Times 🎉
| Benchmark Name | Compilation Time (ms) |
|---|---|
| PersonDetect [int8] (TFLite) CPU-x86_64-CascadeLake 8-thread,full-inference,default-flags | 7465 (vs. 8217, 9.15%↓) |
| PersonDetect [int8] (TFLite) CPU-x86_64-CascadeLake full-inference,default-flags | 7465 (vs. 8217, 9.15%↓) |
| PersonDetect [int8] (TFLite) CPU-x86_64-CascadeLake 4-thread,full-inference,default-flags | 7465 (vs. 8217, 9.15%↓) |
[Top 3 out of 5 results shown]
Regressed Total Dispatch Sizes 🚩
| Benchmark Name | Total Dispatch Size (bytes) |
|---|---|
| MobileNetV3Small [fp32,imagenet] (TFLite) CPU-x86_64-CascadeLake 8-thread,full-inference,default-flags | 247976 (vs. 212776, 16.54%↑) |
| MobileNetV3Small [fp32,imagenet] (TFLite) CPU-x86_64-CascadeLake full-inference,default-flags | 247976 (vs. 212776, 16.54%↑) |
| MobileNetV3Small [fp32,imagenet] (TFLite) CPU-x86_64-CascadeLake 4-thread,full-inference,default-flags | 247976 (vs. 212776, 16.54%↑) |
[Top 3 out of 22 results shown]
Improved Total Dispatch Sizes 🎉
| Benchmark Name | Total Dispatch Size (bytes) |
|---|---|
| PersonDetect [int8] (TFLite) CPU-RV64-Generic full-inference,default-flags | 66392 (vs. 81464, 18.50%↓) |
| PersonDetect [int8] (TFLite) CPU-RV32-Generic full-inference,default-flags | 197624 (vs. 220680, 10.45%↓) |
Note: the DeepLabV3 regression on Pixel 6 is mostly from pad + depthwise_conv. The biggest dispatch now takes 14 ms (vs. 2.18 ms).
Here is an example input IR:
```mlir
%206 = "tosa.depthwise_conv2d"(%204, %205, %88) {dilation = [4, 4], pad = [4, 4, 4, 4], stride = [1, 1]} : (tensor<1x33x33x480xf32>, tensor<3x3x480x1xf32>, tensor<480xf32>) -> tensor<1x33x33x480xf32>
```
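For context, here is a minimal hand-written sketch of what this op looks like after the standard tosa-to-linalg lowering (the `%input`, `%filter`, and `%init` names are hypothetical, not taken from the actual dispatch): the `pad` attribute becomes an explicit `tensor.pad` producer feeding the depthwise convolution, which is the "pad + depthwise_conv" pairing this PR benchmarks.

```mlir
// Zero-fill value for the padded border.
%zero = arith.constant 0.000000e+00 : f32
// pad = [4, 4, 4, 4] pads the H and W dims of the NHWC input: 33 + 4 + 4 = 41.
%padded = tensor.pad %input low[0, 4, 4, 0] high[0, 4, 4, 0] {
^bb0(%d0: index, %d1: index, %d2: index, %d3: index):
  tensor.yield %zero : f32
} : tensor<1x33x33x480xf32> to tensor<1x41x41x480xf32>
// The depthwise multiplier is 1, so the 3x3x480x1 filter collapses to 3x3x480.
// With dilation 4 the effective kernel extent is 1 + (3 - 1) * 4 = 9, so the
// output spatial size is 41 - 9 + 1 = 33, matching the TOSA result type.
%conv = linalg.depthwise_conv_2d_nhwc_hwc
    {dilations = dense<4> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>}
    ins(%padded, %filter : tensor<1x41x41x480xf32>, tensor<3x3x480xf32>)
    outs(%init : tensor<1x33x33x480xf32>) -> tensor<1x33x33x480xf32>
// The per-channel bias (%88 above) is added by a separate elementwise op.
```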