
[WIP] Benchmark pad + conv on CPU

hanhanW opened this issue 3 years ago · 3 comments

hanhanW · Sep 09 '22 22:09

Abbreviated Benchmark Summary

@ commit dfe79ebb73d0348e3bbcd581b02624212b508ac7 (vs. base 8d1e6abc3a7e0e9734cd28128905b64af582ec55)

Regressed Latencies 🚩

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
| --- | --- | --- | --- |
| MobileNetV3Small [fp32,imagenet] (TFLite) little-core,full-inference,experimental-flags with IREE-LLVM-CPU-Sync @ Pixel-4 (CPU-ARMv8.2-A) | 66.507 (vs. 59.634, 11.53%↑) | 66.403 | 0.338 |
| MobileNetV3Small [fp32,imagenet] (TFLite) little-core,full-inference,default-flags with IREE-LLVM-CPU-Sync @ Pixel-4 (CPU-ARMv8.2-A) | 77.357 (vs. 71.954, 7.51%↑) | 77.477 | 0.398 |
| MobileNetV3Small [fp32,imagenet] (TFLite) big-core,full-inference,experimental-flags with IREE-LLVM-CPU-Sync @ Pixel-4 (CPU-ARMv8.2-A) | 14.436 (vs. 13.454, 7.30%↑) | 14.440 | 0.021 |

[Top 3 out of 4 results shown]

Improved Latencies 🎉

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
| --- | --- | --- | --- |
| MobileSSD [fp32] (TFLite) 4-thread,big-core,full-inference,experimental-flags with IREE-LLVM-CPU @ Pixel-4 (CPU-ARMv8.2-A) | 40.209 (vs. 49.689, 19.08%↓) | 40.189 | 0.123 |
| MobileSSD [fp32] (TFLite) 4-thread,big-core,full-inference,default-flags with IREE-LLVM-CPU @ Pixel-6-Pro (CPU-ARMv8.2-A) | 39.199 (vs. 47.462, 17.41%↓) | 39.426 | 0.788 |
| MobileNetV2 [fp32,imagenet] (TFLite) 4-thread,big-core,full-inference,default-flags with IREE-LLVM-CPU @ Pixel-6-Pro (CPU-ARMv8.2-A) | 14.900 (vs. 17.947, 16.98%↓) | 14.893 | 0.257 |

[Top 3 out of 30 results shown]

No improved or regressed compilation metrics 🏖️

For more information:

iree-github-actions-bot · Sep 13 '22 23:09

Abbreviated Linux Benchmark Summary

@ commit dfe79ebb73d0348e3bbcd581b02624212b508ac7 (vs. base 8d1e6abc3a7e0e9734cd28128905b64af582ec55)

Regressed Latencies 🚩

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
| --- | --- | --- | --- |
| MobileNetV3Small [fp32,imagenet] (TFLite) full-inference,default-flags with IREE-LLVM-CPU-Sync @ GCP-c2-standard-16 (CPU-x86_64-CascadeLake) | 4.829 (vs. 4.350, 11.01%↑) | 4.827 | 0.031 |
| MobileNetV3Small [fp32,imagenet] (TFLite) 1-thread,full-inference,default-flags with IREE-LLVM-CPU @ GCP-c2-standard-16 (CPU-x86_64-CascadeLake) | 4.916 (vs. 4.525, 8.65%↑) | 4.920 | 0.023 |

Improved Latencies 🎉

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
| --- | --- | --- | --- |
| MobileNetV2 [fp32,imagenet] (TFLite) 8-thread,full-inference,default-flags with IREE-LLVM-CPU @ GCP-c2-standard-16 (CPU-x86_64-CascadeLake) | 4.234 (vs. 5.982, 29.22%↓) | 4.242 | 0.018 |
| MobileNetV2 [fp32,imagenet] (TFLite) 4-thread,full-inference,default-flags with IREE-LLVM-CPU @ GCP-c2-standard-16 (CPU-x86_64-CascadeLake) | 4.932 (vs. 6.930, 28.84%↓) | 4.933 | 0.027 |
| MobileSSD [fp32] (TFLite) 8-thread,full-inference,default-flags with IREE-LLVM-CPU @ GCP-c2-standard-16 (CPU-x86_64-CascadeLake) | 9.043 (vs. 12.648, 28.50%↓) | 8.963 | 0.144 |

[Top 3 out of 17 results shown]

Regressed Compilation Times 🚩

| Benchmark Name | Compilation Time (ms) |
| --- | --- |
| DeepLabV3 [fp32] (TFLite) CPU-x86_64-CascadeLake 8-thread,full-inference,default-flags | 13212 (vs. 11837, 11.62%↑) |
| DeepLabV3 [fp32] (TFLite) CPU-x86_64-CascadeLake full-inference,default-flags | 13212 (vs. 11837, 11.62%↑) |
| DeepLabV3 [fp32] (TFLite) CPU-x86_64-CascadeLake 4-thread,full-inference,default-flags | 13212 (vs. 11837, 11.62%↑) |

[Top 3 out of 17 results shown]

Improved Compilation Times 🎉

| Benchmark Name | Compilation Time (ms) |
| --- | --- |
| PersonDetect [int8] (TFLite) CPU-x86_64-CascadeLake 8-thread,full-inference,default-flags | 7465 (vs. 8217, 9.15%↓) |
| PersonDetect [int8] (TFLite) CPU-x86_64-CascadeLake full-inference,default-flags | 7465 (vs. 8217, 9.15%↓) |
| PersonDetect [int8] (TFLite) CPU-x86_64-CascadeLake 4-thread,full-inference,default-flags | 7465 (vs. 8217, 9.15%↓) |

[Top 3 out of 5 results shown]

Regressed Total Dispatch Sizes 🚩

| Benchmark Name | Total Dispatch Size (bytes) |
| --- | --- |
| MobileNetV3Small [fp32,imagenet] (TFLite) CPU-x86_64-CascadeLake 8-thread,full-inference,default-flags | 247976 (vs. 212776, 16.54%↑) |
| MobileNetV3Small [fp32,imagenet] (TFLite) CPU-x86_64-CascadeLake full-inference,default-flags | 247976 (vs. 212776, 16.54%↑) |
| MobileNetV3Small [fp32,imagenet] (TFLite) CPU-x86_64-CascadeLake 4-thread,full-inference,default-flags | 247976 (vs. 212776, 16.54%↑) |

[Top 3 out of 22 results shown]

Improved Total Dispatch Sizes 🎉

| Benchmark Name | Total Dispatch Size (bytes) |
| --- | --- |
| PersonDetect [int8] (TFLite) CPU-RV64-Generic full-inference,default-flags | 66392 (vs. 81464, 18.50%↓) |
| PersonDetect [int8] (TFLite) CPU-RV32-Generic full-inference,default-flags | 197624 (vs. 220680, 10.45%↓) |

For more information:

iree-github-actions-bot · Sep 13 '22 23:09

Note: the DeepLabV3 regression on Pixel 6 is mostly from pad + depthwise_conv. The biggest dispatch becomes 14 ms (vs. 2.18 ms).

Here is an example input IR:

%206 = "tosa.depthwise_conv2d"(%204, %205, %88) {dilation = [4, 4], pad = [4, 4, 4, 4], stride = [1, 1]} : (tensor<1x33x33x480xf32>, tensor<3x3x480x1xf32>, tensor<480xf32>) -> tensor<1x33x33x480xf32>

hanhanW · Sep 14 '22 00:09