[WIP] Benchmark pad + conv on CPU
Abbreviated Android Benchmark Summary
@ commit dfe79ebb73d0348e3bbcd581b02624212b508ac7 (vs. base 8d1e6abc3a7e0e9734cd28128905b64af582ec55)
Regressed Latencies 🚩
| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
|---|---|---|---|
| MobileNetV3Small [fp32,imagenet] (TFLite) little-core,full-inference,experimental-flags with IREE-LLVM-CPU-Sync @ Pixel-4 (CPU-ARMv8.2-A) | 66.507 (vs. 59.634, 11.53%↑) | 66.403 | 0.338 |
| MobileNetV3Small [fp32,imagenet] (TFLite) little-core,full-inference,default-flags with IREE-LLVM-CPU-Sync @ Pixel-4 (CPU-ARMv8.2-A) | 77.357 (vs. 71.954, 7.51%↑) | 77.477 | 0.398 |
| MobileNetV3Small [fp32,imagenet] (TFLite) big-core,full-inference,experimental-flags with IREE-LLVM-CPU-Sync @ Pixel-4 (CPU-ARMv8.2-A) | 14.436 (vs. 13.454, 7.30%↑) | 14.440 | 0.021 |
[Top 3 out of 4 results shown]
Improved Latencies 🎉
| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
|---|---|---|---|
| MobileSSD [fp32] (TFLite) 4-thread,big-core,full-inference,experimental-flags with IREE-LLVM-CPU @ Pixel-4 (CPU-ARMv8.2-A) | 40.209 (vs. 49.689, 19.08%↓) | 40.189 | 0.123 |
| MobileSSD [fp32] (TFLite) 4-thread,big-core,full-inference,default-flags with IREE-LLVM-CPU @ Pixel-6-Pro (CPU-ARMv8.2-A) | 39.199 (vs. 47.462, 17.41%↓) | 39.426 | 0.788 |
| MobileNetV2 [fp32,imagenet] (TFLite) 4-thread,big-core,full-inference,default-flags with IREE-LLVM-CPU @ Pixel-6-Pro (CPU-ARMv8.2-A) | 14.900 (vs. 17.947, 16.98%↓) | 14.893 | 0.257 |
[Top 3 out of 30 results shown]
No improved or regressed compilation metrics 🏖️
Abbreviated Linux Benchmark Summary
@ commit dfe79ebb73d0348e3bbcd581b02624212b508ac7 (vs. base 8d1e6abc3a7e0e9734cd28128905b64af582ec55)
Regressed Latencies 🚩
| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
|---|---|---|---|
| MobileNetV3Small [fp32,imagenet] (TFLite) full-inference,default-flags with IREE-LLVM-CPU-Sync @ GCP-c2-standard-16 (CPU-x86_64-CascadeLake) | 4.829 (vs. 4.350, 11.01%↑) | 4.827 | 0.031 |
| MobileNetV3Small [fp32,imagenet] (TFLite) 1-thread,full-inference,default-flags with IREE-LLVM-CPU @ GCP-c2-standard-16 (CPU-x86_64-CascadeLake) | 4.916 (vs. 4.525, 8.65%↑) | 4.920 | 0.023 |
Improved Latencies 🎉
| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
|---|---|---|---|
| MobileNetV2 [fp32,imagenet] (TFLite) 8-thread,full-inference,default-flags with IREE-LLVM-CPU @ GCP-c2-standard-16 (CPU-x86_64-CascadeLake) | 4.234 (vs. 5.982, 29.22%↓) | 4.242 | 0.018 |
| MobileNetV2 [fp32,imagenet] (TFLite) 4-thread,full-inference,default-flags with IREE-LLVM-CPU @ GCP-c2-standard-16 (CPU-x86_64-CascadeLake) | 4.932 (vs. 6.930, 28.84%↓) | 4.933 | 0.027 |
| MobileSSD [fp32] (TFLite) 8-thread,full-inference,default-flags with IREE-LLVM-CPU @ GCP-c2-standard-16 (CPU-x86_64-CascadeLake) | 9.043 (vs. 12.648, 28.50%↓) | 8.963 | 0.144 |
[Top 3 out of 17 results shown]
Regressed Compilation Times 🚩
| Benchmark Name | Compilation Time (ms) |
|---|---|
| DeepLabV3 [fp32] (TFLite) CPU-x86_64-CascadeLake 8-thread,full-inference,default-flags | 13212 (vs. 11837, 11.62%↑) |
| DeepLabV3 [fp32] (TFLite) CPU-x86_64-CascadeLake full-inference,default-flags | 13212 (vs. 11837, 11.62%↑) |
| DeepLabV3 [fp32] (TFLite) CPU-x86_64-CascadeLake 4-thread,full-inference,default-flags | 13212 (vs. 11837, 11.62%↑) |
[Top 3 out of 17 results shown]
Improved Compilation Times 🎉
| Benchmark Name | Compilation Time (ms) |
|---|---|
| PersonDetect [int8] (TFLite) CPU-x86_64-CascadeLake 8-thread,full-inference,default-flags | 7465 (vs. 8217, 9.15%↓) |
| PersonDetect [int8] (TFLite) CPU-x86_64-CascadeLake full-inference,default-flags | 7465 (vs. 8217, 9.15%↓) |
| PersonDetect [int8] (TFLite) CPU-x86_64-CascadeLake 4-thread,full-inference,default-flags | 7465 (vs. 8217, 9.15%↓) |
[Top 3 out of 5 results shown]
Regressed Total Dispatch Sizes 🚩
| Benchmark Name | Total Dispatch Size (bytes) |
|---|---|
| MobileNetV3Small [fp32,imagenet] (TFLite) CPU-x86_64-CascadeLake 8-thread,full-inference,default-flags | 247976 (vs. 212776, 16.54%↑) |
| MobileNetV3Small [fp32,imagenet] (TFLite) CPU-x86_64-CascadeLake full-inference,default-flags | 247976 (vs. 212776, 16.54%↑) |
| MobileNetV3Small [fp32,imagenet] (TFLite) CPU-x86_64-CascadeLake 4-thread,full-inference,default-flags | 247976 (vs. 212776, 16.54%↑) |
[Top 3 out of 22 results shown]
Improved Total Dispatch Sizes 🎉
| Benchmark Name | Total Dispatch Size (bytes) |
|---|---|
| PersonDetect [int8] (TFLite) CPU-RV64-Generic full-inference,default-flags | 66392 (vs. 81464, 18.50%↓) |
| PersonDetect [int8] (TFLite) CPU-RV32-Generic full-inference,default-flags | 197624 (vs. 220680, 10.45%↓) |
Note: the DeepLabV3 regression on Pixel 6 is mostly from pad + depthwise_conv. The biggest dispatch now takes 14 ms (vs. 2.18 ms).
Here is an example input IR:
```mlir
%206 = "tosa.depthwise_conv2d"(%204, %205, %88) {dilation = [4, 4], pad = [4, 4, 4, 4], stride = [1, 1]} : (tensor<1x33x33x480xf32>, tensor<3x3x480x1xf32>, tensor<480xf32>) -> tensor<1x33x33x480xf32>
```
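For context, here is a minimal hand-written sketch of what this op looks like after the standard tosa-to-linalg lowering (the `%input`, `%filter`, and `%init` names are hypothetical, not taken from the actual dispatch): the `pad` attribute becomes an explicit `tensor.pad` producer feeding the depthwise convolution, which is the "pad + depthwise_conv" pairing this PR benchmarks.

```mlir
// Zero-fill value for the padded border.
%zero = arith.constant 0.000000e+00 : f32
// pad = [4, 4, 4, 4] pads the H and W dims of the NHWC input: 33 + 4 + 4 = 41.
%padded = tensor.pad %input low[0, 4, 4, 0] high[0, 4, 4, 0] {
^bb0(%d0: index, %d1: index, %d2: index, %d3: index):
  tensor.yield %zero : f32
} : tensor<1x33x33x480xf32> to tensor<1x41x41x480xf32>
// The depthwise multiplier is 1, so the 3x3x480x1 filter collapses to 3x3x480.
// With dilation 4 the effective kernel extent is 1 + (3 - 1) * 4 = 9, so the
// output spatial size is 41 - 9 + 1 = 33, matching the TOSA result type.
%conv = linalg.depthwise_conv_2d_nhwc_hwc
    {dilations = dense<4> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>}
    ins(%padded, %filter : tensor<1x41x41x480xf32>, tensor<3x3x480xf32>)
    outs(%init : tensor<1x33x33x480xf32>) -> tensor<1x33x33x480xf32>
// The per-channel bias (%88 above) is added by a separate elementwise op.
```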