
Adds passes for (pad + consumer) fusion to the CPU pipeline and enables benchmarks.

hanhanW opened this issue on Sep 16 '22

The benchmarks are tracked under experimental-flags.

hanhanW avatar Sep 16 '22 23:09 hanhanW

Abbreviated Linux Benchmark Summary

@ commit 3d9680a84ccaf7121ba5c297fd3900a9e67f96a1 (no previous benchmark results to compare against since 699f33f657fbbf5584962872c3ddcbaa6bfc6019)

Raw Latencies

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
|---|---|---|---|
| DeepLabV3 [fp32] (TFLite) 1-thread,full-inference,default-flags with IREE-LLVM-CPU @ GCP-c2-standard-16 (CPU-x86_64-CascadeLake) | 29.943 | 29.846 | 0.236 |
| DeepLabV3 [fp32] (TFLite) 1-thread,full-inference,experimental-flags with IREE-LLVM-CPU @ GCP-c2-standard-16 (CPU-x86_64-CascadeLake) | 26.824 | 26.828 | 0.057 |
| DeepLabV3 [fp32] (TFLite) 4-thread,full-inference,default-flags with IREE-LLVM-CPU @ GCP-c2-standard-16 (CPU-x86_64-CascadeLake) | 12.767 | 12.753 | 0.069 |

[Top 3 out of 76 results shown]

No improved or regressed compilation metrics 🏖️


iree-github-actions-bot avatar Sep 16 '22 23:09 iree-github-actions-bot

Abbreviated Benchmark Summary

@ commit 3d9680a84ccaf7121ba5c297fd3900a9e67f96a1 (vs. base 3470c608ebd21e9819fd403e5e3cd4e7361c7c25)

Regressed Latencies 🚩

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
|---|---|---|---|
| MobileSSD [fp32] (TFLite) 4-thread,big-core,full-inference,default-flags with IREE-LLVM-CPU @ Pixel-6-Pro (CPU-ARMv8.2-A) | 47.346 (vs. 42.248, 12.07%↑) | 47.495 | 1.481 |
| PoseNet [fp32] (TFLite) 4-thread,big-core,full-inference,default-flags with IREE-LLVM-CPU @ Pixel-4 (CPU-ARMv8.2-A) | 84.517 (vs. 77.792, 8.65%↑) | 84.382 | 1.799 |

Improved Latencies 🎉

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
|---|---|---|---|
| MobileSSD [fp32] (TFLite) 4-thread,big-core,full-inference,experimental-flags with IREE-LLVM-CPU @ Pixel-4 (CPU-ARMv8.2-A) | 42.294 (vs. 51.008, 17.08%↓) | 42.215 | 0.230 |
| MobileSSD [fp32] (TFLite) 1-thread,big-core,full-inference,experimental-flags with IREE-LLVM-CPU @ Pixel-4 (CPU-ARMv8.2-A) | 86.076 (vs. 97.439, 11.66%↓) | 86.104 | 0.067 |
| MobileSSD [fp32] (TFLite) big-core,full-inference,experimental-flags with IREE-LLVM-CPU-Sync @ Pixel-4 (CPU-ARMv8.2-A) | 96.089 (vs. 108.043, 11.06%↓) | 96.116 | 0.196 |

[Top 3 out of 8 results shown]

No improved or regressed compilation metrics 🏖️


iree-github-actions-bot avatar Sep 16 '22 23:09 iree-github-actions-bot
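A note on reading the deltas above (editor's sketch, not part of the bot output): each percentage appears to be the relative change of the new average latency against the base commit's average. A quick check in Python against two MobileSSD rows from the tables:

```python
# Minimal check of the percentage deltas reported in the tables above:
# delta = (new_average - base_average) / base_average.

def relative_change(new_ms: float, base_ms: float) -> float:
    """Percentage change of the new average latency relative to the base."""
    return (new_ms - base_ms) / base_ms * 100.0

# MobileSSD 4-thread @ Pixel-6-Pro, default-flags (regressed row).
print(f"{relative_change(47.346, 42.248):+.2f}%")  # +12.07%

# MobileSSD 4-thread @ Pixel-4, experimental-flags (improved row).
print(f"{relative_change(42.294, 51.008):+.2f}%")  # -17.08%
```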

Thanks, Hanhan! Could you please elaborate on why we have to limit the fusion transformation to x86 CPUs and can't apply it generally to all the backends?

dcaballe avatar Sep 26 '22 22:09 dcaballe

The regressions seem real to me?

MaheshRavishankar avatar Sep 26 '22 22:09 MaheshRavishankar

> Thanks, Hanhan! Could you please elaborate on why we have to limit the fusion transformation to x86 CPUs and can't apply it generally to all the backends?

Ideally the default should be fusion with the producer. We have all the pieces to plumb that through, but it needs to be owned by someone to connect it end to end.

MaheshRavishankar avatar Sep 26 '22 22:09 MaheshRavishankar
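For readers not familiar with the transformation under discussion, below is a rough NumPy sketch (editor's illustration, not IREE/MLIR code) of what fusing a pad op into its consumer buys: the zero-padded intermediate tensor is never materialized, and the boundary handling moves into the consumer's own loop. The 1-D correlation consumer and all names here are purely illustrative.

```python
# A rough sketch of pad + consumer fusion: instead of materializing a
# zero-padded tensor and then running the consumer on it, the padding is
# folded into the consumer's loop, so no intermediate buffer is allocated.

import numpy as np

def pad_then_consume(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Unfused version: materialize the padded tensor, then correlate."""
    p = k.shape[0] // 2
    xp = np.pad(x, p)                      # extra buffer of size n + 2p
    n = x.shape[0]
    return np.array([np.dot(xp[i:i + k.shape[0]], k) for i in range(n)])

def consume_with_fused_pad(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Fused version: handle the zero boundary inside the consumer loop."""
    p = k.shape[0] // 2
    n = x.shape[0]
    out = np.zeros(n)
    for i in range(n):
        for j in range(k.shape[0]):
            src = i + j - p
            if 0 <= src < n:               # out-of-bounds reads are implicit zeros
                out[i] += x[src] * k[j]
    return out

x = np.arange(8, dtype=np.float64)
k = np.array([1.0, 2.0, 1.0])
assert np.allclose(pad_then_consume(x, k), consume_with_fused_pad(x, k))
```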

> The regressions seem real to me?

Weird... I thought all the regressions had been caught and addressed. I did look into the DeepLab case; let me take another look at it.

hanhanW avatar Sep 26 '22 23:09 hanhanW

I see the regression is gone with this. Great!

MaheshRavishankar avatar Oct 03 '22 20:10 MaheshRavishankar

I double-checked the benchmark flags; are we missing `--iree-flow-enable-fuse-padding-into-linalg-consumer-ops` for the ARMv8 local-sync benchmark? (line 255)

pzread avatar Nov 08 '22 18:11 pzread

> I double-checked the benchmark flags; are we missing `--iree-flow-enable-fuse-padding-into-linalg-consumer-ops` for the ARMv8 local-sync benchmark? (line 255)

Yes, you're right. Good catch.

hanhanW avatar Nov 08 '22 19:11 hanhanW