Adds passes for (pad + consumer) fusion to the CPU pipeline and enables benchmarks.
The benchmarks are tracked under experimental-flags.
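For context, a minimal MLIR sketch of the pattern this kind of fusion targets: a `tensor.pad` whose result feeds a Linalg consumer. All names and shapes below are illustrative, not taken from this PR.

```mlir
// Illustrative IR: a pad op sitting between a producer and a linalg consumer.
%cst = arith.constant 0.0 : f32
%padded = tensor.pad %input low[0, 1] high[0, 1] {
^bb0(%i: index, %j: index):
  tensor.yield %cst : f32
} : tensor<8x8xf32> to tensor<8x10xf32>

// With pad-into-consumer fusion enabled (tracked here under
// experimental-flags), the padding is folded into the consumer
// instead of materializing %padded as a separate tensor.
%result = linalg.matmul
    ins(%padded, %weights : tensor<8x10xf32>, tensor<10x4xf32>)
    outs(%init : tensor<8x4xf32>) -> tensor<8x4xf32>
```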
Abbreviated Linux Benchmark Summary
@ commit 3d9680a84ccaf7121ba5c297fd3900a9e67f96a1 (no previous benchmark results to compare against since 699f33f657fbbf5584962872c3ddcbaa6bfc6019)
Raw Latencies
Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
---|---|---|---|
DeepLabV3 [fp32] (TFLite) 1-thread,full-inference,default-flags with IREE-LLVM-CPU @ GCP-c2-standard-16 (CPU-x86_64-CascadeLake) | 29.943 | 29.846 | 0.236 |
DeepLabV3 [fp32] (TFLite) 1-thread,full-inference,experimental-flags with IREE-LLVM-CPU @ GCP-c2-standard-16 (CPU-x86_64-CascadeLake) | 26.824 | 26.828 | 0.057 |
DeepLabV3 [fp32] (TFLite) 4-thread,full-inference,default-flags with IREE-LLVM-CPU @ GCP-c2-standard-16 (CPU-x86_64-CascadeLake) | 12.767 | 12.753 | 0.069 |
[Top 3 out of 76 results shown]
No improved or regressed compilation metrics 🏖️
Abbreviated Benchmark Summary
@ commit 3d9680a84ccaf7121ba5c297fd3900a9e67f96a1 (vs. base 3470c608ebd21e9819fd403e5e3cd4e7361c7c25)
Regressed Latencies 🚩
Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
---|---|---|---|
MobileSSD [fp32] (TFLite) 4-thread,big-core,full-inference,default-flags with IREE-LLVM-CPU @ Pixel-6-Pro (CPU-ARMv8.2-A) | 47.346 (vs. 42.248, 12.07%↑) | 47.495 | 1.481 |
PoseNet [fp32] (TFLite) 4-thread,big-core,full-inference,default-flags with IREE-LLVM-CPU @ Pixel-4 (CPU-ARMv8.2-A) | 84.517 (vs. 77.792, 8.65%↑) | 84.382 | 1.799 |
Improved Latencies 🎉
Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
---|---|---|---|
MobileSSD [fp32] (TFLite) 4-thread,big-core,full-inference,experimental-flags with IREE-LLVM-CPU @ Pixel-4 (CPU-ARMv8.2-A) | 42.294 (vs. 51.008, 17.08%↓) | 42.215 | 0.230 |
MobileSSD [fp32] (TFLite) 1-thread,big-core,full-inference,experimental-flags with IREE-LLVM-CPU @ Pixel-4 (CPU-ARMv8.2-A) | 86.076 (vs. 97.439, 11.66%↓) | 86.104 | 0.067 |
MobileSSD [fp32] (TFLite) big-core,full-inference,experimental-flags with IREE-LLVM-CPU-Sync @ Pixel-4 (CPU-ARMv8.2-A) | 96.089 (vs. 108.043, 11.06%↓) | 96.116 | 0.196 |
[Top 3 out of 8 results shown]
No improved or regressed compilation metrics 🏖️
Thanks, Hanhan! Could you please elaborate on why we have to limit the fusion transformation to x86 CPUs and can't apply it generally to all backends?
The regressions seem real to me?
> Thanks, Hanhan! Could you please elaborate on why we have to limit the fusion transformation to x86 CPUs and can't apply it generally to all backends?
Ideally the default should be fusion with the producer. We have all the pieces to plumb that through, but it needs to be owned by someone to connect it end to end.
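A rough sketch of what fusion with the producer could look like, using the standard fill + insert_slice decomposition of `tensor.pad` (illustrative names and shapes, not actual pass output):

```mlir
// Decompose the pad: fill the padded destination, then insert the
// producer's result into the interior slice. Fusing with the producer
// means the producer can compute directly into that slice.
%cst = arith.constant 0.0 : f32
%empty = tensor.empty() : tensor<8x10xf32>
%fill = linalg.fill ins(%cst : f32)
    outs(%empty : tensor<8x10xf32>) -> tensor<8x10xf32>
%padded = tensor.insert_slice %producer_result into %fill[0, 1] [8, 8] [1, 1]
    : tensor<8x8xf32> into tensor<8x10xf32>
```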
> The regressions seem real to me?
Weird... I thought all the regressions were caught and addressed. I did look into the DeepLab case; let me take a look at it again.
I see the regression is gone with this. Great!
I double-checked the benchmark flags; are we missing `--iree-flow-enable-fuse-padding-into-linalg-consumer-ops` for the ARMv8 local-sync benchmark? (line 255)
> I double-checked the benchmark flags; are we missing `--iree-flow-enable-fuse-padding-into-linalg-consumer-ops` for the ARMv8 local-sync benchmark? (line 255)
Yes, you're right. Good catch.