Adds passes for (pad + consumer) fusion to the CPU pipeline and enables benchmarks.
The benchmarks are tracked under experimental-flags.
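For context, a minimal MLIR sketch of the pattern this kind of fusion targets: a `tensor.pad` whose result feeds a Linalg consumer. All names and shapes below are illustrative, not taken from this PR.

```mlir
// Illustrative IR: a pad op sitting between a producer and a linalg consumer.
%cst = arith.constant 0.0 : f32
%padded = tensor.pad %input low[0, 1] high[0, 1] {
^bb0(%i: index, %j: index):
  tensor.yield %cst : f32
} : tensor<8x8xf32> to tensor<8x10xf32>

// With pad-into-consumer fusion enabled (tracked here under
// experimental-flags), the padding is folded into the consumer
// instead of materializing %padded as a separate tensor.
%result = linalg.matmul
    ins(%padded, %weights : tensor<8x10xf32>, tensor<10x4xf32>)
    outs(%init : tensor<8x4xf32>) -> tensor<8x4xf32>
```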
Abbreviated Linux Benchmark Summary
@ commit 3d9680a84ccaf7121ba5c297fd3900a9e67f96a1 (no previous benchmark results to compare against since 699f33f657fbbf5584962872c3ddcbaa6bfc6019)
Raw Latencies
Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
---|---|---|---|
DeepLabV3 [fp32] (TFLite) 1-thread,full-inference,default-flags with IREE-LLVM-CPU @ GCP-c2-standard-16 (CPU-x86_64-CascadeLake) | 29.943 | 29.846 | 0.236 |
DeepLabV3 [fp32] (TFLite) 1-thread,full-inference,experimental-flags with IREE-LLVM-CPU @ GCP-c2-standard-16 (CPU-x86_64-CascadeLake) | 26.824 | 26.828 | 0.057 |
DeepLabV3 [fp32] (TFLite) 4-thread,full-inference,default-flags with IREE-LLVM-CPU @ GCP-c2-standard-16 (CPU-x86_64-CascadeLake) | 12.767 | 12.753 | 0.069 |
[Top 3 out of 76 results shown]
No improved or regressed compilation metrics 🏖️
Abbreviated Benchmark Summary
@ commit 3d9680a84ccaf7121ba5c297fd3900a9e67f96a1 (vs. base 3470c608ebd21e9819fd403e5e3cd4e7361c7c25)
Regressed Latencies 🚩
Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
---|---|---|---|
MobileSSD [fp32] (TFLite) 4-thread,big-core,full-inference,default-flags with IREE-LLVM-CPU @ Pixel-6-Pro (CPU-ARMv8.2-A) | 47.346 (vs. 42.248, 12.07%↑) | 47.495 | 1.481 |
PoseNet [fp32] (TFLite) 4-thread,big-core,full-inference,default-flags with IREE-LLVM-CPU @ Pixel-4 (CPU-ARMv8.2-A) | 84.517 (vs. 77.792, 8.65%↑) | 84.382 | 1.799 |
Improved Latencies 🎉
Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
---|---|---|---|
MobileSSD [fp32] (TFLite) 4-thread,big-core,full-inference,experimental-flags with IREE-LLVM-CPU @ Pixel-4 (CPU-ARMv8.2-A) | 42.294 (vs. 51.008, 17.08%↓) | 42.215 | 0.230 |
MobileSSD [fp32] (TFLite) 1-thread,big-core,full-inference,experimental-flags with IREE-LLVM-CPU @ Pixel-4 (CPU-ARMv8.2-A) | 86.076 (vs. 97.439, 11.66%↓) | 86.104 | 0.067 |
MobileSSD [fp32] (TFLite) big-core,full-inference,experimental-flags with IREE-LLVM-CPU-Sync @ Pixel-4 (CPU-ARMv8.2-A) | 96.089 (vs. 108.043, 11.06%↓) | 96.116 | 0.196 |
[Top 3 out of 8 results shown]
No improved or regressed compilation metrics 🏖️
Thanks, Hanhan! Could you please elaborate on why we have to limit the fusion transformation to x86 CPUs and can't apply it generally to all backends?
The regressions seem real to me?
> Thanks, Hanhan! Could you please elaborate on why we have to limit the fusion transformation to x86 CPUs and can't apply it generally to all backends?
Ideally the default should be fusion with the producer. We have all the pieces to plumb that through, but it needs to be owned by someone to connect it end to end.
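A rough sketch of what fusion with the producer could look like, using the standard fill + insert_slice decomposition of `tensor.pad` (illustrative names and shapes, not actual pass output):

```mlir
// Decompose the pad: fill the padded destination, then insert the
// producer's result into the interior slice. Fusing with the producer
// means the producer can compute directly into that slice.
%cst = arith.constant 0.0 : f32
%empty = tensor.empty() : tensor<8x10xf32>
%fill = linalg.fill ins(%cst : f32)
    outs(%empty : tensor<8x10xf32>) -> tensor<8x10xf32>
%padded = tensor.insert_slice %producer_result into %fill[0, 1] [8, 8] [1, 1]
    : tensor<8x8xf32> into tensor<8x10xf32>
```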
> The regressions seem real to me?
Weird... I thought all the regressions were caught and addressed. I did look into the DeepLab case; let me take a look at it again.
I see the regression is gone with this. Great!
I double-checked the benchmark flags; are we missing `--iree-flow-enable-fuse-padding-into-linalg-consumer-ops` for the ARMv8 local-sync benchmark? (line 255)
> I double-checked the benchmark flags; are we missing `--iree-flow-enable-fuse-padding-into-linalg-consumer-ops` for the ARMv8 local-sync benchmark? (line 255)
Yes, you're right. Good catch.