move tensor lowerings out of FormDispatchWorkgroupsPass
- [x] Fill out tablegen description and summary ...
Abbreviated Benchmark Summary
@ commit c548b0c5bce9fb92de9ed64abc381c82453703ff (vs. base d792d2483a9d5d875a0f03b57c9d4f98e072c300)
Data-Tiling Comparison Table
Click to show
| Name | No-DT (baseline) | DT-Only | DT-UK |
|---|---|---|---|
| BertForMaskedLMTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 218.903 (1.0X) | N/A | 104.858 (2.1X) |
| BertLargeTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 798.932 (1.0X) | N/A | 223.272 (3.6X) |
| DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 7.155 (1.0X) | N/A | 8.541 (0.8X) |
| DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 32.133 (1.0X) | N/A | 29.804 (1.1X) |
| EfficientNetV2STF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 35.686 (1.0X) | N/A | 34.318 (1.0X) |
| EfficientNetV2STF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 275.682 (1.0X) | N/A | 228.384 (1.2X) |
| EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 5.801 (1.0X) | N/A | 5.000 (1.2X) |
| EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 26.988 (1.0X) | N/A | 13.066 (2.1X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 9.154 (1.0X) | N/A | 8.810 (1.0X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 70.398 (1.0X) | N/A | 38.495 (1.8X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 11.025 (1.0X) | N/A | 8.614 (1.3X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 88.746 (1.0X) | N/A | 40.547 (2.2X) |
| MiniLML12H384Uncased(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 12.144 (1.0X) | N/A | 12.997 (0.9X) |
| MiniLML12H384Uncased(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 79.079 (1.0X) | N/A | 56.203 (1.4X) |
| MobileBertSquad_fp16(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 34.370 (1.0X) | N/A | 61.575 (0.6X) |
| MobileBertSquad_fp16(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 179.988 (1.0X) | N/A | 185.029 (1.0X) |
| MobileBertSquad_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 33.832 (1.0X) | N/A | 61.825 (0.5X) |
| MobileBertSquad_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 177.995 (1.0X) | N/A | 189.356 (0.9X) |
| MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 66.690 (1.0X) | N/A | 61.921 (1.1X) |
| MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 490.584 (1.0X) | N/A | 214.159 (2.3X) |
| MobileNetV1_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 4.962 (1.0X) | N/A | 4.570 (1.1X) |
| MobileNetV1_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 25.806 (1.0X) | N/A | 17.619 (1.5X) |
| MobileNetV2_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 3.734 (1.0X) | N/A | 4.897 (0.8X) |
| MobileNetV2_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 11.734 (1.0X) | N/A | 11.279 (1.0X) |
| MobileNetV2_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 5.817 (1.0X) | N/A | 5.400 (1.1X) |
| MobileNetV2_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 21.579 (1.0X) | N/A | 11.902 (1.8X) |
| MobileNetV3Small_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 2.895 (1.0X) | N/A | 2.804 (1.0X) |
| MobileNetV3Small_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 2.782 (1.0X) | N/A | 2.668 (1.0X) |
| MobileSSD_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 8.382 (1.0X) | N/A | 9.876 (0.8X) |
| MobileSSD_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 34.242 (1.0X) | N/A | 31.150 (1.1X) |
| PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 0.771 (1.0X) | N/A | 0.631 (1.2X) |
| PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 0.701 (1.0X) | N/A | 0.567 (1.2X) |
| PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 4.147 (1.0X) | N/A | 5.137 (0.8X) |
| PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 17.437 (1.0X) | N/A | 18.972 (0.9X) |
| matmul_256x256x2048_i8_i4_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 7.561 (1.0X) | N/A | 7.605 (1.0X) |
| DeepLabV3_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 49.180 (1.0X) | N/A | 44.312 (1.1X) |
| DeepLabV3_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 50.315 (1.0X) | N/A | 44.665 (1.1X) |
| DeepLabV3_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 30.412 (1.0X) | N/A | 28.170 (1.1X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 92.011 (1.0X) | N/A | 22.062 (4.2X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 93.825 (1.0X) | N/A | 21.955 (4.3X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 52.598 (1.0X) | N/A | 22.039 (2.4X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 139.700 (1.0X) | N/A | 27.176 (5.1X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 139.252 (1.0X) | N/A | 29.257 (4.8X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 71.020 (1.0X) | N/A | 26.294 (2.7X) |
| MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 706.667 (1.0X) | N/A | 351.345 (2.0X) |
| MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 705.630 (1.0X) | N/A | 359.599 (2.0X) |
| MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 398.199 (1.0X) | N/A | 217.817 (1.8X) |
| MobileBertSquad_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 1115.509 (1.0X) | N/A | 304.094 (3.7X) |
| MobileBertSquad_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 1116.746 (1.0X) | N/A | 304.360 (3.7X) |
| MobileBertSquad_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 578.965 (1.0X) | N/A | 183.705 (3.2X) |
| Vit_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 2103.392 (1.0X) | N/A | 309.233 (6.8X) |
| Vit_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 2106.473 (1.0X) | N/A | 307.803 (6.8X) |
| Vit_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 1132.599 (1.0X) | N/A | 185.793 (6.1X) |
| matmul_256x256x2048_i8_i4_i32_tile_config_default(linalg) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 12.082 (1.0X) | N/A | 1.297 (9.3X) |
Regressed Latencies 🚩
| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
|---|---|---|---|
| GPT2\_117M\_TF\_1X4XI32(stablehlo) [armv8.2-a-generic-linux\_android29-llvm\_cpu][experimental-flags,no-dt] local\_task(embedded\_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 139.252 (vs. 130.806, 6.46%↑) | 139.273 | 0.208 |
| GPT2\_117M\_TF\_1X4XI32(stablehlo) [armv8.2-a-generic-linux\_android29-llvm\_cpu][experimental-flags,no-dt] local\_sync(embedded\_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 139.700 (vs. 131.366, 6.34%↑) | 139.555 | 0.437 |
Improved Latencies 🎉
| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
|---|---|---|---|
| GPT2\_117M\_TF\_1X4XI32(stablehlo) [armv8.2-a-generic-linux\_android29-llvm\_cpu][default-flags,dt-uk] local\_sync(embedded\_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 27.176 (vs. 29.273, 7.16%↓) | 27.228 | 0.927 |
| GPT2\_117M\_TF\_1X1XI32(stablehlo) [armv8.2-a-generic-linux\_android29-llvm\_cpu][default-flags,dt-uk] local\_task(embedded\_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 21.955 (vs. 23.626, 7.07%↓) | 21.890 | 0.230 |
| MobileBertSquad\_int8(tflite) [arm-valhall-vulkan\_android31-vulkan\_spirv][experimental-flags,fuse-padding,max-concurrency] vulkan(none)[full-inference,default-flags] with default @ pixel-6-pro[gpu] | 70.351 (vs. 74.906, 6.08%↓) | 70.448 | 0.550 |
[Top 3 out of 8 results showed]
No improved or regressed compilation metrics 🏖️
For more information:
Arent there any tests that would make it easeier to use
iree-flow-dispatch-region-creation-pipeline.THis matches all my expectations. So this looks good to me!
I tried to use it on a few tests, but it seems like the test are too brittle (probably the canonicalization step in the middle of the pipeline). When running with the pipeline, the output seemed functionally correct but the mlir was different enough to fail the tests.
I'm getting a little worried about how many steps we're growing here - it makes it trickier to follow (updating the logic now requires understanding half a dozen passes) and is a lot more IR walks/verification/etc. We need to make sure this is near the end and in the future work more to refactor passes if needed instead of inserting more.
I say this because I no longer understand how this works and will not be able to change. So only one comment then I guess LGTM!
I'm feeling the same thing. Honestly, I'm not able to follow all the changes about fusion. I think at some point, I won't be able to make changes without restudying all the pieces. It'd be good if we can have an overview doc of the end-state of the refactoring. So others (like me) can pick up the logics quickly when needed.
(note: we can put the simple writeup to IREE blog: https://iree.dev/community/blog/)
Unless I am mistaken, all of this is just moving things out of monolithic passes into separate passes. This should be an NFC change. So the number of steps isnt really changing, it is just making each step a separate pass instead of adding it to the same pass. Even earlier it was always walking the whole program multiple times within the same pass (there may be better solutions than walking the program many times, but there is a reason for the cadence as it was setup). So nothing changed there. At least as separate passes you can dump the IR after each step and see whats happening.
In terms of documenting what is happening, that is upto me. I was refactoring these passes, and I realized, I missed my last commit. I can add the documentation of what each step is doing as a follow up.
@benvanik could you unblock this. I can do some follow up cleanup.