Fuse Generic Ops Generated by `gather` Lowering
https://github.com/iree-org/iree/issues/17226#issuecomment-2093718610
Abbreviated Benchmark Summary
@ commit 14d5d644869c3413c979f1f7e5bff3a96558d7f8 (vs. base 428adf2fe6c3318ad1a6f8c6ae945f700802323e)
Data-Tiling Comparison Table
| Name | No-DT (baseline) | DT-Only | DT-UK |
|---|---|---|---|
| BertForMaskedLMTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 229.620 (1.0X) | N/A | 113.440 (2.0X) |
| BertLargeTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 718.911 (1.0X) | N/A | 231.435 (3.1X) |
| DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 7.042 (1.0X) | N/A | 8.602 (0.8X) |
| DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 32.121 (1.0X) | N/A | 33.562 (1.0X) |
| EfficientNetV2STF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 34.528 (1.0X) | N/A | 33.727 (1.0X) |
| EfficientNetV2STF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 266.210 (1.0X) | N/A | 233.687 (1.1X) |
| EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 5.904 (1.0X) | N/A | 5.290 (1.1X) |
| EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 28.616 (1.0X) | N/A | 15.424 (1.9X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 8.909 (1.0X) | N/A | 8.533 (1.0X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 70.081 (1.0X) | N/A | 39.435 (1.8X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 10.532 (1.0X) | N/A | 8.255 (1.3X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 88.751 (1.0X) | N/A | 41.536 (2.1X) |
| MiniLML12H384Uncased(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 12.092 (1.0X) | N/A | 12.987 (0.9X) |
| MiniLML12H384Uncased(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 79.397 (1.0X) | N/A | 62.312 (1.3X) |
| MobileBertSquad_fp16(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 34.744 (1.0X) | N/A | 57.934 (0.6X) |
| MobileBertSquad_fp16(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 179.842 (1.0X) | N/A | 187.279 (1.0X) |
| MobileBertSquad_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 34.434 (1.0X) | N/A | 58.462 (0.6X) |
| MobileBertSquad_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 180.228 (1.0X) | N/A | 192.676 (0.9X) |
| MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 60.298 (1.0X) | N/A | 63.623 (0.9X) |
| MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 481.252 (1.0X) | N/A | 213.240 (2.3X) |
| MobileNetV1_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 4.889 (1.0X) | N/A | 4.520 (1.1X) |
| MobileNetV1_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 27.250 (1.0X) | N/A | 18.371 (1.5X) |
| MobileNetV2_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 3.680 (1.0X) | N/A | 4.927 (0.7X) |
| MobileNetV2_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 11.800 (1.0X) | N/A | 12.382 (1.0X) |
| MobileNetV2_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 5.762 (1.0X) | N/A | 5.600 (1.0X) |
| MobileNetV2_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 21.470 (1.0X) | N/A | 13.936 (1.5X) |
| MobileNetV3Small_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 2.901 (1.0X) | N/A | 3.234 (0.9X) |
| MobileNetV3Small_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 2.764 (1.0X) | N/A | 3.130 (0.9X) |
| MobileSSD_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 8.463 (1.0X) | N/A | 9.582 (0.9X) |
| MobileSSD_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 34.182 (1.0X) | N/A | 32.811 (1.0X) |
| PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 0.775 (1.0X) | N/A | 0.660 (1.2X) |
| PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 0.708 (1.0X) | N/A | 0.596 (1.2X) |
| PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 4.114 (1.0X) | N/A | 5.276 (0.8X) |
| PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 17.588 (1.0X) | N/A | 20.969 (0.8X) |
| matmul_256x256x2048_i8_i4_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 7.594 (1.0X) | N/A | 7.592 (1.0X) |
| DeepLabV3_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 49.331 (1.0X) | N/A | 77.538 (0.6X) |
| DeepLabV3_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 50.491 (1.0X) | N/A | 78.137 (0.6X) |
| DeepLabV3_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 30.345 (1.0X) | N/A | 46.240 (0.7X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 92.677 (1.0X) | N/A | 21.035 (4.4X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 93.827 (1.0X) | N/A | 21.949 (4.3X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 52.577 (1.0X) | N/A | 21.812 (2.4X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 133.041 (1.0X) | N/A | 27.282 (4.9X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 139.479 (1.0X) | N/A | 28.952 (4.8X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 75.378 (1.0X) | N/A | 26.234 (2.9X) |
| MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 690.395 (1.0X) | N/A | 365.637 (1.9X) |
| MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 701.393 (1.0X) | N/A | 363.567 (1.9X) |
| MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 395.483 (1.0X) | N/A | 223.156 (1.8X) |
| MobileBertSquad_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 1046.193 (1.0X) | N/A | 257.241 (4.1X) |
| MobileBertSquad_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 1047.262 (1.0X) | N/A | 257.292 (4.1X) |
| MobileBertSquad_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 543.116 (1.0X) | N/A | 151.883 (3.6X) |
| Vit_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 2098.163 (1.0X) | N/A | 303.333 (6.9X) |
| Vit_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 2099.427 (1.0X) | N/A | 307.269 (6.8X) |
| Vit_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 1118.831 (1.0X) | N/A | 182.836 (6.1X) |
| matmul_256x256x2048_i8_i4_i32_tile_config_default(linalg) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 12.169 (1.0X) | N/A | 1.438 (8.5X) |
Regressed Latencies 🚩
| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
|---|---|---|---|
| MobileBertSquad\_fp16(tflite) [arm-valhall-vulkan\_android31-vulkan\_spirv][experimental-flags,fuse-padding,max-concurrency,demote-f32-to-f16] vulkan(none)[full-inference,default-flags] with default @ pixel-6-pro[gpu] | 105.460 (vs. 96.287, 9.53%↑) | 107.809 | 5.163 |
| GPT2\_117M\_TF\_1X4XI32(stablehlo) [armv8.2-a-generic-linux\_android29-llvm\_cpu][experimental-flags,no-dt] local\_task(embedded\_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 139.479 (vs. 130.412, 6.95%↑) | 139.381 | 0.475 |
Improved Latencies 🎉
| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
|---|---|---|---|
| MobileBertSquad\_int8(tflite) [arm-valhall-vulkan\_android31-vulkan\_spirv][default-flags] vulkan(none)[full-inference,default-flags] with default @ pixel-6-pro[gpu] | 81.551 (vs. 95.033, 14.19%↓) | 81.556 | 0.977 |
| MobileBertSquad\_int8(tflite) [arm-valhall-vulkan\_android31-vulkan\_spirv][experimental-flags,fuse-padding,max-concurrency] vulkan(none)[full-inference,default-flags] with default @ pixel-6-pro[gpu] | 69.876 (vs. 76.311, 8.43%↓) | 69.706 | 0.753 |
| GPT2\_117M\_TF\_1X4XI32(stablehlo) [armv8.2-a-generic-linux\_android29-llvm\_cpu][default-flags,dt-uk] local\_task(embedded\_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 28.952 (vs. 30.716, 5.74%↓) | 29.174 | 0.704 |
[Top 3 out of 4 results showed]
No improved or regressed compilation metrics 🏖️
@MaheshRavishankar, I added the check for the number of uses because `inlineBlockBefore` erases the block. I have since updated it to clone the block first, so the generic is preserved.
I changed `matchAndRewrite` to match against `ExtractOp` because it seemed more natural. There are two main complications:
- Extract ops cannot be erased until iteration is finished (this was causing CI failures). Having `ExtractOp` as the root lets the passes do the iteration.
- The extract ops can be matched independently: some can fail while others are rewritten successfully. Of the two options (return success when only some of the extracts have been matched, or fail when not all have been matched), neither seems favorable.
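To illustrate the point about rooting the pattern on each extract op, here is a toy Python model. It is only a sketch and not MLIR code: `Op`, `rewrite_extract`, `greedy_driver`, and the `fusable` flag are all hypothetical names invented for this example. It shows a driver iterating over a snapshot of the ops while a per-extract pattern independently succeeds or fails, leaving the producer untouched (loosely analogous to cloning the block rather than inlining it).

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based equality, so list.index finds the exact op
class Op:
    name: str
    operands: list = field(default_factory=list)
    fusable: bool = True  # hypothetical flag: can this extract be rewritten?


def rewrite_extract(op, ops):
    """Toy pattern rooted on one extract op; returns True on success."""
    if op.name != "extract" or not op.fusable:
        return False  # match failure affects only this extract
    # Replace just this extract. The producer op is not modified or erased,
    # so other extracts that still use it remain valid.
    ops[ops.index(op)] = Op("fused_generic", operands=list(op.operands))
    return True


def greedy_driver(ops):
    """Toy driver: owns the iteration, so the pattern never erases mid-walk."""
    worklist = list(ops)  # snapshot taken before any mutation
    return sum(1 for op in worklist if rewrite_extract(op, ops))


producer = Op("generic")
ops = [
    producer,
    Op("extract", [producer]),
    Op("extract", [producer], fusable=False),  # this one fails to match
    Op("extract", [producer]),
]
count = greedy_driver(ops)
names = [op.name for op in ops]
```

In this model, two of the three extracts are rewritten and the third is left in place, without the pattern having to decide whether a partial match counts as success or failure for the whole group.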
@hanhanW I think we should incorporate @benvanik's comment: https://github.com/llvm/torch-mlir/pull/3277#issuecomment-2099387732
Yes, but that is not something we can control. This needs to be fixed higher up in the stack then.