[Codegen] Do not consider parallel regions in bufferization analysis
When a buffer defined outside of an scf.forall op is used inside it, bufferization by default unconditionally bufferizes out of place in order to avoid race conditions. However, handling parallel accesses to a buffer should generally be the responsibility of the source program; if there is a race condition, it should be handled outside of bufferization. This PR disables the parallel region check in IREE to simplify the bufferization analysis and enable more buffer reuse.
After looking into the failing test I don't think we are ready for this flip yet. We need a better way of handling shared memory.
The test failure is caused by 2 tensor.empty() ops getting tiled to the same size and then CSEd into a single empty. However, the empty ops are sort of a hacky way to represent the shared memory buffers when running GPUPromoteMatmulOperandsPass. We need a better representation of the shared memory buffers at tensor level so we don't CSE them into a single buffer. The parallelism check in bufferization has been saving us by recreating a new buffer, but it is not a good way to handle this issue.
A couple of ideas would be to create alloc_tensor ops or implement some multi-buffering with both shared memory allocs being split across a single tensor.
CC @MaheshRavishankar @qedawkins @antiagainst
Wherever the tensor.empty is created, we could create a bufferization.alloc_tensor (or whatever) op instead. IIUC this op does not CSE. This was explicitly the reason it was split out from tensor.empty.
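To make the CSE issue concrete, here is a rough sketch of the IR before and after CSE, and of the alloc_tensor alternative. The value names and the `tensor<64x32xf16>` shape are hypothetical, chosen only so that both placeholders end up identical after tiling; this is not the IR from the failing test, and the claim that alloc_tensor survives CSE follows the "IIUC" above rather than anything verified here.

```mlir
// Before CSE: two distinct placeholders, one per promoted matmul operand.
// (Hypothetical shapes; they only need to match after tiling.)
%lhs_shared = tensor.empty() : tensor<64x32xf16>
%rhs_shared = tensor.empty() : tensor<64x32xf16>

// After CSE: tensor.empty has no side effects, so the two identical ops
// are merged and both operands now refer to the same value.
%shared = tensor.empty() : tensor<64x32xf16>

// Alternative: bufferization.alloc_tensor models an allocation, so two
// identical ops are expected to stay distinct through CSE.
%lhs_alloc = bufferization.alloc_tensor() : tensor<64x32xf16>
%rhs_alloc = bufferization.alloc_tensor() : tensor<64x32xf16>
```

With the merged `%shared` value, the only thing keeping the two shared memory buffers apart was the out-of-place copy forced by the parallel region check, which is why disabling the check exposed the problem.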
https://github.com/iree-org/iree/pull/17940 fixes the test failure. Rebasing this PR on top of it for now.
Abbreviated Benchmark Summary
@ commit e38fae5421272fe594b7f0e1c09fa09543b50836 (no previous benchmark results to compare)
Data-Tiling Comparison Table
| Name | No-DT (baseline) | DT-Only | DT-UK |
|---|---|---|---|
| BertLargeTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 788.727 (1.0X) | N/A | 221.751 (3.6X) |
| DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 6.973 (1.0X) | N/A | 8.491 (0.8X) |
| EfficientNetV2STF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 36.029 (1.0X) | N/A | 34.606 (1.0X) |
| EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 5.816 (1.0X) | N/A | 5.033 (1.2X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 9.144 (1.0X) | N/A | 8.496 (1.1X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 11.037 (1.0X) | N/A | 9.005 (1.2X) |
| MiniLML12H384Uncased(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 11.998 (1.0X) | N/A | 13.714 (0.9X) |
| MobileBertSquad_fp16(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 33.706 (1.0X) | N/A | 61.553 (0.5X) |
| MobileBertSquad_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 34.471 (1.0X) | N/A | 61.947 (0.6X) |
| MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 68.923 (1.0X) | N/A | 65.935 (1.0X) |
| MobileNetV1_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 4.724 (1.0X) | N/A | 4.584 (1.0X) |
| MobileNetV2_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 3.748 (1.0X) | N/A | 4.873 (0.8X) |
| MobileNetV2_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 5.850 (1.0X) | N/A | 5.392 (1.1X) |
| MobileNetV3Small_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 2.940 (1.0X) | N/A | 2.814 (1.0X) |
| MobileSSD_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 8.446 (1.0X) | N/A | 9.832 (0.9X) |
| PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 0.777 (1.0X) | N/A | 0.610 (1.3X) |
| PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 4.152 (1.0X) | N/A | 5.182 (0.8X) |
| matmul_256x256x2048_i8_i4_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 7.581 (1.0X) | N/A | 7.573 (1.0X) |
| matmul_256x256x2048_i8_i8_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 8.377 (1.0X) | N/A | 1.807 (4.6X) |
| BertForMaskedLMTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 218.638 (1.0X) | N/A | 108.259 (2.0X) |
| DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 32.447 (1.0X) | N/A | 29.835 (1.1X) |
| EfficientNetV2STF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 275.887 (1.0X) | N/A | 229.400 (1.2X) |
| EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 26.974 (1.0X) | N/A | 13.031 (2.1X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 70.679 (1.0X) | N/A | 37.560 (1.9X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 88.576 (1.0X) | N/A | 39.623 (2.2X) |
| MiniLML12H384Uncased(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 80.567 (1.0X) | N/A | 55.983 (1.4X) |
| MobileBertSquad_fp16(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 182.101 (1.0X) | N/A | 186.203 (1.0X) |
| MobileBertSquad_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 181.315 (1.0X) | N/A | 191.411 (0.9X) |
| MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 516.233 (1.0X) | N/A | 240.753 (2.1X) |
| MobileNetV1_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 25.036 (1.0X) | N/A | 17.616 (1.4X) |
| MobileNetV2_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 11.681 (1.0X) | N/A | 11.369 (1.0X) |
| MobileNetV2_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 21.497 (1.0X) | N/A | 11.784 (1.8X) |
| MobileNetV3Small_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 2.791 (1.0X) | N/A | 2.702 (1.0X) |
| MobileSSD_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 33.799 (1.0X) | N/A | 30.969 (1.1X) |
| PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 0.709 (1.0X) | N/A | 0.548 (1.3X) |
| PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 17.478 (1.0X) | N/A | 19.371 (0.9X) |
| matmul_1x256x2048_i8_i4_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 0.054 (1.0X) | N/A | 0.054 (1.0X) |
| matmul_1x256x2048_i8_i8_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 0.042 (1.0X) | N/A | 0.021 (2.0X) |
Raw Latencies
| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
|---|---|---|---|
| BertLargeTF(stablehlo) [x86\_64-cascadelake-linux\_gnu-llvm\_cpu][default-flags,dt-uk] local\_task(embedded\_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 221.751 | 222.448 | 1.726 |
| BertLargeTF(stablehlo) [x86\_64-cascadelake-linux\_gnu-llvm\_cpu][experimental-flags,no-dt] local\_task(embedded\_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 788.727 | 780.481 | 36.115 |
| DeepLabV3\_fp32(tflite) [x86\_64-cascadelake-linux\_gnu-llvm\_cpu][default-flags,dt-uk] local\_task(embedded\_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 8.491 | 8.492 | 0.018 |
[Top 3 out of 92 results showed]
No improved or regressed compilation metrics 🏖️