[Flow] Make the output indexing_map of elementwise ops identity.
Abbreviated Benchmark Summary
@ commit c717dc470bc8120e526d7fd9d225806104ec757d (vs. base 355f56b5588ddf89565a2bb3d2a65524262d5508)
Data-Tiling Comparison Table
| Name | No-DT (baseline) | DT-Only | DT-UK |
|---|---|---|---|
| BertForMaskedLMTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 222.378 (1.0X) | N/A | 109.896 (2.0X) |
| BertLargePTBatch1(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 648.725 (1.0X) | N/A | 239.269 (2.7X) |
| BertLargeTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 684.076 (1.0X) | N/A | 229.009 (3.0X) |
| DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 32.516 (1.0X) | N/A | 33.181 (1.0X) |
| DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 6.931 (1.0X) | N/A | 8.560 (0.8X) |
| EfficientNetV2STF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 259.274 (1.0X) | N/A | 235.795 (1.1X) |
| EfficientNetV2STF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 34.800 (1.0X) | N/A | 33.983 (1.0X) |
| EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 28.771 (1.0X) | N/A | 15.156 (1.9X) |
| EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 5.924 (1.0X) | N/A | 5.273 (1.1X) |
| Falcon7bGptqPT(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 20386.278 (1.0X) | N/A | 3570.192 (5.7X) |
| Falcon7bInt4GptqPT(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 20454.522 (1.0X) | N/A | 3362.663 (6.1X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 70.194 (1.0X) | N/A | 40.323 (1.7X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 8.970 (1.0X) | N/A | 8.460 (1.1X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 87.035 (1.0X) | N/A | 42.318 (2.1X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 10.619 (1.0X) | N/A | 8.191 (1.3X) |
| MiniLML12H384Uncased(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 76.405 (1.0X) | N/A | 62.726 (1.2X) |
| MiniLML12H384Uncased(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 12.188 (1.0X) | N/A | 12.672 (1.0X) |
| MobileBertSquad_fp16(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 181.328 (1.0X) | N/A | 187.974 (1.0X) |
| MobileBertSquad_fp16(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 33.867 (1.0X) | N/A | 57.958 (0.6X) |
| MobileBertSquad_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 175.345 (1.0X) | N/A | 193.041 (0.9X) |
| MobileBertSquad_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 33.979 (1.0X) | N/A | 58.398 (0.6X) |
| MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 480.370 (1.0X) | N/A | 217.324 (2.2X) |
| MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 60.648 (1.0X) | N/A | 64.387 (0.9X) |
| MobileNetV1_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 28.306 (1.0X) | N/A | 18.290 (1.5X) |
| MobileNetV1_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 4.698 (1.0X) | N/A | 4.504 (1.0X) |
| MobileNetV2_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 11.689 (1.0X) | N/A | 12.581 (0.9X) |
| MobileNetV2_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 3.647 (1.0X) | N/A | 4.876 (0.7X) |
| MobileNetV2_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 21.482 (1.0X) | N/A | 14.051 (1.5X) |
| MobileNetV2_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 5.742 (1.0X) | N/A | 5.559 (1.0X) |
| MobileNetV3Small_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 2.994 (1.0X) | N/A | 3.101 (1.0X) |
| MobileNetV3Small_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 2.898 (1.0X) | N/A | 3.398 (0.9X) |
| MobileSSD_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 34.217 (1.0X) | N/A | 32.624 (1.0X) |
| MobileSSD_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 8.393 (1.0X) | N/A | 9.503 (0.9X) |
| PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 0.710 (1.0X) | N/A | 0.595 (1.2X) |
| PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 0.773 (1.0X) | N/A | 0.663 (1.2X) |
| PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 18.134 (1.0X) | N/A | 21.124 (0.9X) |
| PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 4.170 (1.0X) | N/A | 5.240 (0.8X) |
| matmul_1x256x2048_i8_i4_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 0.054 (1.0X) | N/A | 0.054 (1.0X) |
| matmul_1x256x2048_i8_i8_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 0.042 (1.0X) | N/A | 0.021 (2.0X) |
| matmul_256x256x2048_i8_i4_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 7.585 (1.0X) | N/A | 7.587 (1.0X) |
| matmul_256x256x2048_i8_i8_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 6.700 (1.0X) | N/A | 1.965 (3.4X) |
| DeepLabV3_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 49.397 (1.0X) | N/A | 77.112 (0.6X) |
| DeepLabV3_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 50.985 (1.0X) | N/A | 77.727 (0.7X) |
| DeepLabV3_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 30.512 (1.0X) | N/A | 45.560 (0.7X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 93.074 (1.0X) | N/A | 21.366 (4.4X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 94.325 (1.0X) | N/A | 21.724 (4.3X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 52.791 (1.0X) | N/A | 21.643 (2.4X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 129.191 (1.0X) | N/A | 26.711 (4.8X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 143.189 (1.0X) | N/A | 29.058 (4.9X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 77.525 (1.0X) | N/A | 26.195 (3.0X) |
| MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 699.227 (1.0X) | N/A | 358.705 (1.9X) |
| MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 699.613 (1.0X) | N/A | 358.422 (2.0X) |
| MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 393.486 (1.0X) | N/A | 218.369 (1.8X) |
| MobileBertSquad_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 1047.240 (1.0X) | N/A | 256.905 (4.1X) |
| MobileBertSquad_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 1048.238 (1.0X) | N/A | 246.161 (4.3X) |
| MobileBertSquad_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 541.632 (1.0X) | N/A | 147.724 (3.7X) |
| Vit_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 2103.909 (1.0X) | N/A | 311.107 (6.8X) |
| Vit_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 2104.787 (1.0X) | N/A | 311.350 (6.8X) |
| Vit_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 1120.648 (1.0X) | N/A | 186.658 (6.0X) |
| matmul_1x256x2048_i8_i4_i32_tile_config_default(linalg) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 0.080 (1.0X) | N/A | 0.016 (5.1X) |
| matmul_1x256x2048_i8_i8_i32_tile_config_default(linalg) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 0.072 (1.0X) | N/A | 0.017 (4.3X) |
| matmul_256x256x2048_i8_i4_i32_tile_config_default(linalg) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 11.914 (1.0X) | N/A | 1.422 (8.4X) |
| matmul_256x256x2048_i8_i8_i32_tile_config_default(linalg) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 16.528 (1.0X) | N/A | 1.176 (14.1X) |
Improved Latencies 🎉
| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
|---|---|---|---|
| GPT2\_117M\_TF\_1X4XI32(stablehlo) [armv8.2-a-generic-linux\_android29-llvm\_cpu][default-flags,dt-uk] local\_task(embedded\_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 29.058 (vs. 31.632, 8.14%↓) | 29.300 | 0.930 |
| MobileBertSquad\_fp32(tflite) [armv8.2-a-generic-linux\_android29-llvm\_cpu][experimental-flags,no-dt] local\_task(embedded\_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 393.486 (vs. 426.254, 7.69%↓) | 395.033 | 9.597 |
| MobileBertSquad\_int8(tflite) [armv8.2-a-generic-linux\_android29-llvm\_cpu][default-flags,dt-uk] local\_task(embedded\_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 246.161 (vs. 266.117, 7.50%↓) | 245.026 | 4.000 |
[Top 3 out of 16 results showed]
Regressed Total Dispatch Sizes 🚩
| Benchmark Name | Total Dispatch Size (bytes) |
|---|---|
| MobileBertSquad\_int8(tflite) [riscv\_32-generic-linux\_gnu-llvm\_cpu][default-flags,compile-stats] | 2513324 (vs. 2089388, 20.29%↑) |
| MobileBertSquad\_fp32(tflite) [riscv\_64-generic-linux\_gnu-llvm\_cpu][default-flags,compile-stats] | 51112 (vs. 44632, 14.52%↑) |
| MobileBertSquad\_int8(tflite) [riscv\_64-generic-linux\_gnu-llvm\_cpu][default-flags,compile-stats] | 2295240 (vs. 2009416, 14.22%↑) |
Regressed Stream IR Dispatch Count (# of cmd.dispatch ops) 🚩
| Benchmark Name | Stream IR Dispatch Count (# of cmd.dispatch ops) |
|---|---|
| BertLargePTBatch1(linalg) [x86\_64-cascadelake-linux\_gnu-llvm\_cpu][experimental-flags,no-dt,compile-stats] | 412 (vs. 388, 6.19%↑) |
| Falcon7bInt4GptqPT(linalg) [x86\_64-cascadelake-linux\_gnu-llvm\_cpu][experimental-flags,no-dt,compile-stats] | 648 (vs. 616, 5.19%↑) |
| Falcon7bGptqPT(linalg) [x86\_64-cascadelake-linux\_gnu-llvm\_cpu][experimental-flags,no-dt,compile-stats] | 780 (vs. 748, 4.28%↑) |
Improved Stream IR Dispatch Count (# of cmd.dispatch ops) 🎉
| Benchmark Name | Stream IR Dispatch Count (# of cmd.dispatch ops) |
|---|---|
| BertLargePTBatch1(linalg) [x86\_64-cascadelake-linux\_gnu-llvm\_cpu][experimental-flags,dt-only,compile-stats] | 652 (vs. 724, 9.94%↓) |
| BertLargePTBatch1(linalg) [x86\_64-cascadelake-linux\_gnu-llvm\_cpu][default-flags,dt-uk,compile-stats] | 652 (vs. 724, 9.94%↓) |
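For anyone reading the "(vs. base, N%)" and "(N.NX)" annotations in the tables above, here is a minimal sketch of how they appear to be derived. The helper names `percent_change` and `speedup` are made up for illustration and are not taken from the benchmark tooling; the input numbers are copied from the rows above.

```python
def percent_change(new: float, base: float) -> float:
    """Signed percent change of `new` relative to `base` (negative means a decrease)."""
    return (new - base) / base * 100.0


def speedup(baseline: float, candidate: float) -> float:
    """How many times faster `candidate` is than `baseline` (both are latencies in ms)."""
    return baseline / candidate


# Values copied from the tables above.
print(f"{percent_change(29.058, 31.632):+.2f}%")    # GPT2_117M_TF_1X4XI32 latency, dt-uk
print(f"{percent_change(2513324, 2089388):+.2f}%")  # MobileBertSquad_int8 riscv_32 dispatch size
print(f"{percent_change(652, 724):+.2f}%")          # BertLargePTBatch1 dispatch count, dt-uk
print(f"{speedup(222.378, 109.896):.1f}X")          # BertForMaskedLMTF, No-DT vs. DT-UK
```

A negative percent change is what the tables mark with ↓, a positive one with ↑.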
Interesting result... We don't see regressions on other backends because we don't track them in our CI. Perhaps we should check whether it regresses SDXL. Is @qedawkins the best person to check if there are regressions in the SDXL model?
The MI-250 benchmark seems to be maybe a little slower (mean 5716 ms before vs. 5732 ms after):
before:
-----------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------
BM_tokens_to_image/process_time/real_time 5776 ms 42.9 ms 1 items_per_second=0.173132/s
BM_tokens_to_image/process_time/real_time 5699 ms 36.6 ms 1 items_per_second=0.175465/s
BM_tokens_to_image/process_time/real_time 5673 ms 36.2 ms 1 items_per_second=0.176266/s
BM_tokens_to_image/process_time/real_time_mean 5716 ms 38.6 ms 3 items_per_second=0.174954/s
BM_tokens_to_image/process_time/real_time_median 5699 ms 36.6 ms 3 items_per_second=0.175465/s
BM_tokens_to_image/process_time/real_time_stddev 53.4 ms 3.74 ms 3 items_per_second=1.62797m/s
BM_tokens_to_image/process_time/real_time_cv 0.93 % 9.69 % 3 items_per_second=0.93%
after:
-----------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------
BM_tokens_to_image/process_time/real_time 5721 ms 37.8 ms 1 items_per_second=0.174806/s
BM_tokens_to_image/process_time/real_time 5721 ms 37.6 ms 1 items_per_second=0.174791/s
BM_tokens_to_image/process_time/real_time 5755 ms 37.6 ms 1 items_per_second=0.173758/s
BM_tokens_to_image/process_time/real_time_mean 5732 ms 37.7 ms 3 items_per_second=0.174452/s
BM_tokens_to_image/process_time/real_time_median 5721 ms 37.6 ms 3 items_per_second=0.174791/s
BM_tokens_to_image/process_time/real_time_stddev 19.8 ms 0.150 ms 3 items_per_second=601.121u/s
BM_tokens_to_image/process_time/real_time_cv 0.35 % 0.40 % 3 items_per_second=0.34%
Otherwise the way to check performance right now is for one of us to run it.
Edit: I pasted them in the wrong order. Fixed now.
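As a rough sanity check on "maybe a little slower", the sketch below recomputes the mean and sample standard deviation from the three per-run times pasted above and compares the difference in means against the spread. The times are the rounded millisecond values from the output, so the standard deviations come out slightly different from the ones the benchmark printed, and with only three repetitions per side this is an eyeball check, not a significance test.

```python
import statistics

# Per-run real_time values (ms), copied from the benchmark output above.
before_ms = [5776, 5699, 5673]
after_ms = [5721, 5721, 5755]

mean_before = statistics.mean(before_ms)
mean_after = statistics.mean(after_ms)
stdev_before = statistics.stdev(before_ms)  # sample standard deviation
stdev_after = statistics.stdev(after_ms)

delta_ms = mean_after - mean_before
delta_pct = delta_ms / mean_before * 100.0

print(f"before: mean={mean_before:.0f} ms, stdev={stdev_before:.1f} ms")
print(f"after:  mean={mean_after:.0f} ms, stdev={stdev_after:.1f} ms")
print(f"delta:  {delta_ms:+.1f} ms ({delta_pct:+.2f}%)")

# If the shift in means is smaller than the run-to-run spread on either side,
# three repetitions cannot distinguish it from noise.
within_noise = abs(delta_ms) < min(stdev_before, stdev_after)
print("within run-to-run noise:", within_noise)
```

On these numbers the difference in means is smaller than the spread of either set of runs, so more repetitions would be needed to call it a regression.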
Is there a way to add dispatch count to the SDXL CI, @ScottTodd? Of all things, that is the most important for now.
(Sorry closed it by mistake, reopened it)
Is there a way to add dispatch count to the SDXL CI, @ScottTodd? Of all things, that is the most important for now.
Yeah. We can adapt what's in https://github.com/iree-org/iree/blob/main/build_tools/benchmarks/collect_compilation_statistics.py, or try to plug SDXL into the existing in-tree benchmark suite. Either option will require a bit of planning.
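Until one of those options is in place, a quick stopgap could look like the sketch below. It is not how `collect_compilation_statistics.py` works (that script is not reproduced here); it only assumes that a textual dump of the Stream-level IR is available and that each dispatch shows up as a `stream.cmd.dispatch` op, matching the "# of cmd.dispatch ops" metric in the summary above. The function name `count_dispatches` and the sample text are hypothetical, and the sample is a schematic stand-in, not real compiler output.

```python
import re

# Match the op name exactly, without also matching a longer dotted identifier.
DISPATCH_OP = re.compile(r"\bstream\.cmd\.dispatch(?![\w.])")


def count_dispatches(ir_text: str) -> int:
    """Count occurrences of `stream.cmd.dispatch` in a textual IR dump."""
    return len(DISPATCH_OP.findall(ir_text))


# Schematic stand-in for a Stream IR dump (NOT real IREE output).
sample_ir = """
stream.cmd.execute {
  stream.cmd.dispatch @ex::@dispatch_0
  stream.cmd.fill
  stream.cmd.dispatch @ex::@dispatch_1
}
"""

print(count_dispatches(sample_ir))
```

Comparing that count before and after a change on the SDXL model would give the same kind of signal as the "Stream IR Dispatch Count" rows in the CI summary, just collected by hand.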