
[Flow] Make the output indexing_map of elementwise ops identity.

Open · hanhanW opened this issue 1 year ago • 7 comments

hanhanW · May 01 '24 23:05
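
For context on the change in the title: in a linalg-style elementwise op, the output indexing_map decides where each computed element is written; making it the identity means the op writes its result in iteration order and any permutation is folded into how the inputs are read instead. Below is a rough NumPy-only sketch of that idea (the shapes, the `+ 1.0` body, and the function names are made up for illustration; this is not IREE or MLIR code):

```python
import numpy as np

def elementwise_permuted_output(x):
    """Output indexing_map is the permutation (d0, d1) -> (d1, d0)."""
    out = np.empty((x.shape[1], x.shape[0]), dtype=x.dtype)
    for d0 in range(x.shape[0]):
        for d1 in range(x.shape[1]):
            # The result of iteration (d0, d1) is scattered into out[d1, d0].
            out[d1, d0] = x[d0, d1] + 1.0
    return out

def elementwise_identity_output(x):
    """Output indexing_map is the identity; the permutation moves to the input."""
    out = np.empty((x.shape[1], x.shape[0]), dtype=x.dtype)
    for d0 in range(out.shape[0]):
        for d1 in range(out.shape[1]):
            # The output is written in iteration order; the input is read permuted.
            out[d0, d1] = x[d1, d0] + 1.0
    return out

x = np.arange(6, dtype=np.float32).reshape(2, 3)
assert np.array_equal(elementwise_permuted_output(x), elementwise_identity_output(x))
```

Both forms compute the same tensor; the benchmark comparisons below measure the effect of preferring the second (identity-output) form.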

Abbreviated Benchmark Summary

@ commit c717dc470bc8120e526d7fd9d225806104ec757d (vs. base 355f56b5588ddf89565a2bb3d2a65524262d5508)

Data-Tiling Comparison Table

| Name | No-DT (baseline) | DT-Only | DT-UK |
| --- | --- | --- | --- |
| BertForMaskedLMTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 222.378 (1.0X) | N/A | 109.896 (2.0X) |
| BertLargePTBatch1(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 648.725 (1.0X) | N/A | 239.269 (2.7X) |
| BertLargeTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 684.076 (1.0X) | N/A | 229.009 (3.0X) |
| DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 32.516 (1.0X) | N/A | 33.181 (1.0X) |
| DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 6.931 (1.0X) | N/A | 8.560 (0.8X) |
| EfficientNetV2STF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 259.274 (1.0X) | N/A | 235.795 (1.1X) |
| EfficientNetV2STF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 34.800 (1.0X) | N/A | 33.983 (1.0X) |
| EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 28.771 (1.0X) | N/A | 15.156 (1.9X) |
| EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 5.924 (1.0X) | N/A | 5.273 (1.1X) |
| Falcon7bGptqPT(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 20386.278 (1.0X) | N/A | 3570.192 (5.7X) |
| Falcon7bInt4GptqPT(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 20454.522 (1.0X) | N/A | 3362.663 (6.1X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 70.194 (1.0X) | N/A | 40.323 (1.7X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 8.970 (1.0X) | N/A | 8.460 (1.1X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 87.035 (1.0X) | N/A | 42.318 (2.1X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 10.619 (1.0X) | N/A | 8.191 (1.3X) |
| MiniLML12H384Uncased(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 76.405 (1.0X) | N/A | 62.726 (1.2X) |
| MiniLML12H384Uncased(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 12.188 (1.0X) | N/A | 12.672 (1.0X) |
| MobileBertSquad_fp16(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 181.328 (1.0X) | N/A | 187.974 (1.0X) |
| MobileBertSquad_fp16(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 33.867 (1.0X) | N/A | 57.958 (0.6X) |
| MobileBertSquad_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 175.345 (1.0X) | N/A | 193.041 (0.9X) |
| MobileBertSquad_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 33.979 (1.0X) | N/A | 58.398 (0.6X) |
| MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 480.370 (1.0X) | N/A | 217.324 (2.2X) |
| MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 60.648 (1.0X) | N/A | 64.387 (0.9X) |
| MobileNetV1_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 28.306 (1.0X) | N/A | 18.290 (1.5X) |
| MobileNetV1_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 4.698 (1.0X) | N/A | 4.504 (1.0X) |
| MobileNetV2_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 11.689 (1.0X) | N/A | 12.581 (0.9X) |
| MobileNetV2_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 3.647 (1.0X) | N/A | 4.876 (0.7X) |
| MobileNetV2_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 21.482 (1.0X) | N/A | 14.051 (1.5X) |
| MobileNetV2_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 5.742 (1.0X) | N/A | 5.559 (1.0X) |
| MobileNetV3Small_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 2.994 (1.0X) | N/A | 3.101 (1.0X) |
| MobileNetV3Small_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 2.898 (1.0X) | N/A | 3.398 (0.9X) |
| MobileSSD_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 34.217 (1.0X) | N/A | 32.624 (1.0X) |
| MobileSSD_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 8.393 (1.0X) | N/A | 9.503 (0.9X) |
| PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 0.710 (1.0X) | N/A | 0.595 (1.2X) |
| PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 0.773 (1.0X) | N/A | 0.663 (1.2X) |
| PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 18.134 (1.0X) | N/A | 21.124 (0.9X) |
| PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 4.170 (1.0X) | N/A | 5.240 (0.8X) |
| matmul_1x256x2048_i8_i4_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 0.054 (1.0X) | N/A | 0.054 (1.0X) |
| matmul_1x256x2048_i8_i8_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 0.042 (1.0X) | N/A | 0.021 (2.0X) |
| matmul_256x256x2048_i8_i4_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 7.585 (1.0X) | N/A | 7.587 (1.0X) |
| matmul_256x256x2048_i8_i8_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 6.700 (1.0X) | N/A | 1.965 (3.4X) |
| DeepLabV3_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 49.397 (1.0X) | N/A | 77.112 (0.6X) |
| DeepLabV3_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 50.985 (1.0X) | N/A | 77.727 (0.7X) |
| DeepLabV3_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 30.512 (1.0X) | N/A | 45.560 (0.7X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 93.074 (1.0X) | N/A | 21.366 (4.4X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 94.325 (1.0X) | N/A | 21.724 (4.3X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 52.791 (1.0X) | N/A | 21.643 (2.4X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 129.191 (1.0X) | N/A | 26.711 (4.8X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 143.189 (1.0X) | N/A | 29.058 (4.9X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 77.525 (1.0X) | N/A | 26.195 (3.0X) |
| MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 699.227 (1.0X) | N/A | 358.705 (1.9X) |
| MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 699.613 (1.0X) | N/A | 358.422 (2.0X) |
| MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 393.486 (1.0X) | N/A | 218.369 (1.8X) |
| MobileBertSquad_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 1047.240 (1.0X) | N/A | 256.905 (4.1X) |
| MobileBertSquad_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 1048.238 (1.0X) | N/A | 246.161 (4.3X) |
| MobileBertSquad_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 541.632 (1.0X) | N/A | 147.724 (3.7X) |
| Vit_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 2103.909 (1.0X) | N/A | 311.107 (6.8X) |
| Vit_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 2104.787 (1.0X) | N/A | 311.350 (6.8X) |
| Vit_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 1120.648 (1.0X) | N/A | 186.658 (6.0X) |
| matmul_1x256x2048_i8_i4_i32_tile_config_default(linalg) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 0.080 (1.0X) | N/A | 0.016 (5.1X) |
| matmul_1x256x2048_i8_i8_i32_tile_config_default(linalg) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 0.072 (1.0X) | N/A | 0.017 (4.3X) |
| matmul_256x256x2048_i8_i4_i32_tile_config_default(linalg) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 11.914 (1.0X) | N/A | 1.422 (8.4X) |
| matmul_256x256x2048_i8_i8_i32_tile_config_default(linalg) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 16.528 (1.0X) | N/A | 1.176 (14.1X) |

Improved Latencies 🎉

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
| --- | --- | --- | --- |
| GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu][default-flags,dt-uk] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 29.058 (vs. 31.632, 8.14%↓) | 29.300 | 0.930 |
| MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu][experimental-flags,no-dt] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 393.486 (vs. 426.254, 7.69%↓) | 395.033 | 9.597 |
| MobileBertSquad_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu][default-flags,dt-uk] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 246.161 (vs. 266.117, 7.50%↓) | 245.026 | 4.000 |

[Top 3 of 16 results shown]

Regressed Total Dispatch Sizes 🚩

| Benchmark Name | Total Dispatch Size (bytes) |
| --- | --- |
| MobileBertSquad_int8(tflite) [riscv_32-generic-linux_gnu-llvm_cpu][default-flags,compile-stats] | 2513324 (vs. 2089388, 20.29%↑) |
| MobileBertSquad_fp32(tflite) [riscv_64-generic-linux_gnu-llvm_cpu][default-flags,compile-stats] | 51112 (vs. 44632, 14.52%↑) |
| MobileBertSquad_int8(tflite) [riscv_64-generic-linux_gnu-llvm_cpu][default-flags,compile-stats] | 2295240 (vs. 2009416, 14.22%↑) |

Regressed Stream IR Dispatch Count (# of cmd.dispatch ops) 🚩

| Benchmark Name | Stream IR Dispatch Count (# of cmd.dispatch ops) |
| --- | --- |
| BertLargePTBatch1(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu][experimental-flags,no-dt,compile-stats] | 412 (vs. 388, 6.19%↑) |
| Falcon7bInt4GptqPT(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu][experimental-flags,no-dt,compile-stats] | 648 (vs. 616, 5.19%↑) |
| Falcon7bGptqPT(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu][experimental-flags,no-dt,compile-stats] | 780 (vs. 748, 4.28%↑) |

Improved Stream IR Dispatch Count (# of cmd.dispatch ops) 🎉

| Benchmark Name | Stream IR Dispatch Count (# of cmd.dispatch ops) |
| --- | --- |
| BertLargePTBatch1(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu][experimental-flags,dt-only,compile-stats] | 652 (vs. 724, 9.94%↓) |
| BertLargePTBatch1(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,dt-uk,compile-stats] | 652 (vs. 724, 9.94%↓) |

For more information: see the Source Workflow Run.

github-actions[bot] · May 02 '24 00:05

Interesting result... We don't see regressions on other backends because we don't track them in our CI. Perhaps we should check whether it regresses SDXL. Is @qedawkins the best person to check whether there are regressions in the SDXL model?

hanhanW · May 02 '24 00:05

The MI-250 benchmark seems to be a little slower, if anything.

before:

-----------------------------------------------------------------------------------------------------------
Benchmark                                                 Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------
BM_tokens_to_image/process_time/real_time              5776 ms         42.9 ms            1 items_per_second=0.173132/s
BM_tokens_to_image/process_time/real_time              5699 ms         36.6 ms            1 items_per_second=0.175465/s
BM_tokens_to_image/process_time/real_time              5673 ms         36.2 ms            1 items_per_second=0.176266/s
BM_tokens_to_image/process_time/real_time_mean         5716 ms         38.6 ms            3 items_per_second=0.174954/s
BM_tokens_to_image/process_time/real_time_median       5699 ms         36.6 ms            3 items_per_second=0.175465/s
BM_tokens_to_image/process_time/real_time_stddev       53.4 ms         3.74 ms            3 items_per_second=1.62797m/s
BM_tokens_to_image/process_time/real_time_cv           0.93 %          9.69 %             3 items_per_second=0.93%

after:

 -----------------------------------------------------------------------------------------------------------
Benchmark                                                 Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------
BM_tokens_to_image/process_time/real_time              5721 ms         37.8 ms            1 items_per_second=0.174806/s
BM_tokens_to_image/process_time/real_time              5721 ms         37.6 ms            1 items_per_second=0.174791/s
BM_tokens_to_image/process_time/real_time              5755 ms         37.6 ms            1 items_per_second=0.173758/s
BM_tokens_to_image/process_time/real_time_mean         5732 ms         37.7 ms            3 items_per_second=0.174452/s
BM_tokens_to_image/process_time/real_time_median       5721 ms         37.6 ms            3 items_per_second=0.174791/s
BM_tokens_to_image/process_time/real_time_stddev       19.8 ms        0.150 ms            3 items_per_second=601.121u/s
BM_tokens_to_image/process_time/real_time_cv           0.35 %          0.40 %             3 items_per_second=0.34%

Otherwise, the way to check performance right now is for one of us to run it.

Edit: I pasted them in the wrong order. Fixed now.

qedawkins · May 02 '24 04:05
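
For a rough sense of scale, plugging the reported means into a quick check (numbers copied from the runs above) puts the change at about +0.3% of the mean, which is smaller than the run-to-run spread of the baseline:

```python
# Mean/stddev of real_time over the three runs pasted above, in milliseconds.
before_mean_ms, before_stddev_ms = 5716.0, 53.4
after_mean_ms = 5732.0

rel_change = (after_mean_ms - before_mean_ms) / before_mean_ms
sigmas = (after_mean_ms - before_mean_ms) / before_stddev_ms
print(f"mean change: {rel_change:+.2%}")            # -> +0.28%
print(f"delta in baseline stddevs: {sigmas:.2f}")   # -> 0.30
```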

> Interesting result... We don't see regressions on other backends because we don't track them in our CI. Perhaps we should check whether it regresses SDXL. Is @qedawkins the best person to check whether there are regressions in the SDXL model?

MaheshRavishankar · May 02 '24 05:05

> The MI-250 benchmark seems to be a little slower, if anything. […] Otherwise, the way to check performance right now is for one of us to run it.

Is there a way to add dispatch count to the SDXL CI, @ScottTodd? Of all things, that is the most important for now.

MaheshRavishankar · May 02 '24 05:05

(Sorry, closed it by mistake; reopened it.)

MaheshRavishankar · May 02 '24 05:05

> Is there a way to add dispatch count to the SDXL CI, @ScottTodd? Of all things, that is the most important for now.

Yeah. We can adapt what's in https://github.com/iree-org/iree/blob/main/build_tools/benchmarks/collect_compilation_statistics.py, or try to plug SDXL into the existing in-tree benchmark suite. It will require a bit of planning.

ScottTodd · May 02 '24 15:05
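
Regarding the dispatch-count request, here is a minimal sketch of one way such a check could work, assuming `iree-compile --compile-to=stream` is available to emit Stream-phase IR (the flag spelling, any target flags a given build needs, and the overall approach are assumptions, not the CI's actual implementation):

```python
import re
import subprocess
import sys

def count_stream_dispatches(mlir_path, iree_compile="iree-compile", extra_flags=()):
    """Count stream.cmd.dispatch ops in the Stream-phase IR of a model.

    Assumes `iree-compile --compile-to=stream` prints Stream IR to stdout;
    any backend/target flags the build needs can be passed via extra_flags.
    """
    result = subprocess.run(
        [iree_compile, "--compile-to=stream", *list(extra_flags), mlir_path],
        check=True, capture_output=True, text=True)
    # "Stream IR Dispatch Count" above is the number of cmd.dispatch ops.
    return len(re.findall(r"\bstream\.cmd\.dispatch\b", result.stdout))

if __name__ == "__main__":
    print(count_stream_dispatches(sys.argv[1], extra_flags=sys.argv[2:]))
```

Comparing that count between a baseline build and a candidate build would surface regressions like the Stream IR dispatch-count changes flagged by the bot above.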