add specialized InList implementations for common scalar types
Closes #18824
@alamb maybe let's run benchmarks?
Here's what I'm seeing so far:
| Function & State | main (us) | specialized (us) | Change |
|---|---|---|---|
| in_list_utf8(5) (1024, 0) IN (1, 0) | 3.6205 | 4.8594 | Regressed (+34.239%) |
| in_list_utf8(10) (1024, 0) IN (1, 0) | 3.6363 | 5.1001 | Regressed (+40.312%) |
| in_list_utf8(20) (1024, 0) IN (1, 0) | 3.6707 | 4.9045 | Regressed (+33.600%) |
| in_list_f32 (1024, 0) IN (1, 0) | 3.5255 | 3.2288 | Improved (−8.2971%) |
| in_list_i32 (1024, 0) IN (1, 0) | 3.4923 | 3.6403 | Regressed (+3.8772%) |
| in_list_utf8(5) (1024, 0.2) IN (1, 0) | 4.9284 | 6.3583 | Regressed (+29.002%) |
| in_list_utf8(10) (1024, 0.2) IN (1, 0) | 4.9788 | 6.3273 | Regressed (+27.064%) |
| in_list_utf8(20) (1024, 0.2) IN (1, 0) | 5.1145 | 6.4905 | Regressed (+26.907%) |
| in_list_f32 (1024, 0.2) IN (1, 0) | 4.4894 | 4.2540 | Improved (−5.2512%) |
| in_list_i32 (1024, 0.2) IN (1, 0) | 4.4055 | 4.2869 | Improved (−2.7526%) |
| in_list_utf8(5) (1024, 0) IN (3, 0) | 3.5576 | 4.8733 | Regressed (+37.000%) |
| in_list_utf8(10) (1024, 0) IN (3, 0) | 3.4999 | 4.8737 | Regressed (+37.108%) |
| in_list_utf8(20) (1024, 0) IN (3, 0) | 3.4897 | 4.9039 | Regressed (+40.449%) |
| in_list_f32 (1024, 0) IN (3, 0) | 3.5114 | 3.2550 | Improved (−7.3270%) |
| in_list_i32 (1024, 0) IN (3, 0) | 3.5199 | 3.4879 | Noise Threshold (−0.9158%) |
| in_list_utf8(5) (1024, 0.2) IN (3, 0) | 5.0196 | 6.2232 | Regressed (+23.865%) |
| in_list_utf8(10) (1024, 0.2) IN (3, 0) | 5.0453 | 6.5280 | Regressed (+28.004%) |
| in_list_utf8(20) (1024, 0.2) IN (3, 0) | 5.1005 | 11.124 | Regressed (+118.13%) |
| in_list_f32 (1024, 0.2) IN (3, 0) | 4.5049 | 4.3126 | Improved (−4.6367%) |
| in_list_i32 (1024, 0.2) IN (3, 0) | 4.4026 | 4.4424 | Noise Threshold (+1.0449%) |
| in_list_utf8(5) (1024, 0) IN (10, 0) | 3.5275 | 4.9709 | Regressed (+40.969%) |
| in_list_utf8(10) (1024, 0) IN (10, 0) | 3.5613 | 4.9125 | Regressed (+37.386%) |
| in_list_utf8(20) (1024, 0) IN (10, 0) | 3.5589 | 4.8611 | Regressed (+36.578%) |
| in_list_f32 (1024, 0) IN (10, 0) | 3.5596 | 3.2017 | Improved (−10.003%) |
| in_list_i32 (1024, 0) IN (10, 0) | 3.4917 | 3.5190 | Noise Threshold (+1.0216%) |
| in_list_utf8(5) (1024, 0.2) IN (10, 0) | 5.0368 | 6.6162 | Regressed (+31.725%) |
| in_list_utf8(10) (1024, 0.2) IN (10, 0) | 5.0980 | 6.6660 | Regressed (+30.039%) |
| in_list_utf8(20) (1024, 0.2) IN (10, 0) | 5.2543 | 6.5350 | Regressed (+24.335%) |
| in_list_f32 (1024, 0.2) IN (10, 0) | 4.5609 | 4.3248 | Improved (−5.3746%) |
| in_list_i32 (1024, 0.2) IN (10, 0) | 4.4460 | 4.3543 | Improved (−2.7754%) |
| in_list_utf8(5) (1024, 0) IN (100, 0) | 3.6542 | 4.9992 | Regressed (+36.952%) |
| in_list_utf8(10) (1024, 0) IN (100, 0) | 3.6529 | 4.8772 | Regressed (+33.560%) |
| in_list_utf8(20) (1024, 0) IN (100, 0) | 3.6155 | 5.0250 | Regressed (+39.127%) |
| in_list_f32 (1024, 0) IN (100, 0) | 3.6029 | 3.2448 | Improved (−9.7603%) |
| in_list_i32 (1024, 0) IN (100, 0) | 3.6048 | 3.4770 | Improved (−3.5307%) |
| in_list_utf8(5) (1024, 0.2) IN (100, 0) | 5.3988 | 6.6108 | Regressed (+22.866%) |
| in_list_utf8(10) (1024, 0.2) IN (100, 0) | 5.4776 | 6.6567 | Regressed (+21.591%) |
| in_list_utf8(20) (1024, 0.2) IN (100, 0) | 5.3721 | 6.6470 | Regressed (+24.058%) |
| in_list_f32 (1024, 0.2) IN (100, 0) | 4.7693 | 4.3281 | Improved (−9.9814%) |
| in_list_i32 (1024, 0.2) IN (100, 0) | 4.5402 | 4.4076 | Improved (−3.5247%) |
I think we'd need to add benchmarks for other primitive types. And it's interesting that Utf8 regresses a lot across the board. I guess the benefits of vectorization / computing the entire hashes at once outweighs the dynamic dispatch?
Thinking about it the trick is probably to avoid the extra to_string()/to_vec() allocation and store just the hashes. Will look into that.
🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing specialize (1e4782f1cde36a83398002a4730ff96123b547a3) to 7d8b8602ad1be2f61f6a8ebb253ace9d85304ea7 diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete
🤖: Benchmark completed
Details
Comparing HEAD and specialize
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query ┃ HEAD ┃ specialize ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0 │ 2653.29 ms │ 2643.63 ms │ no change │
│ QQuery 1 │ 1306.60 ms │ 1281.59 ms │ no change │
│ QQuery 2 │ 2455.20 ms │ 2370.32 ms │ no change │
│ QQuery 3 │ 1156.08 ms │ 1114.79 ms │ no change │
│ QQuery 4 │ 2330.46 ms │ 2304.88 ms │ no change │
│ QQuery 5 │ 28126.46 ms │ 28010.99 ms │ no change │
│ QQuery 6 │ 4067.69 ms │ 4060.48 ms │ no change │
│ QQuery 7 │ 3638.82 ms │ 3851.00 ms │ 1.06x slower │
└──────────────┴─────────────┴─────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary ┃ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD) │ 45734.60ms │
│ Total Time (specialize) │ 45637.68ms │
│ Average Time (HEAD) │ 5716.82ms │
│ Average Time (specialize) │ 5704.71ms │
│ Queries Faster │ 0 │
│ Queries Slower │ 1 │
│ Queries with No Change │ 7 │
│ Queries with Failure │ 0 │
└───────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ HEAD ┃ specialize ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0 │ 2.51 ms │ 2.23 ms │ +1.13x faster │
│ QQuery 1 │ 51.64 ms │ 50.17 ms │ no change │
│ QQuery 2 │ 136.71 ms │ 138.99 ms │ no change │
│ QQuery 3 │ 164.21 ms │ 165.13 ms │ no change │
│ QQuery 4 │ 1070.98 ms │ 1088.91 ms │ no change │
│ QQuery 5 │ 1510.36 ms │ 1485.90 ms │ no change │
│ QQuery 6 │ 2.17 ms │ 2.20 ms │ no change │
│ QQuery 7 │ 55.51 ms │ 55.89 ms │ no change │
│ QQuery 8 │ 1426.35 ms │ 1446.26 ms │ no change │
│ QQuery 9 │ 1872.18 ms │ 1829.87 ms │ no change │
│ QQuery 10 │ 380.85 ms │ 367.57 ms │ no change │
│ QQuery 11 │ 458.71 ms │ 417.23 ms │ +1.10x faster │
│ QQuery 12 │ 1408.78 ms │ 1382.73 ms │ no change │
│ QQuery 13 │ 2081.78 ms │ 2148.16 ms │ no change │
│ QQuery 14 │ 1242.39 ms │ 1286.95 ms │ no change │
│ QQuery 15 │ 1216.37 ms │ 1262.77 ms │ no change │
│ QQuery 16 │ 2679.04 ms │ 2730.67 ms │ no change │
│ QQuery 17 │ 2657.52 ms │ 2666.04 ms │ no change │
│ QQuery 18 │ 5347.62 ms │ 4996.68 ms │ +1.07x faster │
│ QQuery 19 │ 129.19 ms │ 126.78 ms │ no change │
│ QQuery 20 │ 2020.18 ms │ 1974.03 ms │ no change │
│ QQuery 21 │ 2330.14 ms │ 2291.15 ms │ no change │
│ QQuery 22 │ 3990.08 ms │ 3921.02 ms │ no change │
│ QQuery 23 │ 12864.85 ms │ 12678.44 ms │ no change │
│ QQuery 24 │ 214.91 ms │ 197.85 ms │ +1.09x faster │
│ QQuery 25 │ 477.06 ms │ 465.64 ms │ no change │
│ QQuery 26 │ 221.72 ms │ 211.14 ms │ no change │
│ QQuery 27 │ 2818.62 ms │ 2800.08 ms │ no change │
│ QQuery 28 │ 23386.04 ms │ 24032.36 ms │ no change │
│ QQuery 29 │ 963.31 ms │ 964.57 ms │ no change │
│ QQuery 30 │ 1318.41 ms │ 1329.87 ms │ no change │
│ QQuery 31 │ 1378.81 ms │ 1352.16 ms │ no change │
│ QQuery 32 │ 4881.46 ms │ 5040.19 ms │ no change │
│ QQuery 33 │ 5905.32 ms │ 5981.60 ms │ no change │
│ QQuery 34 │ 6119.86 ms │ 6348.76 ms │ no change │
│ QQuery 35 │ 1942.66 ms │ 1904.86 ms │ no change │
│ QQuery 36 │ 120.95 ms │ 120.81 ms │ no change │
│ QQuery 37 │ 52.95 ms │ 53.06 ms │ no change │
│ QQuery 38 │ 121.94 ms │ 120.66 ms │ no change │
│ QQuery 39 │ 196.99 ms │ 198.57 ms │ no change │
│ QQuery 40 │ 43.90 ms │ 41.87 ms │ no change │
│ QQuery 41 │ 39.75 ms │ 37.46 ms │ +1.06x faster │
│ QQuery 42 │ 32.87 ms │ 31.87 ms │ no change │
└──────────────┴─────────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary ┃ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD) │ 95337.65ms │
│ Total Time (specialize) │ 95749.15ms │
│ Average Time (HEAD) │ 2217.15ms │
│ Average Time (specialize) │ 2226.72ms │
│ Queries Faster │ 5 │
│ Queries Slower │ 0 │
│ Queries with No Change │ 38 │
│ Queries with Failure │ 0 │
└───────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ HEAD ┃ specialize ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1 │ 139.55 ms │ 138.68 ms │ no change │
│ QQuery 2 │ 27.47 ms │ 28.88 ms │ 1.05x slower │
│ QQuery 3 │ 38.06 ms │ 37.65 ms │ no change │
│ QQuery 4 │ 29.40 ms │ 28.38 ms │ no change │
│ QQuery 5 │ 87.81 ms │ 86.86 ms │ no change │
│ QQuery 6 │ 19.52 ms │ 19.58 ms │ no change │
│ QQuery 7 │ 222.64 ms │ 227.15 ms │ no change │
│ QQuery 8 │ 34.19 ms │ 35.26 ms │ no change │
│ QQuery 9 │ 102.37 ms │ 105.70 ms │ no change │
│ QQuery 10 │ 62.63 ms │ 64.92 ms │ no change │
│ QQuery 11 │ 18.73 ms │ 18.59 ms │ no change │
│ QQuery 12 │ 51.03 ms │ 52.34 ms │ no change │
│ QQuery 13 │ 48.65 ms │ 48.66 ms │ no change │
│ QQuery 14 │ 13.94 ms │ 13.77 ms │ no change │
│ QQuery 15 │ 25.10 ms │ 24.91 ms │ no change │
│ QQuery 16 │ 25.28 ms │ 24.76 ms │ no change │
│ QQuery 17 │ 153.69 ms │ 150.64 ms │ no change │
│ QQuery 18 │ 277.48 ms │ 281.61 ms │ no change │
│ QQuery 19 │ 36.61 ms │ 36.42 ms │ no change │
│ QQuery 20 │ 50.79 ms │ 48.40 ms │ no change │
│ QQuery 21 │ 338.03 ms │ 318.12 ms │ +1.06x faster │
│ QQuery 22 │ 20.42 ms │ 17.44 ms │ +1.17x faster │
└──────────────┴───────────┴────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary ┃ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD) │ 1823.38ms │
│ Total Time (specialize) │ 1808.75ms │
│ Average Time (HEAD) │ 82.88ms │
│ Average Time (specialize) │ 82.22ms │
│ Queries Faster │ 2 │
│ Queries Slower │ 1 │
│ Queries with No Change │ 19 │
│ Queries with Failure │ 0 │
└───────────────────────────┴───────────┘
🤖 ./gh_compare_branch_bench.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing specialize (1e4782f1cde36a83398002a4730ff96123b547a3) to 7d8b8602ad1be2f61f6a8ebb253ace9d85304ea7 diff
BENCH_NAME=in_list
BENCH_COMMAND=cargo bench --bench in_list
BENCH_FILTER=
BENCH_BRANCH_NAME=specialize
Results will be posted here when complete
🤖: Benchmark completed
Details
group main specialize
----- ---- ----------
in_list_f32 (1024, 0) IN (1, 0) 1.16 4.2±0.01µs ? ?/sec 1.00 3.6±0.02µs ? ?/sec
in_list_f32 (1024, 0) IN (10, 0) 1.16 4.2±0.01µs ? ?/sec 1.00 3.6±0.02µs ? ?/sec
in_list_f32 (1024, 0) IN (100, 0) 1.16 4.3±0.01µs ? ?/sec 1.00 3.7±0.01µs ? ?/sec
in_list_f32 (1024, 0) IN (3, 0) 1.17 4.3±0.10µs ? ?/sec 1.00 3.6±0.01µs ? ?/sec
in_list_f32 (1024, 0.2) IN (1, 0) 1.00 5.1±0.02µs ? ?/sec 1.13 5.8±0.07µs ? ?/sec
in_list_f32 (1024, 0.2) IN (10, 0) 1.00 5.3±0.03µs ? ?/sec 1.09 5.7±0.02µs ? ?/sec
in_list_f32 (1024, 0.2) IN (100, 0) 1.00 5.3±0.04µs ? ?/sec 1.13 5.9±0.10µs ? ?/sec
in_list_f32 (1024, 0.2) IN (3, 0) 1.00 5.2±0.04µs ? ?/sec 1.12 5.8±0.05µs ? ?/sec
in_list_i32 (1024, 0) IN (1, 0) 1.00 4.3±0.01µs ? ?/sec 1.11 4.7±0.02µs ? ?/sec
in_list_i32 (1024, 0) IN (10, 0) 1.00 4.2±0.01µs ? ?/sec 1.11 4.7±0.04µs ? ?/sec
in_list_i32 (1024, 0) IN (100, 0) 1.00 4.3±0.01µs ? ?/sec 1.11 4.7±0.01µs ? ?/sec
in_list_i32 (1024, 0) IN (3, 0) 1.00 4.3±0.01µs ? ?/sec 1.11 4.7±0.07µs ? ?/sec
in_list_i32 (1024, 0.2) IN (1, 0) 1.00 5.2±0.02µs ? ?/sec 1.32 6.8±0.39µs ? ?/sec
in_list_i32 (1024, 0.2) IN (10, 0) 1.00 5.2±0.02µs ? ?/sec 1.25 6.5±0.03µs ? ?/sec
in_list_i32 (1024, 0.2) IN (100, 0) 1.00 5.2±0.03µs ? ?/sec 1.16 6.1±0.06µs ? ?/sec
in_list_i32 (1024, 0.2) IN (3, 0) 1.00 5.2±0.02µs ? ?/sec 1.30 6.8±0.03µs ? ?/sec
in_list_utf8(10) (1024, 0) IN (1, 0) 1.16 4.5±0.02µs ? ?/sec 1.00 3.9±0.01µs ? ?/sec
in_list_utf8(10) (1024, 0) IN (10, 0) 1.16 4.5±0.01µs ? ?/sec 1.00 3.9±0.01µs ? ?/sec
in_list_utf8(10) (1024, 0) IN (100, 0) 1.16 4.5±0.03µs ? ?/sec 1.00 3.9±0.01µs ? ?/sec
in_list_utf8(10) (1024, 0) IN (3, 0) 1.16 4.5±0.04µs ? ?/sec 1.00 3.9±0.01µs ? ?/sec
in_list_utf8(10) (1024, 0.2) IN (1, 0) 1.06 5.9±0.03µs ? ?/sec 1.00 5.6±0.04µs ? ?/sec
in_list_utf8(10) (1024, 0.2) IN (10, 0) 1.09 6.0±0.06µs ? ?/sec 1.00 5.5±0.03µs ? ?/sec
in_list_utf8(10) (1024, 0.2) IN (100, 0) 1.07 6.0±0.03µs ? ?/sec 1.00 5.6±0.03µs ? ?/sec
in_list_utf8(10) (1024, 0.2) IN (3, 0) 1.10 6.1±0.04µs ? ?/sec 1.00 5.6±0.03µs ? ?/sec
in_list_utf8(20) (1024, 0) IN (1, 0) 1.16 4.5±0.02µs ? ?/sec 1.00 3.9±0.01µs ? ?/sec
in_list_utf8(20) (1024, 0) IN (10, 0) 1.16 4.5±0.01µs ? ?/sec 1.00 3.9±0.01µs ? ?/sec
in_list_utf8(20) (1024, 0) IN (100, 0) 1.15 4.5±0.04µs ? ?/sec 1.00 3.9±0.03µs ? ?/sec
in_list_utf8(20) (1024, 0) IN (3, 0) 1.14 4.6±0.01µs ? ?/sec 1.00 4.0±0.03µs ? ?/sec
in_list_utf8(20) (1024, 0.2) IN (1, 0) 1.06 6.0±0.02µs ? ?/sec 1.00 5.7±0.03µs ? ?/sec
in_list_utf8(20) (1024, 0.2) IN (10, 0) 1.07 6.0±0.03µs ? ?/sec 1.00 5.6±0.03µs ? ?/sec
in_list_utf8(20) (1024, 0.2) IN (100, 0) 1.09 6.1±0.03µs ? ?/sec 1.00 5.6±0.02µs ? ?/sec
in_list_utf8(20) (1024, 0.2) IN (3, 0) 1.04 6.1±0.03µs ? ?/sec 1.00 5.8±0.05µs ? ?/sec
in_list_utf8(5) (1024, 0) IN (1, 0) 1.15 4.5±0.01µs ? ?/sec 1.00 4.0±0.01µs ? ?/sec
in_list_utf8(5) (1024, 0) IN (10, 0) 1.16 4.5±0.03µs ? ?/sec 1.00 3.9±0.01µs ? ?/sec
in_list_utf8(5) (1024, 0) IN (100, 0) 1.16 4.5±0.01µs ? ?/sec 1.00 3.9±0.00µs ? ?/sec
in_list_utf8(5) (1024, 0) IN (3, 0) 1.14 4.5±0.03µs ? ?/sec 1.00 4.0±0.02µs ? ?/sec
in_list_utf8(5) (1024, 0.2) IN (1, 0) 1.06 5.9±0.04µs ? ?/sec 1.00 5.5±0.04µs ? ?/sec
in_list_utf8(5) (1024, 0.2) IN (10, 0) 1.10 6.1±0.04µs ? ?/sec 1.00 5.5±0.03µs ? ?/sec
in_list_utf8(5) (1024, 0.2) IN (100, 0) 1.11 6.0±0.03µs ? ?/sec 1.00 5.4±0.03µs ? ?/sec
in_list_utf8(5) (1024, 0.2) IN (3, 0) 1.11 6.1±0.06µs ? ?/sec 1.00 5.5±0.04µs ? ?/sec
🤖: Benchmark completed
Details
group main specialize ----- ---- ---------- in_list_f32 (1024, 0) IN (1, 0) 1.16 4.2±0.01µs ? ?/sec 1.00 3.6±0.02µs ? ?/sec in_list_f32 (1024, 0) IN (10, 0) 1.16 4.2±0.01µs ? ?/sec 1.00 3.6±0.02µs ? ?/sec in_list_f32 (1024, 0) IN (100, 0) 1.16 4.3±0.01µs ? ?/sec 1.00 3.7±0.01µs ? ?/sec in_list_f32 (1024, 0) IN (3, 0) 1.17 4.3±0.10µs ? ?/sec 1.00 3.6±0.01µs ? ?/sec in_list_f32 (1024, 0.2) IN (1, 0) 1.00 5.1±0.02µs ? ?/sec 1.13 5.8±0.07µs ? ?/sec in_list_f32 (1024, 0.2) IN (10, 0) 1.00 5.3±0.03µs ? ?/sec 1.09 5.7±0.02µs ? ?/sec in_list_f32 (1024, 0.2) IN (100, 0) 1.00 5.3±0.04µs ? ?/sec 1.13 5.9±0.10µs ? ?/sec in_list_f32 (1024, 0.2) IN (3, 0) 1.00 5.2±0.04µs ? ?/sec 1.12 5.8±0.05µs ? ?/sec in_list_i32 (1024, 0) IN (1, 0) 1.00 4.3±0.01µs ? ?/sec 1.11 4.7±0.02µs ? ?/sec in_list_i32 (1024, 0) IN (10, 0) 1.00 4.2±0.01µs ? ?/sec 1.11 4.7±0.04µs ? ?/sec in_list_i32 (1024, 0) IN (100, 0) 1.00 4.3±0.01µs ? ?/sec 1.11 4.7±0.01µs ? ?/sec in_list_i32 (1024, 0) IN (3, 0) 1.00 4.3±0.01µs ? ?/sec 1.11 4.7±0.07µs ? ?/sec in_list_i32 (1024, 0.2) IN (1, 0) 1.00 5.2±0.02µs ? ?/sec 1.32 6.8±0.39µs ? ?/sec in_list_i32 (1024, 0.2) IN (10, 0) 1.00 5.2±0.02µs ? ?/sec 1.25 6.5±0.03µs ? ?/sec in_list_i32 (1024, 0.2) IN (100, 0) 1.00 5.2±0.03µs ? ?/sec 1.16 6.1±0.06µs ? ?/sec in_list_i32 (1024, 0.2) IN (3, 0) 1.00 5.2±0.02µs ? ?/sec 1.30 6.8±0.03µs ? ?/sec in_list_utf8(10) (1024, 0) IN (1, 0) 1.16 4.5±0.02µs ? ?/sec 1.00 3.9±0.01µs ? ?/sec in_list_utf8(10) (1024, 0) IN (10, 0) 1.16 4.5±0.01µs ? ?/sec 1.00 3.9±0.01µs ? ?/sec in_list_utf8(10) (1024, 0) IN (100, 0) 1.16 4.5±0.03µs ? ?/sec 1.00 3.9±0.01µs ? ?/sec in_list_utf8(10) (1024, 0) IN (3, 0) 1.16 4.5±0.04µs ? ?/sec 1.00 3.9±0.01µs ? ?/sec in_list_utf8(10) (1024, 0.2) IN (1, 0) 1.06 5.9±0.03µs ? ?/sec 1.00 5.6±0.04µs ? ?/sec in_list_utf8(10) (1024, 0.2) IN (10, 0) 1.09 6.0±0.06µs ? ?/sec 1.00 5.5±0.03µs ? ?/sec in_list_utf8(10) (1024, 0.2) IN (100, 0) 1.07 6.0±0.03µs ? ?/sec 1.00 5.6±0.03µs ? ?/sec in_list_utf8(10) (1024, 0.2) IN (3, 0) 1.10 6.1±0.04µs ? ?/sec 1.00 5.6±0.03µs ? ?/sec in_list_utf8(20) (1024, 0) IN (1, 0) 1.16 4.5±0.02µs ? ?/sec 1.00 3.9±0.01µs ? ?/sec in_list_utf8(20) (1024, 0) IN (10, 0) 1.16 4.5±0.01µs ? ?/sec 1.00 3.9±0.01µs ? ?/sec in_list_utf8(20) (1024, 0) IN (100, 0) 1.15 4.5±0.04µs ? ?/sec 1.00 3.9±0.03µs ? ?/sec in_list_utf8(20) (1024, 0) IN (3, 0) 1.14 4.6±0.01µs ? ?/sec 1.00 4.0±0.03µs ? ?/sec in_list_utf8(20) (1024, 0.2) IN (1, 0) 1.06 6.0±0.02µs ? ?/sec 1.00 5.7±0.03µs ? ?/sec in_list_utf8(20) (1024, 0.2) IN (10, 0) 1.07 6.0±0.03µs ? ?/sec 1.00 5.6±0.03µs ? ?/sec in_list_utf8(20) (1024, 0.2) IN (100, 0) 1.09 6.1±0.03µs ? ?/sec 1.00 5.6±0.02µs ? ?/sec in_list_utf8(20) (1024, 0.2) IN (3, 0) 1.04 6.1±0.03µs ? ?/sec 1.00 5.8±0.05µs ? ?/sec in_list_utf8(5) (1024, 0) IN (1, 0) 1.15 4.5±0.01µs ? ?/sec 1.00 4.0±0.01µs ? ?/sec in_list_utf8(5) (1024, 0) IN (10, 0) 1.16 4.5±0.03µs ? ?/sec 1.00 3.9±0.01µs ? ?/sec in_list_utf8(5) (1024, 0) IN (100, 0) 1.16 4.5±0.01µs ? ?/sec 1.00 3.9±0.00µs ? ?/sec in_list_utf8(5) (1024, 0) IN (3, 0) 1.14 4.5±0.03µs ? ?/sec 1.00 4.0±0.02µs ? ?/sec in_list_utf8(5) (1024, 0.2) IN (1, 0) 1.06 5.9±0.04µs ? ?/sec 1.00 5.5±0.04µs ? ?/sec in_list_utf8(5) (1024, 0.2) IN (10, 0) 1.10 6.1±0.04µs ? ?/sec 1.00 5.5±0.03µs ? ?/sec in_list_utf8(5) (1024, 0.2) IN (100, 0) 1.11 6.0±0.03µs ? ?/sec 1.00 5.4±0.03µs ? ?/sec in_list_utf8(5) (1024, 0.2) IN (3, 0) 1.11 6.1±0.06µs ? ?/sec 1.00 5.5±0.04µs ? ?/sec
Results seem a bit mixed?
Yes. Slowdowns for i32 are concerning. I won’t merge this until it’s all speedups or neutral. I may also make a support PR to add more benchmarks for other types so we can make better comparisons.
I am hoping to look at this PR today or tomorrow and help push it along
I realized this new implementation differs from the existing one in ways which fix some bugs. I opened https://github.com/apache/datafusion/pull/19050 to demonstrate those bugs. So maybe lets first fix the bugs and merge the new tests, then we can continue here?
General comment on the benchmark but... am I reading them wrong, or is the null_percent input logic inverted?
fn do_benches(
c: &mut Criterion,
array_length: usize,
in_list_length: usize,
null_percent: f64,
) {
...
let values: Int32Array = (0..array_length)
.map(|_| rng.random_bool(null_percent).then(|| rng.random()))
.collect();
let in_list: Vec<_> = (0..in_list_length)
.map(|_| ScalarValue::Int32(Some(rng.random())))
.collect();
do_bench(
c,
&format!("in_list_i32 ({array_length}, {null_percent}) IN ({in_list_length}, 0)"),
Arc::new(values),
&in_list,
)
When null_percent is 0, won't this just create a array_length long values array of NULLs?
I think it should rather be something like:
fn do_benches(
c: &mut Criterion,
array_length: usize,
in_list_length: usize,
null_percent: f64,
) {
let non_null_percent = 1.0 - null_percent;
...
let values: Int32Array = (0..array_length)
.map(|_| rng.random_bool(non_null_percent).then(|| rng.random()))
.collect();
let in_list: Vec<_> = (0..in_list_length)
.map(|_| ScalarValue::Int32(Some(rng.random())))
.collect();
do_bench(
c,
&format!("in_list_i32 ({array_length}, {null_percent}) IN ({in_list_length}, 0)"),
Arc::new(values),
&in_list,
)
EDIT: I opened https://github.com/apache/datafusion/pull/19204 to fix this.
Another general comment, on the implementation this time: hashing seems overkill and probably overly expensive for small simple type lists.
@adriangb have you considered sorting the InList and doing a binary search?
For small lists (threshold to be refined...) this could be orders of magnitude faster that the overhead of hashing. To validate with the new fixed benchmarks of course!
Another general comment, on the implementation this time: hashing seems overkill and probably overly expensive for small simple type lists.
@adriangb have you considered sorting the
InListand doing a binary search?For small lists (threshold to be refined...) this could be orders of magnitude faster that the overhead of hashing. To validate with the new fixed benchmarks of course!
I think that's a really neat idea! I did look into using binary search at some point (not this PR) but then realized that there was already the hashing in place and didn't pursue it further. So while I agree it could be interesting to have two approaches, I think we should focus on improving the current approach right now especially since we probably degraded performance since last release. Maybe let's file a ticket to track benchmarking that approach?
Another general comment, on the implementation this time: hashing seems overkill and probably overly expensive for small simple type lists. @adriangb have you considered sorting the
InListand doing a binary search? For small lists (threshold to be refined...) this could be orders of magnitude faster that the overhead of hashing. To validate with the new fixed benchmarks of course!I think that's a really neat idea! I did look into using binary search at some point (not this PR) but then realized that there was already the hashing in place and didn't pursue it further. So while I agree it could be interesting to have two approaches, I think we should focus on improving the current approach right now especially since we probably degraded performance since last release. Maybe let's file a ticket to track benchmarking that approach?
This was too much fun to skip the opportunity! I've opened a draft PR on your branch that shows pretty cool results: https://github.com/pydantic/datafusion/pull/46
run benchmarks in_list
🤖 Hi @adriangb, thanks for the request (https://github.com/apache/datafusion/pull/18832#issuecomment-3634442805).
scrape_comments.py only supports whitelisted benchmarks.
- Standard: clickbench_1, clickbench_extended, clickbench_partitioned, clickbench_pushdown, tpch, tpch10, tpch_mem, tpch_mem10
- Criterion: aggregate_query_sql, aggregate_vectorized, case_when, in_list, sql_planner
Please choose one or more of these with run benchmark <name> or run benchmark <name1> <name2>...
~ hey real @alamb could you whitelist in_list for me (might as well do Daniel and Geoffrey at the same time) ~ nvm user error
run benchmark in_list
🤖 ./gh_compare_branch_bench.sh compare_branch_bench.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing specialize (d11d7ae6d79acc1cc4e815fd01e37bce6afae0ad) to e8a0829b7a2044ab477784d60ed2d208b97f5042 diff
BENCH_NAME=in_list
BENCH_COMMAND=cargo bench --all-features --bench in_list
BENCH_FILTER=
BENCH_BRANCH_NAME=specialize
Results will be posted here when complete
~ hey real @alamb could you whitelist
in_listfor me (might as well do Daniel and Geoffrey at the same time) ~ nvm user error
@Dandandan is already on the list https://github.com/alamb/datafusion-benchmarking/blob/4fb120785fa66ecbf40a45d8a5d0d5f4be17266a/scripts/scrape_comments.py#L41
I can add @geoffreyclaude if he would like
🤖 Hi @adriangb, thanks for the request (https://github.com/apache/datafusion/pull/18832#issuecomment-3634468395).
scrape_comments.py only supports whitelisted benchmarks.
- Standard: clickbench_1, clickbench_extended, clickbench_partitioned, clickbench_pushdown, tpch, tpch10, tpch_mem, tpch_mem10
- Criterion: aggregate_query_sql, aggregate_vectorized, case_when, in_list, sql_planner
Please choose one or more of these with run benchmark <name> or run benchmark <name1> <name2>...
run benchmarks in_list
🤖 Hi @alamb, thanks for the request (https://github.com/apache/datafusion/pull/18832#issuecomment-3634537320).
scrape_comments.py only supports whitelisted benchmarks.
- Standard: clickbench_1, clickbench_extended, clickbench_partitioned, clickbench_pushdown, tpch, tpch10, tpch_mem, tpch_mem10
- Criterion: aggregate_query_sql, aggregate_vectorized, case_when, in_list, sql_planner
Please choose one or more of these with run benchmark <name> or run benchmark <name1> <name2>...
The key is benchmark vs benchmarks fyi. I can make a PR to accept both tomorrow.
(I fixed the script to support
The key is
benchmarkvsbenchmarksfyi. I can make a PR to accept both tomorrow.
Yeah, I thought I could bash it out quickly but sadly it took more than 30 seconds
🤖: Benchmark completed
Details
group main specialize
----- ---- ----------
in_list/Float32/list=100/nulls=0% 1.01 307.1±8.23µs ? ?/sec 1.00 302.7±1.87µs ? ?/sec
in_list/Float32/list=100/nulls=20% 1.04 391.1±2.13µs ? ?/sec 1.00 374.4±4.60µs ? ?/sec
in_list/Float32/list=3/nulls=0% 1.03 14.6±0.17µs ? ?/sec 1.00 14.2±0.11µs ? ?/sec
in_list/Float32/list=3/nulls=20% 1.00 19.7±0.15µs ? ?/sec 1.00 19.7±0.13µs ? ?/sec
in_list/Float32/list=8/nulls=0% 1.00 29.1±0.36µs ? ?/sec 1.00 29.2±1.14µs ? ?/sec
in_list/Float32/list=8/nulls=20% 1.07 39.4±0.27µs ? ?/sec 1.00 36.8±0.47µs ? ?/sec
in_list/Int32/list=100/nulls=0% 1.00 5.9±0.13µs ? ?/sec 1.06 6.2±0.04µs ? ?/sec
in_list/Int32/list=100/nulls=20% 1.06 8.4±0.08µs ? ?/sec 1.00 7.9±0.11µs ? ?/sec
in_list/Int32/list=3/nulls=0% 1.00 5.6±0.06µs ? ?/sec 1.21 6.8±0.04µs ? ?/sec
in_list/Int32/list=3/nulls=20% 1.05 8.7±0.07µs ? ?/sec 1.00 8.3±0.11µs ? ?/sec
in_list/Int32/list=8/nulls=0% 1.00 6.0±0.04µs ? ?/sec 1.09 6.5±0.16µs ? ?/sec
in_list/Int32/list=8/nulls=20% 1.04 8.4±0.10µs ? ?/sec 1.00 8.1±0.04µs ? ?/sec
in_list/Utf8/list=100/nulls=0%/str=100 1.00 645.1±11.45µs ? ?/sec 1.00 645.7±12.49µs ? ?/sec
in_list/Utf8/list=100/nulls=0%/str=12 1.00 667.7±6.12µs ? ?/sec 1.03 685.0±3.93µs ? ?/sec
in_list/Utf8/list=100/nulls=0%/str=3 1.00 669.3±4.02µs ? ?/sec 1.06 707.1±7.55µs ? ?/sec
in_list/Utf8/list=100/nulls=20%/str=100 1.00 594.9±3.67µs ? ?/sec 1.02 604.8±12.27µs ? ?/sec
in_list/Utf8/list=100/nulls=20%/str=12 1.00 613.4±6.72µs ? ?/sec 1.02 627.5±5.92µs ? ?/sec
in_list/Utf8/list=100/nulls=20%/str=3 1.00 615.7±2.78µs ? ?/sec 1.02 627.1±3.23µs ? ?/sec
in_list/Utf8/list=3/nulls=0%/str=100 1.01 27.3±0.97µs ? ?/sec 1.00 27.0±1.15µs ? ?/sec
in_list/Utf8/list=3/nulls=0%/str=12 1.06 27.4±0.34µs ? ?/sec 1.00 25.9±0.40µs ? ?/sec
in_list/Utf8/list=3/nulls=0%/str=3 1.00 28.3±1.19µs ? ?/sec 1.04 29.4±3.04µs ? ?/sec
in_list/Utf8/list=3/nulls=20%/str=100 1.02 26.2±0.26µs ? ?/sec 1.00 25.8±0.51µs ? ?/sec
in_list/Utf8/list=3/nulls=20%/str=12 1.00 27.1±0.10µs ? ?/sec 1.05 28.5±1.12µs ? ?/sec
in_list/Utf8/list=3/nulls=20%/str=3 1.04 27.7±0.75µs ? ?/sec 1.00 26.5±0.55µs ? ?/sec
in_list/Utf8/list=8/nulls=0%/str=100 1.00 56.7±1.89µs ? ?/sec 1.00 56.4±0.99µs ? ?/sec
in_list/Utf8/list=8/nulls=0%/str=12 1.00 57.8±2.60µs ? ?/sec 1.02 59.2±0.56µs ? ?/sec
in_list/Utf8/list=8/nulls=0%/str=3 1.00 58.7±0.24µs ? ?/sec 1.09 64.3±6.59µs ? ?/sec
in_list/Utf8/list=8/nulls=20%/str=100 1.03 53.4±1.32µs ? ?/sec 1.00 51.9±0.10µs ? ?/sec
in_list/Utf8/list=8/nulls=20%/str=12 1.00 54.2±1.31µs ? ?/sec 1.00 54.5±0.69µs ? ?/sec
in_list/Utf8/list=8/nulls=20%/str=3 1.01 54.6±0.20µs ? ?/sec 1.00 53.8±0.53µs ? ?/sec
in_list/Utf8View/list=100/nulls=0%/str=100 1.00 597.3±9.22µs ? ?/sec 1.01 603.2±10.62µs ? ?/sec
in_list/Utf8View/list=100/nulls=0%/str=12 1.00 581.4±24.63µs ? ?/sec 1.00 581.1±13.15µs ? ?/sec
in_list/Utf8View/list=100/nulls=0%/str=3 1.01 582.3±12.05µs ? ?/sec 1.00 578.6±5.55µs ? ?/sec
in_list/Utf8View/list=100/nulls=20%/str=100 1.00 503.3±5.49µs ? ?/sec 1.02 515.1±27.03µs ? ?/sec
in_list/Utf8View/list=100/nulls=20%/str=12 1.00 510.9±6.95µs ? ?/sec 1.01 515.3±6.38µs ? ?/sec
in_list/Utf8View/list=100/nulls=20%/str=3 1.00 515.1±7.18µs ? ?/sec 1.01 522.1±13.63µs ? ?/sec
in_list/Utf8View/list=3/nulls=0%/str=100 1.05 22.2±0.28µs ? ?/sec 1.00 21.1±0.24µs ? ?/sec
in_list/Utf8View/list=3/nulls=0%/str=12 1.05 23.3±0.62µs ? ?/sec 1.00 22.3±0.30µs ? ?/sec
in_list/Utf8View/list=3/nulls=0%/str=3 1.04 23.3±0.18µs ? ?/sec 1.00 22.4±0.82µs ? ?/sec
in_list/Utf8View/list=3/nulls=20%/str=100 1.00 22.8±0.22µs ? ?/sec 1.02 23.2±0.24µs ? ?/sec
in_list/Utf8View/list=3/nulls=20%/str=12 1.00 23.4±0.18µs ? ?/sec 1.03 24.2±0.49µs ? ?/sec
in_list/Utf8View/list=3/nulls=20%/str=3 1.00 23.7±0.37µs ? ?/sec 1.02 24.2±1.51µs ? ?/sec
in_list/Utf8View/list=8/nulls=0%/str=100 1.05 49.6±3.83µs ? ?/sec 1.00 47.2±0.83µs ? ?/sec
in_list/Utf8View/list=8/nulls=0%/str=12 1.00 50.5±0.65µs ? ?/sec 1.18 59.7±1.62µs ? ?/sec
in_list/Utf8View/list=8/nulls=0%/str=3 1.00 50.6±0.72µs ? ?/sec 1.17 59.5±0.70µs ? ?/sec
in_list/Utf8View/list=8/nulls=20%/str=100 1.00 45.4±2.06µs ? ?/sec 1.06 48.2±1.59µs ? ?/sec
in_list/Utf8View/list=8/nulls=20%/str=12 1.01 47.3±0.67µs ? ?/sec 1.00 46.8±0.83µs ? ?/sec
in_list/Utf8View/list=8/nulls=20%/str=3 1.01 49.1±0.36µs ? ?/sec 1.00 48.6±1.29µs ? ?/sec
@Dandandan is already on the list https://github.com/alamb/datafusion-benchmarking/blob/4fb120785fa66ecbf40a45d8a5d0d5f4be17266a/scripts/scrape_comments.py#L41
I can add @geoffreyclaude if he would like
@alamb yes please :) Especially for when/if I follow through with the additional perf improvements.
run benchmarks in_list
🤖 Hi @adriangb, thanks for the request (https://github.com/apache/datafusion/pull/18832#issuecomment-3637437969).
scrape_comments.py only supports whitelisted benchmarks.
- Standard: clickbench_1, clickbench_extended, clickbench_partitioned, clickbench_pushdown, tpch, tpch10, tpch_mem, tpch_mem10
- Criterion: aggregate_query_sql, aggregate_vectorized, case_when, in_list, sql_planner
Please choose one or more of these with run benchmark <name> or run benchmark <name1> <name2>...