datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

add specialized InList implementations for common scalar types

Open adriangb opened this issue 1 month ago • 45 comments

Closes #18824

adriangb avatar Nov 20 '25 00:11 adriangb

@alamb maybe let's run benchmarks?

adriangb avatar Nov 20 '25 00:11 adriangb

Here's what I'm seeing so far:

Function & State main (us) specialized (us) Change
in_list_utf8(5) (1024, 0) IN (1, 0) 3.6205 4.8594 Regressed (+34.239%)
in_list_utf8(10) (1024, 0) IN (1, 0) 3.6363 5.1001 Regressed (+40.312%)
in_list_utf8(20) (1024, 0) IN (1, 0) 3.6707 4.9045 Regressed (+33.600%)
in_list_f32 (1024, 0) IN (1, 0) 3.5255 3.2288 Improved (−8.2971%)
in_list_i32 (1024, 0) IN (1, 0) 3.4923 3.6403 Regressed (+3.8772%)
in_list_utf8(5) (1024, 0.2) IN (1, 0) 4.9284 6.3583 Regressed (+29.002%)
in_list_utf8(10) (1024, 0.2) IN (1, 0) 4.9788 6.3273 Regressed (+27.064%)
in_list_utf8(20) (1024, 0.2) IN (1, 0) 5.1145 6.4905 Regressed (+26.907%)
in_list_f32 (1024, 0.2) IN (1, 0) 4.4894 4.2540 Improved (−5.2512%)
in_list_i32 (1024, 0.2) IN (1, 0) 4.4055 4.2869 Improved (−2.7526%)
in_list_utf8(5) (1024, 0) IN (3, 0) 3.5576 4.8733 Regressed (+37.000%)
in_list_utf8(10) (1024, 0) IN (3, 0) 3.4999 4.8737 Regressed (+37.108%)
in_list_utf8(20) (1024, 0) IN (3, 0) 3.4897 4.9039 Regressed (+40.449%)
in_list_f32 (1024, 0) IN (3, 0) 3.5114 3.2550 Improved (−7.3270%)
in_list_i32 (1024, 0) IN (3, 0) 3.5199 3.4879 Noise Threshold (−0.9158%)
in_list_utf8(5) (1024, 0.2) IN (3, 0) 5.0196 6.2232 Regressed (+23.865%)
in_list_utf8(10) (1024, 0.2) IN (3, 0) 5.0453 6.5280 Regressed (+28.004%)
in_list_utf8(20) (1024, 0.2) IN (3, 0) 5.1005 11.124 Regressed (+118.13%)
in_list_f32 (1024, 0.2) IN (3, 0) 4.5049 4.3126 Improved (−4.6367%)
in_list_i32 (1024, 0.2) IN (3, 0) 4.4026 4.4424 Noise Threshold (+1.0449%)
in_list_utf8(5) (1024, 0) IN (10, 0) 3.5275 4.9709 Regressed (+40.969%)
in_list_utf8(10) (1024, 0) IN (10, 0) 3.5613 4.9125 Regressed (+37.386%)
in_list_utf8(20) (1024, 0) IN (10, 0) 3.5589 4.8611 Regressed (+36.578%)
in_list_f32 (1024, 0) IN (10, 0) 3.5596 3.2017 Improved (−10.003%)
in_list_i32 (1024, 0) IN (10, 0) 3.4917 3.5190 Noise Threshold (+1.0216%)
in_list_utf8(5) (1024, 0.2) IN (10, 0) 5.0368 6.6162 Regressed (+31.725%)
in_list_utf8(10) (1024, 0.2) IN (10, 0) 5.0980 6.6660 Regressed (+30.039%)
in_list_utf8(20) (1024, 0.2) IN (10, 0) 5.2543 6.5350 Regressed (+24.335%)
in_list_f32 (1024, 0.2) IN (10, 0) 4.5609 4.3248 Improved (−5.3746%)
in_list_i32 (1024, 0.2) IN (10, 0) 4.4460 4.3543 Improved (−2.7754%)
in_list_utf8(5) (1024, 0) IN (100, 0) 3.6542 4.9992 Regressed (+36.952%)
in_list_utf8(10) (1024, 0) IN (100, 0) 3.6529 4.8772 Regressed (+33.560%)
in_list_utf8(20) (1024, 0) IN (100, 0) 3.6155 5.0250 Regressed (+39.127%)
in_list_f32 (1024, 0) IN (100, 0) 3.6029 3.2448 Improved (−9.7603%)
in_list_i32 (1024, 0) IN (100, 0) 3.6048 3.4770 Improved (−3.5307%)
in_list_utf8(5) (1024, 0.2) IN (100, 0) 5.3988 6.6108 Regressed (+22.866%)
in_list_utf8(10) (1024, 0.2) IN (100, 0) 5.4776 6.6567 Regressed (+21.591%)
in_list_utf8(20) (1024, 0.2) IN (100, 0) 5.3721 6.6470 Regressed (+24.058%)
in_list_f32 (1024, 0.2) IN (100, 0) 4.7693 4.3281 Improved (−9.9814%)
in_list_i32 (1024, 0.2) IN (100, 0) 4.5402 4.4076 Improved (−3.5247%)

I think we'd need to add benchmarks for other primitive types. And it's interesting that Utf8 regresses a lot across the board. I guess the benefits of vectorization / computing the entire hashes at once outweighs the dynamic dispatch?

adriangb avatar Nov 20 '25 07:11 adriangb

Thinking about it the trick is probably to avoid the extra to_string()/to_vec() allocation and store just the hashes. Will look into that.

adriangb avatar Nov 20 '25 08:11 adriangb

🤖 ./gh_compare_branch.sh Benchmark Script Running Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux Comparing specialize (1e4782f1cde36a83398002a4730ff96123b547a3) to 7d8b8602ad1be2f61f6a8ebb253ace9d85304ea7 diff using: tpch_mem clickbench_partitioned clickbench_extended Results will be posted here when complete

alamb avatar Nov 22 '25 10:11 alamb

🤖: Benchmark completed

Details

Comparing HEAD and specialize
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃  specialize ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0     │  2653.29 ms │  2643.63 ms │    no change │
│ QQuery 1     │  1306.60 ms │  1281.59 ms │    no change │
│ QQuery 2     │  2455.20 ms │  2370.32 ms │    no change │
│ QQuery 3     │  1156.08 ms │  1114.79 ms │    no change │
│ QQuery 4     │  2330.46 ms │  2304.88 ms │    no change │
│ QQuery 5     │ 28126.46 ms │ 28010.99 ms │    no change │
│ QQuery 6     │  4067.69 ms │  4060.48 ms │    no change │
│ QQuery 7     │  3638.82 ms │  3851.00 ms │ 1.06x slower │
└──────────────┴─────────────┴─────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary         ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)         │ 45734.60ms │
│ Total Time (specialize)   │ 45637.68ms │
│ Average Time (HEAD)       │  5716.82ms │
│ Average Time (specialize) │  5704.71ms │
│ Queries Faster            │          0 │
│ Queries Slower            │          1 │
│ Queries with No Change    │          7 │
│ Queries with Failure      │          0 │
└───────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃  specialize ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.51 ms │     2.23 ms │ +1.13x faster │
│ QQuery 1     │    51.64 ms │    50.17 ms │     no change │
│ QQuery 2     │   136.71 ms │   138.99 ms │     no change │
│ QQuery 3     │   164.21 ms │   165.13 ms │     no change │
│ QQuery 4     │  1070.98 ms │  1088.91 ms │     no change │
│ QQuery 5     │  1510.36 ms │  1485.90 ms │     no change │
│ QQuery 6     │     2.17 ms │     2.20 ms │     no change │
│ QQuery 7     │    55.51 ms │    55.89 ms │     no change │
│ QQuery 8     │  1426.35 ms │  1446.26 ms │     no change │
│ QQuery 9     │  1872.18 ms │  1829.87 ms │     no change │
│ QQuery 10    │   380.85 ms │   367.57 ms │     no change │
│ QQuery 11    │   458.71 ms │   417.23 ms │ +1.10x faster │
│ QQuery 12    │  1408.78 ms │  1382.73 ms │     no change │
│ QQuery 13    │  2081.78 ms │  2148.16 ms │     no change │
│ QQuery 14    │  1242.39 ms │  1286.95 ms │     no change │
│ QQuery 15    │  1216.37 ms │  1262.77 ms │     no change │
│ QQuery 16    │  2679.04 ms │  2730.67 ms │     no change │
│ QQuery 17    │  2657.52 ms │  2666.04 ms │     no change │
│ QQuery 18    │  5347.62 ms │  4996.68 ms │ +1.07x faster │
│ QQuery 19    │   129.19 ms │   126.78 ms │     no change │
│ QQuery 20    │  2020.18 ms │  1974.03 ms │     no change │
│ QQuery 21    │  2330.14 ms │  2291.15 ms │     no change │
│ QQuery 22    │  3990.08 ms │  3921.02 ms │     no change │
│ QQuery 23    │ 12864.85 ms │ 12678.44 ms │     no change │
│ QQuery 24    │   214.91 ms │   197.85 ms │ +1.09x faster │
│ QQuery 25    │   477.06 ms │   465.64 ms │     no change │
│ QQuery 26    │   221.72 ms │   211.14 ms │     no change │
│ QQuery 27    │  2818.62 ms │  2800.08 ms │     no change │
│ QQuery 28    │ 23386.04 ms │ 24032.36 ms │     no change │
│ QQuery 29    │   963.31 ms │   964.57 ms │     no change │
│ QQuery 30    │  1318.41 ms │  1329.87 ms │     no change │
│ QQuery 31    │  1378.81 ms │  1352.16 ms │     no change │
│ QQuery 32    │  4881.46 ms │  5040.19 ms │     no change │
│ QQuery 33    │  5905.32 ms │  5981.60 ms │     no change │
│ QQuery 34    │  6119.86 ms │  6348.76 ms │     no change │
│ QQuery 35    │  1942.66 ms │  1904.86 ms │     no change │
│ QQuery 36    │   120.95 ms │   120.81 ms │     no change │
│ QQuery 37    │    52.95 ms │    53.06 ms │     no change │
│ QQuery 38    │   121.94 ms │   120.66 ms │     no change │
│ QQuery 39    │   196.99 ms │   198.57 ms │     no change │
│ QQuery 40    │    43.90 ms │    41.87 ms │     no change │
│ QQuery 41    │    39.75 ms │    37.46 ms │ +1.06x faster │
│ QQuery 42    │    32.87 ms │    31.87 ms │     no change │
└──────────────┴─────────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary         ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)         │ 95337.65ms │
│ Total Time (specialize)   │ 95749.15ms │
│ Average Time (HEAD)       │  2217.15ms │
│ Average Time (specialize) │  2226.72ms │
│ Queries Faster            │          5 │
│ Queries Slower            │          0 │
│ Queries with No Change    │         38 │
│ Queries with Failure      │          0 │
└───────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ specialize ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 139.55 ms │  138.68 ms │     no change │
│ QQuery 2     │  27.47 ms │   28.88 ms │  1.05x slower │
│ QQuery 3     │  38.06 ms │   37.65 ms │     no change │
│ QQuery 4     │  29.40 ms │   28.38 ms │     no change │
│ QQuery 5     │  87.81 ms │   86.86 ms │     no change │
│ QQuery 6     │  19.52 ms │   19.58 ms │     no change │
│ QQuery 7     │ 222.64 ms │  227.15 ms │     no change │
│ QQuery 8     │  34.19 ms │   35.26 ms │     no change │
│ QQuery 9     │ 102.37 ms │  105.70 ms │     no change │
│ QQuery 10    │  62.63 ms │   64.92 ms │     no change │
│ QQuery 11    │  18.73 ms │   18.59 ms │     no change │
│ QQuery 12    │  51.03 ms │   52.34 ms │     no change │
│ QQuery 13    │  48.65 ms │   48.66 ms │     no change │
│ QQuery 14    │  13.94 ms │   13.77 ms │     no change │
│ QQuery 15    │  25.10 ms │   24.91 ms │     no change │
│ QQuery 16    │  25.28 ms │   24.76 ms │     no change │
│ QQuery 17    │ 153.69 ms │  150.64 ms │     no change │
│ QQuery 18    │ 277.48 ms │  281.61 ms │     no change │
│ QQuery 19    │  36.61 ms │   36.42 ms │     no change │
│ QQuery 20    │  50.79 ms │   48.40 ms │     no change │
│ QQuery 21    │ 338.03 ms │  318.12 ms │ +1.06x faster │
│ QQuery 22    │  20.42 ms │   17.44 ms │ +1.17x faster │
└──────────────┴───────────┴────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary         ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)         │ 1823.38ms │
│ Total Time (specialize)   │ 1808.75ms │
│ Average Time (HEAD)       │   82.88ms │
│ Average Time (specialize) │   82.22ms │
│ Queries Faster            │         2 │
│ Queries Slower            │         1 │
│ Queries with No Change    │        19 │
│ Queries with Failure      │         0 │
└───────────────────────────┴───────────┘

alamb avatar Nov 22 '25 11:11 alamb

🤖 ./gh_compare_branch_bench.sh Benchmark Script Running Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux Comparing specialize (1e4782f1cde36a83398002a4730ff96123b547a3) to 7d8b8602ad1be2f61f6a8ebb253ace9d85304ea7 diff BENCH_NAME=in_list BENCH_COMMAND=cargo bench --bench in_list BENCH_FILTER= BENCH_BRANCH_NAME=specialize Results will be posted here when complete

alamb avatar Nov 22 '25 12:11 alamb

🤖: Benchmark completed

Details

group                                       main                                   specialize
-----                                       ----                                   ----------
in_list_f32 (1024, 0) IN (1, 0)             1.16      4.2±0.01µs        ? ?/sec    1.00      3.6±0.02µs        ? ?/sec
in_list_f32 (1024, 0) IN (10, 0)            1.16      4.2±0.01µs        ? ?/sec    1.00      3.6±0.02µs        ? ?/sec
in_list_f32 (1024, 0) IN (100, 0)           1.16      4.3±0.01µs        ? ?/sec    1.00      3.7±0.01µs        ? ?/sec
in_list_f32 (1024, 0) IN (3, 0)             1.17      4.3±0.10µs        ? ?/sec    1.00      3.6±0.01µs        ? ?/sec
in_list_f32 (1024, 0.2) IN (1, 0)           1.00      5.1±0.02µs        ? ?/sec    1.13      5.8±0.07µs        ? ?/sec
in_list_f32 (1024, 0.2) IN (10, 0)          1.00      5.3±0.03µs        ? ?/sec    1.09      5.7±0.02µs        ? ?/sec
in_list_f32 (1024, 0.2) IN (100, 0)         1.00      5.3±0.04µs        ? ?/sec    1.13      5.9±0.10µs        ? ?/sec
in_list_f32 (1024, 0.2) IN (3, 0)           1.00      5.2±0.04µs        ? ?/sec    1.12      5.8±0.05µs        ? ?/sec
in_list_i32 (1024, 0) IN (1, 0)             1.00      4.3±0.01µs        ? ?/sec    1.11      4.7±0.02µs        ? ?/sec
in_list_i32 (1024, 0) IN (10, 0)            1.00      4.2±0.01µs        ? ?/sec    1.11      4.7±0.04µs        ? ?/sec
in_list_i32 (1024, 0) IN (100, 0)           1.00      4.3±0.01µs        ? ?/sec    1.11      4.7±0.01µs        ? ?/sec
in_list_i32 (1024, 0) IN (3, 0)             1.00      4.3±0.01µs        ? ?/sec    1.11      4.7±0.07µs        ? ?/sec
in_list_i32 (1024, 0.2) IN (1, 0)           1.00      5.2±0.02µs        ? ?/sec    1.32      6.8±0.39µs        ? ?/sec
in_list_i32 (1024, 0.2) IN (10, 0)          1.00      5.2±0.02µs        ? ?/sec    1.25      6.5±0.03µs        ? ?/sec
in_list_i32 (1024, 0.2) IN (100, 0)         1.00      5.2±0.03µs        ? ?/sec    1.16      6.1±0.06µs        ? ?/sec
in_list_i32 (1024, 0.2) IN (3, 0)           1.00      5.2±0.02µs        ? ?/sec    1.30      6.8±0.03µs        ? ?/sec
in_list_utf8(10) (1024, 0) IN (1, 0)        1.16      4.5±0.02µs        ? ?/sec    1.00      3.9±0.01µs        ? ?/sec
in_list_utf8(10) (1024, 0) IN (10, 0)       1.16      4.5±0.01µs        ? ?/sec    1.00      3.9±0.01µs        ? ?/sec
in_list_utf8(10) (1024, 0) IN (100, 0)      1.16      4.5±0.03µs        ? ?/sec    1.00      3.9±0.01µs        ? ?/sec
in_list_utf8(10) (1024, 0) IN (3, 0)        1.16      4.5±0.04µs        ? ?/sec    1.00      3.9±0.01µs        ? ?/sec
in_list_utf8(10) (1024, 0.2) IN (1, 0)      1.06      5.9±0.03µs        ? ?/sec    1.00      5.6±0.04µs        ? ?/sec
in_list_utf8(10) (1024, 0.2) IN (10, 0)     1.09      6.0±0.06µs        ? ?/sec    1.00      5.5±0.03µs        ? ?/sec
in_list_utf8(10) (1024, 0.2) IN (100, 0)    1.07      6.0±0.03µs        ? ?/sec    1.00      5.6±0.03µs        ? ?/sec
in_list_utf8(10) (1024, 0.2) IN (3, 0)      1.10      6.1±0.04µs        ? ?/sec    1.00      5.6±0.03µs        ? ?/sec
in_list_utf8(20) (1024, 0) IN (1, 0)        1.16      4.5±0.02µs        ? ?/sec    1.00      3.9±0.01µs        ? ?/sec
in_list_utf8(20) (1024, 0) IN (10, 0)       1.16      4.5±0.01µs        ? ?/sec    1.00      3.9±0.01µs        ? ?/sec
in_list_utf8(20) (1024, 0) IN (100, 0)      1.15      4.5±0.04µs        ? ?/sec    1.00      3.9±0.03µs        ? ?/sec
in_list_utf8(20) (1024, 0) IN (3, 0)        1.14      4.6±0.01µs        ? ?/sec    1.00      4.0±0.03µs        ? ?/sec
in_list_utf8(20) (1024, 0.2) IN (1, 0)      1.06      6.0±0.02µs        ? ?/sec    1.00      5.7±0.03µs        ? ?/sec
in_list_utf8(20) (1024, 0.2) IN (10, 0)     1.07      6.0±0.03µs        ? ?/sec    1.00      5.6±0.03µs        ? ?/sec
in_list_utf8(20) (1024, 0.2) IN (100, 0)    1.09      6.1±0.03µs        ? ?/sec    1.00      5.6±0.02µs        ? ?/sec
in_list_utf8(20) (1024, 0.2) IN (3, 0)      1.04      6.1±0.03µs        ? ?/sec    1.00      5.8±0.05µs        ? ?/sec
in_list_utf8(5) (1024, 0) IN (1, 0)         1.15      4.5±0.01µs        ? ?/sec    1.00      4.0±0.01µs        ? ?/sec
in_list_utf8(5) (1024, 0) IN (10, 0)        1.16      4.5±0.03µs        ? ?/sec    1.00      3.9±0.01µs        ? ?/sec
in_list_utf8(5) (1024, 0) IN (100, 0)       1.16      4.5±0.01µs        ? ?/sec    1.00      3.9±0.00µs        ? ?/sec
in_list_utf8(5) (1024, 0) IN (3, 0)         1.14      4.5±0.03µs        ? ?/sec    1.00      4.0±0.02µs        ? ?/sec
in_list_utf8(5) (1024, 0.2) IN (1, 0)       1.06      5.9±0.04µs        ? ?/sec    1.00      5.5±0.04µs        ? ?/sec
in_list_utf8(5) (1024, 0.2) IN (10, 0)      1.10      6.1±0.04µs        ? ?/sec    1.00      5.5±0.03µs        ? ?/sec
in_list_utf8(5) (1024, 0.2) IN (100, 0)     1.11      6.0±0.03µs        ? ?/sec    1.00      5.4±0.03µs        ? ?/sec
in_list_utf8(5) (1024, 0.2) IN (3, 0)       1.11      6.1±0.06µs        ? ?/sec    1.00      5.5±0.04µs        ? ?/sec

alamb avatar Nov 22 '25 12:11 alamb

🤖: Benchmark completed

Details

group                                       main                                   specialize
-----                                       ----                                   ----------
in_list_f32 (1024, 0) IN (1, 0)             1.16      4.2±0.01µs        ? ?/sec    1.00      3.6±0.02µs        ? ?/sec
in_list_f32 (1024, 0) IN (10, 0)            1.16      4.2±0.01µs        ? ?/sec    1.00      3.6±0.02µs        ? ?/sec
in_list_f32 (1024, 0) IN (100, 0)           1.16      4.3±0.01µs        ? ?/sec    1.00      3.7±0.01µs        ? ?/sec
in_list_f32 (1024, 0) IN (3, 0)             1.17      4.3±0.10µs        ? ?/sec    1.00      3.6±0.01µs        ? ?/sec
in_list_f32 (1024, 0.2) IN (1, 0)           1.00      5.1±0.02µs        ? ?/sec    1.13      5.8±0.07µs        ? ?/sec
in_list_f32 (1024, 0.2) IN (10, 0)          1.00      5.3±0.03µs        ? ?/sec    1.09      5.7±0.02µs        ? ?/sec
in_list_f32 (1024, 0.2) IN (100, 0)         1.00      5.3±0.04µs        ? ?/sec    1.13      5.9±0.10µs        ? ?/sec
in_list_f32 (1024, 0.2) IN (3, 0)           1.00      5.2±0.04µs        ? ?/sec    1.12      5.8±0.05µs        ? ?/sec
in_list_i32 (1024, 0) IN (1, 0)             1.00      4.3±0.01µs        ? ?/sec    1.11      4.7±0.02µs        ? ?/sec
in_list_i32 (1024, 0) IN (10, 0)            1.00      4.2±0.01µs        ? ?/sec    1.11      4.7±0.04µs        ? ?/sec
in_list_i32 (1024, 0) IN (100, 0)           1.00      4.3±0.01µs        ? ?/sec    1.11      4.7±0.01µs        ? ?/sec
in_list_i32 (1024, 0) IN (3, 0)             1.00      4.3±0.01µs        ? ?/sec    1.11      4.7±0.07µs        ? ?/sec
in_list_i32 (1024, 0.2) IN (1, 0)           1.00      5.2±0.02µs        ? ?/sec    1.32      6.8±0.39µs        ? ?/sec
in_list_i32 (1024, 0.2) IN (10, 0)          1.00      5.2±0.02µs        ? ?/sec    1.25      6.5±0.03µs        ? ?/sec
in_list_i32 (1024, 0.2) IN (100, 0)         1.00      5.2±0.03µs        ? ?/sec    1.16      6.1±0.06µs        ? ?/sec
in_list_i32 (1024, 0.2) IN (3, 0)           1.00      5.2±0.02µs        ? ?/sec    1.30      6.8±0.03µs        ? ?/sec
in_list_utf8(10) (1024, 0) IN (1, 0)        1.16      4.5±0.02µs        ? ?/sec    1.00      3.9±0.01µs        ? ?/sec
in_list_utf8(10) (1024, 0) IN (10, 0)       1.16      4.5±0.01µs        ? ?/sec    1.00      3.9±0.01µs        ? ?/sec
in_list_utf8(10) (1024, 0) IN (100, 0)      1.16      4.5±0.03µs        ? ?/sec    1.00      3.9±0.01µs        ? ?/sec
in_list_utf8(10) (1024, 0) IN (3, 0)        1.16      4.5±0.04µs        ? ?/sec    1.00      3.9±0.01µs        ? ?/sec
in_list_utf8(10) (1024, 0.2) IN (1, 0)      1.06      5.9±0.03µs        ? ?/sec    1.00      5.6±0.04µs        ? ?/sec
in_list_utf8(10) (1024, 0.2) IN (10, 0)     1.09      6.0±0.06µs        ? ?/sec    1.00      5.5±0.03µs        ? ?/sec
in_list_utf8(10) (1024, 0.2) IN (100, 0)    1.07      6.0±0.03µs        ? ?/sec    1.00      5.6±0.03µs        ? ?/sec
in_list_utf8(10) (1024, 0.2) IN (3, 0)      1.10      6.1±0.04µs        ? ?/sec    1.00      5.6±0.03µs        ? ?/sec
in_list_utf8(20) (1024, 0) IN (1, 0)        1.16      4.5±0.02µs        ? ?/sec    1.00      3.9±0.01µs        ? ?/sec
in_list_utf8(20) (1024, 0) IN (10, 0)       1.16      4.5±0.01µs        ? ?/sec    1.00      3.9±0.01µs        ? ?/sec
in_list_utf8(20) (1024, 0) IN (100, 0)      1.15      4.5±0.04µs        ? ?/sec    1.00      3.9±0.03µs        ? ?/sec
in_list_utf8(20) (1024, 0) IN (3, 0)        1.14      4.6±0.01µs        ? ?/sec    1.00      4.0±0.03µs        ? ?/sec
in_list_utf8(20) (1024, 0.2) IN (1, 0)      1.06      6.0±0.02µs        ? ?/sec    1.00      5.7±0.03µs        ? ?/sec
in_list_utf8(20) (1024, 0.2) IN (10, 0)     1.07      6.0±0.03µs        ? ?/sec    1.00      5.6±0.03µs        ? ?/sec
in_list_utf8(20) (1024, 0.2) IN (100, 0)    1.09      6.1±0.03µs        ? ?/sec    1.00      5.6±0.02µs        ? ?/sec
in_list_utf8(20) (1024, 0.2) IN (3, 0)      1.04      6.1±0.03µs        ? ?/sec    1.00      5.8±0.05µs        ? ?/sec
in_list_utf8(5) (1024, 0) IN (1, 0)         1.15      4.5±0.01µs        ? ?/sec    1.00      4.0±0.01µs        ? ?/sec
in_list_utf8(5) (1024, 0) IN (10, 0)        1.16      4.5±0.03µs        ? ?/sec    1.00      3.9±0.01µs        ? ?/sec
in_list_utf8(5) (1024, 0) IN (100, 0)       1.16      4.5±0.01µs        ? ?/sec    1.00      3.9±0.00µs        ? ?/sec
in_list_utf8(5) (1024, 0) IN (3, 0)         1.14      4.5±0.03µs        ? ?/sec    1.00      4.0±0.02µs        ? ?/sec
in_list_utf8(5) (1024, 0.2) IN (1, 0)       1.06      5.9±0.04µs        ? ?/sec    1.00      5.5±0.04µs        ? ?/sec
in_list_utf8(5) (1024, 0.2) IN (10, 0)      1.10      6.1±0.04µs        ? ?/sec    1.00      5.5±0.03µs        ? ?/sec
in_list_utf8(5) (1024, 0.2) IN (100, 0)     1.11      6.0±0.03µs        ? ?/sec    1.00      5.4±0.03µs        ? ?/sec
in_list_utf8(5) (1024, 0.2) IN (3, 0)       1.11      6.1±0.06µs        ? ?/sec    1.00      5.5±0.04µs        ? ?/sec

Results seem a bit mixed?

Dandandan avatar Nov 23 '25 06:11 Dandandan

Yes. Slowdowns for i32 are concerning. I won’t merge this until it’s all speedups or neutral. I may also make a support PR to add more benchmarks for other types so we can make better comparisons.

adriangb avatar Nov 23 '25 06:11 adriangb

I am hoping to look at this PR today or tomorrow and help push it along

alamb avatar Dec 02 '25 12:12 alamb

I realized this new implementation differs from the existing one in ways which fix some bugs. I opened https://github.com/apache/datafusion/pull/19050 to demonstrate those bugs. So maybe lets first fix the bugs and merge the new tests, then we can continue here?

adriangb avatar Dec 02 '25 18:12 adriangb

General comment on the benchmark but... am I reading them wrong, or is the null_percent input logic inverted?

fn do_benches(
    c: &mut Criterion,
    array_length: usize,
    in_list_length: usize,
    null_percent: f64,
) {
...
    let values: Int32Array = (0..array_length)
        .map(|_| rng.random_bool(null_percent).then(|| rng.random()))
        .collect();

    let in_list: Vec<_> = (0..in_list_length)
        .map(|_| ScalarValue::Int32(Some(rng.random())))
        .collect();

    do_bench(
        c,
        &format!("in_list_i32 ({array_length}, {null_percent}) IN ({in_list_length}, 0)"),
        Arc::new(values),
        &in_list,
    )

When null_percent is 0, won't this just create a array_length long values array of NULLs?

I think it should rather be something like:

fn do_benches(
    c: &mut Criterion,
    array_length: usize,
    in_list_length: usize,
    null_percent: f64,
) {
    let non_null_percent = 1.0 - null_percent;
...
    let values: Int32Array = (0..array_length)
        .map(|_| rng.random_bool(non_null_percent).then(|| rng.random()))
        .collect();

    let in_list: Vec<_> = (0..in_list_length)
        .map(|_| ScalarValue::Int32(Some(rng.random())))
        .collect();

    do_bench(
        c,
        &format!("in_list_i32 ({array_length}, {null_percent}) IN ({in_list_length}, 0)"),
        Arc::new(values),
        &in_list,
    )

EDIT: I opened https://github.com/apache/datafusion/pull/19204 to fix this.

geoffreyclaude avatar Dec 08 '25 09:12 geoffreyclaude

Another general comment, on the implementation this time: hashing seems overkill and probably overly expensive for small simple type lists.

@adriangb have you considered sorting the InList and doing a binary search?

For small lists (threshold to be refined...) this could be orders of magnitude faster that the overhead of hashing. To validate with the new fixed benchmarks of course!

geoffreyclaude avatar Dec 08 '25 09:12 geoffreyclaude

Another general comment, on the implementation this time: hashing seems overkill and probably overly expensive for small simple type lists.

@adriangb have you considered sorting the InList and doing a binary search?

For small lists (threshold to be refined...) this could be orders of magnitude faster that the overhead of hashing. To validate with the new fixed benchmarks of course!

I think that's a really neat idea! I did look into using binary search at some point (not this PR) but then realized that there was already the hashing in place and didn't pursue it further. So while I agree it could be interesting to have two approaches, I think we should focus on improving the current approach right now especially since we probably degraded performance since last release. Maybe let's file a ticket to track benchmarking that approach?

adriangb avatar Dec 08 '25 17:12 adriangb

Another general comment, on the implementation this time: hashing seems overkill and probably overly expensive for small simple type lists. @adriangb have you considered sorting the InList and doing a binary search? For small lists (threshold to be refined...) this could be orders of magnitude faster that the overhead of hashing. To validate with the new fixed benchmarks of course!

I think that's a really neat idea! I did look into using binary search at some point (not this PR) but then realized that there was already the hashing in place and didn't pursue it further. So while I agree it could be interesting to have two approaches, I think we should focus on improving the current approach right now especially since we probably degraded performance since last release. Maybe let's file a ticket to track benchmarking that approach?

This was too much fun to skip the opportunity! I've opened a draft PR on your branch that shows pretty cool results: https://github.com/pydantic/datafusion/pull/46

geoffreyclaude avatar Dec 08 '25 18:12 geoffreyclaude

run benchmarks in_list

adriangb avatar Dec 09 '25 21:12 adriangb

🤖 Hi @adriangb, thanks for the request (https://github.com/apache/datafusion/pull/18832#issuecomment-3634442805).

scrape_comments.py only supports whitelisted benchmarks.

  • Standard: clickbench_1, clickbench_extended, clickbench_partitioned, clickbench_pushdown, tpch, tpch10, tpch_mem, tpch_mem10
  • Criterion: aggregate_query_sql, aggregate_vectorized, case_when, in_list, sql_planner

Please choose one or more of these with run benchmark <name> or run benchmark <name1> <name2>...

alamb-ghbot avatar Dec 09 '25 21:12 alamb-ghbot

~ hey real @alamb could you whitelist in_list for me (might as well do Daniel and Geoffrey at the same time) ~ nvm user error

adriangb avatar Dec 09 '25 22:12 adriangb

run benchmark in_list

adriangb avatar Dec 09 '25 22:12 adriangb

🤖 ./gh_compare_branch_bench.sh compare_branch_bench.sh Running Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux Comparing specialize (d11d7ae6d79acc1cc4e815fd01e37bce6afae0ad) to e8a0829b7a2044ab477784d60ed2d208b97f5042 diff BENCH_NAME=in_list BENCH_COMMAND=cargo bench --all-features --bench in_list BENCH_FILTER= BENCH_BRANCH_NAME=specialize Results will be posted here when complete

alamb-ghbot avatar Dec 09 '25 22:12 alamb-ghbot

~ hey real @alamb could you whitelist in_list for me (might as well do Daniel and Geoffrey at the same time) ~ nvm user error

@Dandandan is already on the list https://github.com/alamb/datafusion-benchmarking/blob/4fb120785fa66ecbf40a45d8a5d0d5f4be17266a/scripts/scrape_comments.py#L41

I can add @geoffreyclaude if he would like

alamb avatar Dec 09 '25 22:12 alamb

🤖 Hi @adriangb, thanks for the request (https://github.com/apache/datafusion/pull/18832#issuecomment-3634468395).

scrape_comments.py only supports whitelisted benchmarks.

  • Standard: clickbench_1, clickbench_extended, clickbench_partitioned, clickbench_pushdown, tpch, tpch10, tpch_mem, tpch_mem10
  • Criterion: aggregate_query_sql, aggregate_vectorized, case_when, in_list, sql_planner

Please choose one or more of these with run benchmark <name> or run benchmark <name1> <name2>...

alamb-ghbot avatar Dec 09 '25 22:12 alamb-ghbot

run benchmarks in_list

alamb avatar Dec 09 '25 22:12 alamb

🤖 Hi @alamb, thanks for the request (https://github.com/apache/datafusion/pull/18832#issuecomment-3634537320).

scrape_comments.py only supports whitelisted benchmarks.

  • Standard: clickbench_1, clickbench_extended, clickbench_partitioned, clickbench_pushdown, tpch, tpch10, tpch_mem, tpch_mem10
  • Criterion: aggregate_query_sql, aggregate_vectorized, case_when, in_list, sql_planner

Please choose one or more of these with run benchmark <name> or run benchmark <name1> <name2>...

alamb-ghbot avatar Dec 09 '25 22:12 alamb-ghbot

The key is benchmark vs benchmarks fyi. I can make a PR to accept both tomorrow.

adriangb avatar Dec 09 '25 22:12 adriangb

(I fixed the script to support

The key is benchmark vs benchmarks fyi. I can make a PR to accept both tomorrow.

Yeah, I thought I could bash it out quickly but sadly it took more than 30 seconds

alamb avatar Dec 09 '25 22:12 alamb

🤖: Benchmark completed

Details

group                                          main                                   specialize
-----                                          ----                                   ----------
in_list/Float32/list=100/nulls=0%              1.01    307.1±8.23µs        ? ?/sec    1.00    302.7±1.87µs        ? ?/sec
in_list/Float32/list=100/nulls=20%             1.04    391.1±2.13µs        ? ?/sec    1.00    374.4±4.60µs        ? ?/sec
in_list/Float32/list=3/nulls=0%                1.03     14.6±0.17µs        ? ?/sec    1.00     14.2±0.11µs        ? ?/sec
in_list/Float32/list=3/nulls=20%               1.00     19.7±0.15µs        ? ?/sec    1.00     19.7±0.13µs        ? ?/sec
in_list/Float32/list=8/nulls=0%                1.00     29.1±0.36µs        ? ?/sec    1.00     29.2±1.14µs        ? ?/sec
in_list/Float32/list=8/nulls=20%               1.07     39.4±0.27µs        ? ?/sec    1.00     36.8±0.47µs        ? ?/sec
in_list/Int32/list=100/nulls=0%                1.00      5.9±0.13µs        ? ?/sec    1.06      6.2±0.04µs        ? ?/sec
in_list/Int32/list=100/nulls=20%               1.06      8.4±0.08µs        ? ?/sec    1.00      7.9±0.11µs        ? ?/sec
in_list/Int32/list=3/nulls=0%                  1.00      5.6±0.06µs        ? ?/sec    1.21      6.8±0.04µs        ? ?/sec
in_list/Int32/list=3/nulls=20%                 1.05      8.7±0.07µs        ? ?/sec    1.00      8.3±0.11µs        ? ?/sec
in_list/Int32/list=8/nulls=0%                  1.00      6.0±0.04µs        ? ?/sec    1.09      6.5±0.16µs        ? ?/sec
in_list/Int32/list=8/nulls=20%                 1.04      8.4±0.10µs        ? ?/sec    1.00      8.1±0.04µs        ? ?/sec
in_list/Utf8/list=100/nulls=0%/str=100         1.00   645.1±11.45µs        ? ?/sec    1.00   645.7±12.49µs        ? ?/sec
in_list/Utf8/list=100/nulls=0%/str=12          1.00    667.7±6.12µs        ? ?/sec    1.03    685.0±3.93µs        ? ?/sec
in_list/Utf8/list=100/nulls=0%/str=3           1.00    669.3±4.02µs        ? ?/sec    1.06    707.1±7.55µs        ? ?/sec
in_list/Utf8/list=100/nulls=20%/str=100        1.00    594.9±3.67µs        ? ?/sec    1.02   604.8±12.27µs        ? ?/sec
in_list/Utf8/list=100/nulls=20%/str=12         1.00    613.4±6.72µs        ? ?/sec    1.02    627.5±5.92µs        ? ?/sec
in_list/Utf8/list=100/nulls=20%/str=3          1.00    615.7±2.78µs        ? ?/sec    1.02    627.1±3.23µs        ? ?/sec
in_list/Utf8/list=3/nulls=0%/str=100           1.01     27.3±0.97µs        ? ?/sec    1.00     27.0±1.15µs        ? ?/sec
in_list/Utf8/list=3/nulls=0%/str=12            1.06     27.4±0.34µs        ? ?/sec    1.00     25.9±0.40µs        ? ?/sec
in_list/Utf8/list=3/nulls=0%/str=3             1.00     28.3±1.19µs        ? ?/sec    1.04     29.4±3.04µs        ? ?/sec
in_list/Utf8/list=3/nulls=20%/str=100          1.02     26.2±0.26µs        ? ?/sec    1.00     25.8±0.51µs        ? ?/sec
in_list/Utf8/list=3/nulls=20%/str=12           1.00     27.1±0.10µs        ? ?/sec    1.05     28.5±1.12µs        ? ?/sec
in_list/Utf8/list=3/nulls=20%/str=3            1.04     27.7±0.75µs        ? ?/sec    1.00     26.5±0.55µs        ? ?/sec
in_list/Utf8/list=8/nulls=0%/str=100           1.00     56.7±1.89µs        ? ?/sec    1.00     56.4±0.99µs        ? ?/sec
in_list/Utf8/list=8/nulls=0%/str=12            1.00     57.8±2.60µs        ? ?/sec    1.02     59.2±0.56µs        ? ?/sec
in_list/Utf8/list=8/nulls=0%/str=3             1.00     58.7±0.24µs        ? ?/sec    1.09     64.3±6.59µs        ? ?/sec
in_list/Utf8/list=8/nulls=20%/str=100          1.03     53.4±1.32µs        ? ?/sec    1.00     51.9±0.10µs        ? ?/sec
in_list/Utf8/list=8/nulls=20%/str=12           1.00     54.2±1.31µs        ? ?/sec    1.00     54.5±0.69µs        ? ?/sec
in_list/Utf8/list=8/nulls=20%/str=3            1.01     54.6±0.20µs        ? ?/sec    1.00     53.8±0.53µs        ? ?/sec
in_list/Utf8View/list=100/nulls=0%/str=100     1.00    597.3±9.22µs        ? ?/sec    1.01   603.2±10.62µs        ? ?/sec
in_list/Utf8View/list=100/nulls=0%/str=12      1.00   581.4±24.63µs        ? ?/sec    1.00   581.1±13.15µs        ? ?/sec
in_list/Utf8View/list=100/nulls=0%/str=3       1.01   582.3±12.05µs        ? ?/sec    1.00    578.6±5.55µs        ? ?/sec
in_list/Utf8View/list=100/nulls=20%/str=100    1.00    503.3±5.49µs        ? ?/sec    1.02   515.1±27.03µs        ? ?/sec
in_list/Utf8View/list=100/nulls=20%/str=12     1.00    510.9±6.95µs        ? ?/sec    1.01    515.3±6.38µs        ? ?/sec
in_list/Utf8View/list=100/nulls=20%/str=3      1.00    515.1±7.18µs        ? ?/sec    1.01   522.1±13.63µs        ? ?/sec
in_list/Utf8View/list=3/nulls=0%/str=100       1.05     22.2±0.28µs        ? ?/sec    1.00     21.1±0.24µs        ? ?/sec
in_list/Utf8View/list=3/nulls=0%/str=12        1.05     23.3±0.62µs        ? ?/sec    1.00     22.3±0.30µs        ? ?/sec
in_list/Utf8View/list=3/nulls=0%/str=3         1.04     23.3±0.18µs        ? ?/sec    1.00     22.4±0.82µs        ? ?/sec
in_list/Utf8View/list=3/nulls=20%/str=100      1.00     22.8±0.22µs        ? ?/sec    1.02     23.2±0.24µs        ? ?/sec
in_list/Utf8View/list=3/nulls=20%/str=12       1.00     23.4±0.18µs        ? ?/sec    1.03     24.2±0.49µs        ? ?/sec
in_list/Utf8View/list=3/nulls=20%/str=3        1.00     23.7±0.37µs        ? ?/sec    1.02     24.2±1.51µs        ? ?/sec
in_list/Utf8View/list=8/nulls=0%/str=100       1.05     49.6±3.83µs        ? ?/sec    1.00     47.2±0.83µs        ? ?/sec
in_list/Utf8View/list=8/nulls=0%/str=12        1.00     50.5±0.65µs        ? ?/sec    1.18     59.7±1.62µs        ? ?/sec
in_list/Utf8View/list=8/nulls=0%/str=3         1.00     50.6±0.72µs        ? ?/sec    1.17     59.5±0.70µs        ? ?/sec
in_list/Utf8View/list=8/nulls=20%/str=100      1.00     45.4±2.06µs        ? ?/sec    1.06     48.2±1.59µs        ? ?/sec
in_list/Utf8View/list=8/nulls=20%/str=12       1.01     47.3±0.67µs        ? ?/sec    1.00     46.8±0.83µs        ? ?/sec
in_list/Utf8View/list=8/nulls=20%/str=3        1.01     49.1±0.36µs        ? ?/sec    1.00     48.6±1.29µs        ? ?/sec

alamb-ghbot avatar Dec 09 '25 22:12 alamb-ghbot

@Dandandan is already on the list https://github.com/alamb/datafusion-benchmarking/blob/4fb120785fa66ecbf40a45d8a5d0d5f4be17266a/scripts/scrape_comments.py#L41

I can add @geoffreyclaude if he would like

@alamb yes please :) Especially for when/if I follow through with the additional perf improvements.

geoffreyclaude avatar Dec 10 '25 14:12 geoffreyclaude

run benchmarks in_list

adriangb avatar Dec 10 '25 14:12 adriangb

🤖 Hi @adriangb, thanks for the request (https://github.com/apache/datafusion/pull/18832#issuecomment-3637437969).

scrape_comments.py only supports whitelisted benchmarks.

  • Standard: clickbench_1, clickbench_extended, clickbench_partitioned, clickbench_pushdown, tpch, tpch10, tpch_mem, tpch_mem10
  • Criterion: aggregate_query_sql, aggregate_vectorized, case_when, in_list, sql_planner

Please choose one or more of these with run benchmark <name> or run benchmark <name1> <name2>...

alamb-ghbot avatar Dec 10 '25 14:12 alamb-ghbot