
Further improve performance of IN list evaluation

Open geoffreyclaude opened this issue 1 month ago • 6 comments

Related Issues

  • Follow-up to #18824 (Restore IN_LIST performance -- Implement specialized StaticFilters for different data types)
  • POC implementation: https://github.com/pydantic/datafusion/pull/48
  • Micro Benchmarks of the different search algorithms: https://github.com/geoffreyclaude/datafusion/pull/14

Motivation

IN list filters have become critical-path operations for dynamic filter pushdown. When scanning large tables with partition pruning or dynamic filters, the IN expression is evaluated millions of times per query. The current generic implementation leaves significant performance on the table.

The POC demonstrates 30-78% speedups on primitive types and up to 43% speedups on string types by exploiting type-specific optimizations that the compiler cannot infer from generic code.

High-Level Optimization Strategy

1. Const-Generic Branchless Evaluation

For small IN lists (≤16 elements), use compile-time-known array sizes ([T; N] instead of Vec<T>), as sketched after this list. This enables:

  • Loop unrolling and SIMD vectorization
  • Branch elimination (no conditional jumps)
  • Register-resident data (no heap access)
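
A minimal sketch of the shape this could take (illustrative names, not the POC's verbatim code): because N is known at compile time, the compiler can fully unroll the loop, and the accumulator pattern below removes conditional branches.

```rust
// Illustrative sketch of a const-generic branchless membership test.
struct BranchlessFilter<T, const N: usize> {
    values: [T; N],
}

impl<T: Copy + PartialEq, const N: usize> BranchlessFilter<T, N> {
    #[inline]
    fn contains(&self, needle: T) -> bool {
        let mut found = false;
        // No early exit: comparing all N elements keeps the loop
        // branch-free, so the compiler can unroll and vectorize it.
        for i in 0..N {
            found |= self.values[i] == needle;
        }
        found
    }
}
```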

2. Type Normalization

For equality comparison, only bit patterns matter. Normalize types to reduce code paths:

  • Signed integers → Unsigned (Int32 → UInt32)
  • Floats → Unsigned (Float32 → UInt32)
  • Short Utf8View strings (≤12 bytes) → 128-bit integers

This is zero-cost at runtime (buffer pointer cast, no data copy).
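
A hedged sketch of the reinterpretation, assuming inputs are canonicalized first (e.g., -0.0 folded to +0.0 and NaNs unified) so bit equality coincides with value equality; the function names are illustrative:

```rust
// Illustrative sketch: reinterpret signed/float data as unsigned bit
// patterns for equality-only comparison. No bytes are copied.
fn normalize_i32_slice(values: &[i32]) -> &[u32] {
    // i32 and u32 have identical size and alignment, so the pointer
    // cast is sound.
    unsafe { std::slice::from_raw_parts(values.as_ptr().cast::<u32>(), values.len()) }
}

fn normalize_f32(v: f32) -> u32 {
    // Assumes the caller has canonicalized -0.0 and NaN beforehand so
    // that equal values share a single bit pattern.
    v.to_bits()
}
```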

3. Tiered Lookup Strategy

Select the optimal algorithm based on list size:

| List Size | Strategy | Rationale |
|-----------|----------|-----------|
| 1-16 | Branchless OR-chain | Parallel comparison in registers |
| 17-32 | Binary search | O(log n) with good cache locality |
| >32 | Hash set | O(1) amortized |
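
A sketch of the selection logic; FilterStrategy and select_strategy are named in the plan below, but the exact signatures here are assumptions:

```rust
// Sketch of the tiered strategy selection; thresholds mirror the
// table above.
enum FilterStrategy {
    Branchless,   // 1-16 elements: const-generic OR-chain
    BinarySearch, // 17-32 elements: sorted slice + binary search
    HashSet,      // >32 elements: amortized O(1) lookups
}

fn select_strategy(list_len: usize) -> FilterStrategy {
    match list_len {
        0..=16 => FilterStrategy::Branchless,
        17..=32 => FilterStrategy::BinarySearch,
        _ => FilterStrategy::HashSet,
    }
}
```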

4. Short-String Fast Path

Utf8View stores strings ≤12 bytes inline. Reinterpret the 16-byte view struct as a 128-bit integer to turn string comparison into a single integer comparison.
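
A sketch of that reinterpretation, following Arrow's documented view layout (a 4-byte length followed by up to 12 zero-padded inline bytes) and assuming a little-endian target; the helper is illustrative, not an Arrow API:

```rust
// Illustrative sketch: pack a short string the way an inline Utf8View
// view is laid out, so equality becomes one u128 comparison.
fn inline_view(s: &str) -> Option<u128> {
    let bytes = s.as_bytes();
    if bytes.len() > 12 {
        return None; // long strings store a buffer id + offset instead
    }
    let mut view = [0u8; 16];
    view[..4].copy_from_slice(&(bytes.len() as u32).to_le_bytes());
    view[4..4 + bytes.len()].copy_from_slice(bytes);
    // Two inline strings are equal iff their 128-bit patterns match.
    Some(u128::from_le_bytes(view))
}
```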

Benchmark Highlights

Most impactful improvements (POC vs. #18832):

| Benchmark | Speedup |
|-----------|---------|
| Float32/list=3/nulls=0% | -78% (2.20 µs → 485 ns) |
| Float32/list=8/nulls=0% | -77% (2.94 µs → 677 ns) |
| Int32/list=8/nulls=0% | -63% (1.88 µs → 688 ns) |
| Int32/list=3/nulls=0% | -62% (1.41 µs → 531 ns) |
| Utf8View/list=3/str=3 | -43% (2.18 µs → 1.25 µs) |
| Utf8View/list=3/str=12 | -42% (2.19 µs → 1.27 µs) |
| Utf8/list=100/str=3 | -30% (8.02 µs → 5.62 µs) |
[Image: slice search micro-benchmark results]

Proposed Implementation Plan

To make this reviewable, I propose splitting into multiple PRs:

PR 1: Const-Generic Branchless Filter for Primitives

  • [ ] Add BranchlessFilter<T, N> with compile-time known sizes (1-16)
  • [ ] Add FilterStrategy enum and tiered select_strategy (branchless/binary/hash)
  • [ ] Add unified PrimitiveFilter<T, S> with LookupStrategy trait
  • [ ] Cover unsigned types: UInt8/16/32/64

PR 2: Type Normalization

  • [ ] Add TransformingFilter wrapper for zero-cost type reinterpretation
  • [ ] Signed → Unsigned: Int8/16/32/64 → UInt8/16/32/64
  • [ ] Float → Unsigned: Float32 → UInt32, Float64 → UInt64

PR 3: Utf8View Short-String Optimization

  • [ ] Implement short-string (≤12 bytes) reinterpretation as i128/Decimal128
  • [ ] Integrate with branchless filter for small lists
  • [ ] Fallback to hash for long strings or large lists

PR 4: Remaining Types (if needed)

  • [ ] Boolean
  • [ ] Decimal128/256
  • [ ] Binary/LargeBinary/BinaryView

Open Questions

  1. Threshold tuning: The cutoffs (16 for branchless, 32 for binary) are based on microbenchmarks. Should we tune for specific workloads?
  2. Code size: Const-generic instantiation creates multiple monomorphized versions. Is the binary size increase acceptable?
  3. Dictionary arrays: Current POC unpacks dictionaries. Should we optimize the dictionary case separately?

geoffreyclaude avatar Dec 09 '25 16:12 geoffreyclaude

Amazing plan!

adriangb avatar Dec 09 '25 17:12 adriangb

What about slice::contains? Seems like it should be somewhere between the const-sized approach and binary search in terms of threshold window.

Dandandan avatar Dec 09 '25 17:12 Dandandan

There is a related bit of work to untangle: the ScalarValue references that InListExpr is forced to use even if we are starting from an array. It uses these to look up in bloom filters, do predicate pruning, etc. We could make all of the relevant APIs work with an enum of Vec<ScalarValue> (heterogeneous lists) or ArrayRef (homogeneous lists), which would avoid converting array -> ScalarValue and then in some places back to an array (deep in the pruning code, iirc). Some of this is planning-time / build-time stuff, so it is amortized over the data scans, but some of it happens for each file opened. It's not as big as per-row work, but it adds up.
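
A hypothetical sketch of the suggested enum (the name and shape are placeholders, not an existing DataFusion API):

```rust
use arrow_array::ArrayRef;
use datafusion_common::ScalarValue;

/// Hypothetical wrapper: APIs that currently require Vec<ScalarValue>
/// could accept either form and skip the array -> ScalarValue -> array
/// round trip for homogeneous lists.
enum InListValues {
    /// Heterogeneous lists, e.g. mixed literals from the SQL text.
    Scalars(Vec<ScalarValue>),
    /// Homogeneous lists that already live in a single Arrow array.
    Array(ArrayRef),
}
```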

A second thing is that col IN (...) is inefficient when it hits a bloom filter on col if the list is large, because it loops over the values in the list. I'm not sure how or where we would do this, but in theory we could build a bloom filter out of the InListExpr and then do a binary operation between that bloom filter and Parquet's bloom filter, instead of looping over each item in the list and looking it up in the column's bloom filter. At the very least, if we did the point above and pushed down an array, we could probably be more efficient about converting all of the values into something we can look up in the bloom filter (currently it goes through ScalarValue).

adriangb avatar Dec 09 '25 18:12 adriangb

> What about slice::contains? Seems like it should be somewhere between the const-sized approach and binary search in terms of threshold window.

It loses every time against both the branchless approach and binary search. slice::contains is just a wrapper around:

self.iter().any(|y| y == x)

so it needs to do bounds checks and can't be specialized for small arrays.

geoffreyclaude avatar Dec 09 '25 19:12 geoffreyclaude

> > What about slice::contains? Seems like it should be somewhere between the const-sized approach and binary search in terms of threshold window.

> It loses every time against both the branchless approach and binary search. slice::contains is just a wrapper around:
>
> self.iter().any(|y| y == x)
>
> so it needs to do bounds checks and can't be specialized for small arrays.

Hmm, interesting. I would have thought it would be faster somewhere in between. Note it is not equal to self.iter().any(|y| y == x): slice::contains is specialized for primitive types to generate unrolled/vectorized code.

Dandandan avatar Dec 09 '25 21:12 Dandandan

@Dandandan See https://github.com/geoffreyclaude/datafusion/pull/14 for an in-depth micro benchmark and analysis of the different search algorithms.

TL;DR: It's always branchless up to the SIMD limit, then hashset.

[Image: slice search benchmark results]

geoffreyclaude avatar Dec 10 '25 14:12 geoffreyclaude

I've opened https://github.com/apache/datafusion/pull/19376 as a preliminary PR to extend the benchmarks.

geoffreyclaude avatar Dec 17 '25 16:12 geoffreyclaude

> 2. Type Normalization

Note you can potentially use the Row format for this purpose (https://docs.rs/arrow-row/latest/arrow_row/), perhaps as a fallback when more general methods aren't available.

It handles pretty much every Arrow type.

The downside is that the input first needs to be converted into the row format.
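
For illustration, a small sketch of probing an IN list through arrow-row (assuming a recent arrow-row API where convert_columns takes &self and Row exposes its encoded bytes via AsRef<[u8]>; nulls and error handling glossed over):

```rust
use std::collections::HashSet;
use std::sync::Arc;

use arrow_array::{ArrayRef, Int32Array};
use arrow_row::{RowConverter, SortField};
use arrow_schema::{ArrowError, DataType};

fn main() -> Result<(), ArrowError> {
    let converter = RowConverter::new(vec![SortField::new(DataType::Int32)])?;

    // Convert the IN-list values once, at filter build time.
    let list: ArrayRef = Arc::new(Int32Array::from(vec![1, 5, 9]));
    let list_rows = converter.convert_columns(&[list])?;
    let set: HashSet<&[u8]> = list_rows.iter().map(|row| row.as_ref()).collect();

    // Convert each probe batch and look up the encoded bytes.
    let probe: ArrayRef = Arc::new(Int32Array::from(vec![4, 5, 6]));
    let probe_rows = converter.convert_columns(&[probe])?;
    let hits: Vec<bool> = probe_rows
        .iter()
        .map(|row| set.contains(row.as_ref()))
        .collect();
    assert_eq!(hits, vec![false, true, false]);
    Ok(())
}
```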

alamb avatar Dec 18 '25 15:12 alamb