datafusion-comet icon indicating copy to clipboard operation
datafusion-comet copied to clipboard

Evaluate use of selection vectors in scan-filter-join operations

Open andygrove opened this issue 1 year ago • 2 comments

What is the problem the feature request solves?

It is very common to have scan -> filter as inputs to a join. The copying of data in the filter can be expensive when the batch contains strings and complex types, and the result of the filter is discarded after the join.

I believe that it would be more efficient to have the join use a selection vector to read inputs from the scanned batch rather than perform a filter.

This issue is for tracking the work to create a small prototype to demonstrate. If succesful, then we can discuss making changes in upstream DataFusion to add support for a new ColumnarValue::ArrayWithSelectionVector and then add a specialization in SortMergeJoin to take advantage of this.

Describe the potential solution

No response

Additional context

No response

andygrove avatar Jul 31 '24 18:07 andygrove

Related issue at arrow-rs: https://github.com/apache/arrow-rs/issues/3620

viirya avatar Jul 31 '24 18:07 viirya

This paper may have useful information:

"Filter Representation in Vectorized Query Execution" https://db.cs.cmu.edu/papers/2021/ngom-damon2021.pdf

andygrove avatar Aug 01 '24 14:08 andygrove