datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Consolidate interval analysies from `Interval` and `PruningPredicate`

Open alamb opened this issue 1 year ago • 5 comments

Is your feature request related to a problem or challenge?

We now have two ways to do range / interval analysis in DataFusion.

Having two representations is challenging because we have to implement the same logic in two places. For example,

  • supporting columns known to be NULL that @appletreeisyellow is working on will only affect PruningPredicates
  • Supporting LIKE or substr would require different code in different places
  • Support for IN lists added to PruningPredicate was not added to Interval arithmetic

The rewrite used by the PruningPrediate logic is tricky to understand and only handles very specific predicate forms (see https://github.com/apache/arrow-datafusion/pull/9184 for an essay on the topic and https://github.com/apache/arrow-datafusion/issues/9230 for an example of getting it wrong). Thus it is hard to extend the number of functions / types of predicates that are supported.

The existing range analysis are:

Interval based analysis

The ExprIntervalGraph library is used for cardinality estimation and range analysis for the symmetric hash join, and it:

  1. Can handle arbitrary expressions (such as a < b, not just constants a < 5)
  2. Has a story for how it would support user defined functions (each function would implement a range analysis)

Pruning Predicate Pruning Predicate is used to prune row groups based on min/max values which:

  1. Is vectorized (is efficient to evaluate over 1000s of containers)
  2. Handles the common cases of col <op> constant (such as a = 5, or a < 100), and conjunctions of them
  3. It is not clear (to me at least) how the rewrite that is used can handle arbitrary expressions (e.g. a < b)

Describe the solution you'd like

I would like to rewrite PruningPredicate to use ExprIntervalGraph, measuring and possibly improving the performance of ExprIntervalGraph

The benefits would be:

  1. We would have a single code path to extend and maintain
  2. We would have a clear path for handling arbitrary predicates to prune row groups

Describe alternatives you've considered

Doing so would likely require extending the interval analysis to support more operators (like IN lists) to reach feature parity with the current PruningPredicate rewrite

Additional context

There was a lot of discussion of this topic on the PR that originally introduced Intervals: https://github.com/apache/arrow-datafusion/pull/5322#discussion_r1114324054 between @ozankabak @metegenez and myself

alamb avatar Oct 20 '23 16:10 alamb

@ozankabak and @tustvold I swear we have talked about this topic before but I could not find an existing ticket or discussion. Do you have any other pointers to past discussions?

alamb avatar Oct 20 '23 16:10 alamb

I am not sure this is what you searched for but there was an issue https://github.com/apache/arrow-datafusion/issues/5535.

Actually, I have tried to apply cp_solver strategy to prune row groups. But we observed a performance degradation since this method sacrifices vectorized computing power, meaning that the process needs to be run for each set of statistics. As the number of sets increases, the efficiency decreases.

I will again think about how to insert Interval library there without sacrificing performance.

berkaysynnada avatar Oct 21 '23 13:10 berkaysynnada

I am not sure this is what you searched for but there was an issue https://github.com/apache/arrow-datafusion/issues/5535.

Thank you -- this is exactly what I was looking for.

I will again think about how to insert Interval library there without sacrificing performance.

I may be able to help with this too. One way is use Datum (from arrow) (rather than ScalarValue)in theInterval` representation -- that way each can store a single value or multiple.

alamb avatar Oct 21 '23 20:10 alamb

I just updated this ticket's description with a more coherent story and examples that @appletreeisyellow and I have hit recently while working on in https://github.com/apache/arrow-datafusion/issues/9171

We were talking today and I think @appletreeisyellow may try to prototype what this solution could look like, if she has time, to move the conversation forward.

alamb avatar Feb 14 '24 16:02 alamb

Thank you @alamb for the updating the description

appletreeisyellow avatar Feb 14 '24 17:02 appletreeisyellow