datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Use `min_value` and `max_value` on statistics to avoid `ExecutionPlan.execute`

Open samuelcolvin opened this issue 1 year ago • 11 comments

Describe the bug

Maybe related to #5535, but I couldn't find anything identical, so created a fresh issue.

If this is a known bug and you think the fix might be moderate in scope, I'm happy to have a go at fixing it?

To Reproduce

I have a custom TableProvider and ExecutionPlan, where calling execute is somewhat expensive and I want to avoid calling it if no data will match.

The execution plan can return helpful statistics from .statistics(), including for example, for one column:

...
ColumnStatistics {
    null_count: Precision::Exact(0),
    max_value: Precision::Exact(ScalarValue::Int64(Some(4))),
    min_value: Precision::Exact(ScalarValue::Int64(Some(4))),
    distinct_count: Precision::Exact(1),
},

E.g. "in this column all values are equal to 4". This is successfully used by Datafusion if I query value is null, the execute() function is never alled.

But if I query value > 5 or value < 0, the statistic is ignored and execute() is still called.

Expected behavior

min_value and max_value of ColumnStatistics should be used for pruning and the query plan should not require the "slow" execute method to be called.

Additional context

I can give a fairly minimal example if required, but I thought best to report the issue and check if it was well known before going to that effort?

I've tried this on both main (as of today) and 37.1.0

samuelcolvin avatar May 06 '24 22:05 samuelcolvin

For anyone else looking for this, the solution is to implement supports_filters_pushdown on the table, then use that to apply filters within scan.

Feel free to close this if you like.

samuelcolvin avatar May 07 '24 11:05 samuelcolvin

I think using the reported statistics to prune using datafusion::physical_optimizer::pruning::PruningPredicate would be a nice improvement (so each table provider didn't have to apply the basic min/max filters themselves)

alamb avatar May 08 '24 17:05 alamb

Relabed from bug to feature -- thanks @samuelcolvin

alamb avatar May 08 '24 17:05 alamb

@alamb using PruningPredicate makes sense, but please can you point me at where I need to make changes to add this functionality?

samuelcolvin avatar May 11 '24 09:05 samuelcolvin

@alamb using PruningPredicate makes sense, but please can you point me at where I need to make changes to add this functionality?

I recommend updating the existing though poorly named AggregateStatistics pass

And there you could potentially call ExecutionPlan::statistics() and use PruningPredicate to prove that some predicates can't be true and replace them with PlaceholderRowExec

alamb avatar May 13 '24 19:05 alamb

(Thank you for working on this, bTW)

alamb avatar May 13 '24 19:05 alamb

@samuelcolvin have you already started to write code?

edmondop avatar May 15 '24 05:05 edmondop

No code yet, if you'd like to work on this, feel free.

samuelcolvin avatar May 15 '24 05:05 samuelcolvin

Actually I'm on a flight today, so might have some time to work on this.

samuelcolvin avatar May 15 '24 07:05 samuelcolvin

Progress update, I've got min & max stats pruning "working" in our code, however I immediately ran into #10536, I'll let you know how I get on.

samuelcolvin avatar May 15 '24 21:05 samuelcolvin

You know it occurs to me that @dmitrybugakov / @jayzhan211 may be working on a similar feature with a different approach on https://github.com/apache/datafusion/issues/10456 🤔 I

alamb avatar May 15 '24 22:05 alamb