datafusion
datafusion copied to clipboard
Use `min_value` and `max_value` on statistics to avoid `ExecutionPlan.execute`
Describe the bug
Maybe related to #5535, but I couldn't find anything identical, so created a fresh issue.
If this is a known bug and you think the fix might be moderate in scope, I'm happy to have a go at fixing it?
To Reproduce
I have a custom TableProvider and ExecutionPlan, where calling execute is somewhat expensive and I want to avoid calling it if no data will match.
The execution plan can return helpful statistics from .statistics(), including for example, for one column:
...
ColumnStatistics {
null_count: Precision::Exact(0),
max_value: Precision::Exact(ScalarValue::Int64(Some(4))),
min_value: Precision::Exact(ScalarValue::Int64(Some(4))),
distinct_count: Precision::Exact(1),
},
E.g. "in this column all values are equal to 4". This is successfully used by Datafusion if I query value is null, the execute() function is never alled.
But if I query value > 5 or value < 0, the statistic is ignored and execute() is still called.
Expected behavior
min_value and max_value of ColumnStatistics should be used for pruning and the query plan should not require the "slow" execute method to be called.
Additional context
I can give a fairly minimal example if required, but I thought best to report the issue and check if it was well known before going to that effort?
I've tried this on both main (as of today) and 37.1.0
For anyone else looking for this, the solution is to implement supports_filters_pushdown on the table, then use that to apply filters within scan.
Feel free to close this if you like.
I think using the reported statistics to prune using datafusion::physical_optimizer::pruning::PruningPredicate would be a nice improvement (so each table provider didn't have to apply the basic min/max filters themselves)
Relabed from bug to feature -- thanks @samuelcolvin
@alamb using PruningPredicate makes sense, but please can you point me at where I need to make changes to add this functionality?
@alamb using
PruningPredicatemakes sense, but please can you point me at where I need to make changes to add this functionality?
I recommend updating the existing though poorly named AggregateStatistics pass
And there you could potentially call ExecutionPlan::statistics() and use PruningPredicate to prove that some predicates can't be true and replace them with PlaceholderRowExec
(Thank you for working on this, bTW)
@samuelcolvin have you already started to write code?
No code yet, if you'd like to work on this, feel free.
Actually I'm on a flight today, so might have some time to work on this.
Progress update, I've got min & max stats pruning "working" in our code, however I immediately ran into #10536, I'll let you know how I get on.
You know it occurs to me that @dmitrybugakov / @jayzhan211 may be working on a similar feature with a different approach on https://github.com/apache/datafusion/issues/10456 🤔 I