datafusion
datafusion copied to clipboard
Support "A column is known to be entirely NULL" in `PruningPredicate`
Is your feature request related to a problem or challenge?
This is broken out from https://github.com/apache/arrow-datafusion/issues/7869 which is describing a slightly different problem
PruningPredicate
can't be told about columns that are known to contain only NULL
. It can be told which columns have no nulls (via the PruningStatistics::null_counts()
).
Columns that contain only NULL occur in tables that have "schema evolution" -- for example if you have two files such as
File 1: col_a
File 2: col_a
, col_b
(col_b
was added later)
A predicate like col_a != A AND col_b='bananas'
can not be true
for File 1 (as col_B
is logically NULL
for all rows)
This is subtly, but importantly different than the case when nothing is known about the column, which confusingly is encoded by returning NULL from PruningStatistics::min_values()
Describe the solution you'd like
- Add a new method
PruningStatistics::row_counts()
to get the total row counts in each container. - Use the information from
PruningStatistics::row_counts()
andPruningStatistics::null_counts()
to determine containers where columns are entirely NULL - Rewrite the predicate, replacing references to columns known to be
NULL
with aNULL
literal and try to simplify the expressions (e.g.a = 5
-->NULL = 5
-->NULL
)
For the example in this ticket's description with predicate col_a != A AND col_b='bananas'
where col_b
is not known and the relevant container had 100
rows,
- the relevant
PruningStatistics
would returncol_b: {null_count = 100, row_count = 100}
-
PruningPredicate::prune
would determinecol_b
was entirely null, and would rewrite the predicate to becol_a != A AND NULL = 'bananas'
. - The pruning rewrite would happen again, and this time would not try to fetch min/max statistics for
col_b
and thus could be proven to be not true.
Describe alternatives you've considered
No response
Additional context
No response
cc @appletreeisyellow
Thank you @alamb!
One question for this line:
File 2:
col_b
,col_B
(col_b
was added later)
Do you actually mean:
File 2: col_a
, col_b
(col_b
was added later)
Do you actually mean:
Yes, I have update the ticket. Thank you
Thanks for verifying. I plan to work on this issue
One thought I had about this feature was to consider writing predicates like
y = 10
Instead of
y_min <= 10 AND y_max >=10
To something like
CASE
WHEN y_null_count == y_row_count THEN false
ELSE y_min <= 10 AND y_max >=10
END
where y_row_count
is a new column based on the value of PruningStatistics::row_counts
I think this would only be valid for top level expressions in the AND
clause (not any arbitrary sub expression)
It might also now be a good time to look into how we could rewrite pruning predicate to use the range analysis in https://docs.rs/datafusion/latest/datafusion/physical_expr/intervals/cp_solver/index.html 🤔
That would be the more elegant (but likely much more substantial) solution in my opinion
I think this would only be valid for top level expressions in the
AND
clause (not any arbitrary sub expression)
Do you mean something like this?
For example, rewriting a complicated predicate like x < 5 AND x > 0
instead of
x_max < 5 AND 0 < x_min
To something like (a)
CASE
WHEN x_null_count = x_row_count THEN false
ELSE x_max < 5 AND 0 < x_min
END
I don't think we want it to be (b)
CASE
WHEN x_null_count = x_row_count THEN false
ELSE x_max < 5
END
AND
CASE
WHEN x_null_count = x_row_count THEN false
ELSE 0 < x_min
END
I think case expression can be wrapped around the OR
clause as well. For example, x = 8 OR x < 0
can be rewrite into
CASE
WHEN x_null_count = x_row_count THEN false
ELSE (x_min <= 8 AND 8 <= x_max) OR x_max < 0
END
If we know that a container has column x
to be entirely NULL
, this container can be skipped, regardless the result of OR
clause is true or false.
For a predicate with more than one columns, we might want to have a WHEN clause for each column.
For example, rewriting a predicate like x < 5 AND x > 0 AND y = 10
into
CASE
WHEN x_null_count = x_row_count THEN false
WHEN y_null_count = y_row_count THEN false
ELSE <rewrite for x < 5 AND x > 0 AND y = 10>
END
It might also now be a good time to look into how we could rewrite pruning predicate to use the range analysis in docs.rs/datafusion/latest/datafusion/physical_expr/intervals/cp_solver/index.html 🤔
That would be the more elegant (but likely much more substantial) solution in my opinion
I'm interested in rewriting pruning predicate to use the range analysis! For now, I'm still writing in PhysicalExpr
just to see how the code works. It would be nice to refactor into range analysis after I sort out basic logics
After discussion with @alamb, we plan to do the implementation in two phases (i.e. two PRs):
- Turn each sub expression into a case expression
- Simplify the case expression and make it easy to read
1. Turn each sub expression into a case expression
Each sub expression will be rewritten into a case expression instead of wrapping the entire expression into one case expression. Each sub expression has its own case expression will make sure the pruning predict rewrite logic is correct.
For example, x < 5 AND x > 0 OR y = 10
will be rewritten into
# x < 5
CASE
WHEN x_null_count = x_row_count THEN false
ELSE x_max < 5
END
AND
# x > 0
CASE
WHEN x_null_count = x_row_count THEN false
ELSE 0 < x_min
END
OR
# y = 10
CASE
WHEN y_null_count = y_row_count THEN false
ELSE y_min <= 10 AND 10 <= y_max
END
2. Simplify the case expression and make it easy to read
The above example is formatted in a way that's easy to read. In the actual pruning predicate string, there is no new lines and indentation. The above example looks like this in the query explain:
CASE WHEN x_null_count = x_row_count THEN false ELSE x_max < 5 END AND CASE WHEN x_null_count = x_row_count THEN false ELSE 0 < x_min END OR CASE WHEN y_null_count = y_row_count THEN false ELSE y_min <= 10 AND 10 <= y_max END
As you can see, the final pruning predict rewrite can be long and hard to read. Therefore, we need phase 2 to improve the readability.
Probably add format, like ()
and new lines to the expression string. I will have a better idea after phase 1 PR is done
There is a draft PR: https://github.com/apache/arrow-datafusion/pull/9223. The major feature is implemented, but there are two things I plan to do before I open it for review (see details).
I plan to continue the work on the week of March 4, 2024.
Here is an update of the plan
1. Turn each sub expression into a case expression
Implemented in https://github.com/apache/arrow-datafusion/pull/9223
2. Simplify the case expression and make it easy to read
To be implemented in a follow-up PR:
- Combine repeated CASE expressions https://github.com/apache/arrow-datafusion/pull/9223#discussion_r1525213777
- By default, show part of pruning predicate (i.e., truncated one) so users can know there is pruning predicate, add a config option to turn on the full display https://github.com/apache/arrow-datafusion/pull/9223#discussion_r1525215609
- Consider to move
wrap_case_expr
intoPruningExpressionBuilder
https://github.com/apache/arrow-datafusion/pull/9223#discussion_r1525209135
I think this feature is done now that https://github.com/apache/arrow-datafusion/pull/9223#discussion_r1525209135 is merged so resolving this ticket
@appletreeisyellow if you plan additional work, let's track them in follow on tickets