datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Support "A column is known to be entirely NULL" in `PruningPredicate`

Open alamb opened this issue 1 year ago • 11 comments

Is your feature request related to a problem or challenge?

This is broken out from https://github.com/apache/arrow-datafusion/issues/7869 which is describing a slightly different problem

PruningPredicate can't be told about columns that are known to contain only NULL. It can be told which columns have no nulls (via the PruningStatistics::null_counts()).

Columns that contain only NULL occur in tables that have "schema evolution" -- for example if you have two files such as

File 1: col_a File 2: col_a, col_b (col_b was added later)

A predicate like col_a != A AND col_b='bananas' can not be true for File 1 (as col_B is logically NULL for all rows)

This is subtly, but importantly different than the case when nothing is known about the column, which confusingly is encoded by returning NULL from PruningStatistics::min_values()

Describe the solution you'd like

  1. Add a new method PruningStatistics::row_counts() to get the total row counts in each container.
  2. Use the information from PruningStatistics::row_counts() and PruningStatistics::null_counts() to determine containers where columns are entirely NULL
  3. Rewrite the predicate, replacing references to columns known to be NULL with a NULL literal and try to simplify the expressions (e.g. a = 5 --> NULL = 5 --> NULL)

For the example in this ticket's description with predicate col_a != A AND col_b='bananas' where col_b is not known and the relevant container had 100 rows,

  1. the relevant PruningStatistics would return col_b: {null_count = 100, row_count = 100}
  2. PruningPredicate::prune would determine col_b was entirely null, and would rewrite the predicate to be col_a != A AND NULL = 'bananas'.
  3. The pruning rewrite would happen again, and this time would not try to fetch min/max statistics for col_b and thus could be proven to be not true.

Describe alternatives you've considered

No response

Additional context

No response

alamb avatar Feb 09 '24 02:02 alamb

cc @appletreeisyellow

alamb avatar Feb 09 '24 02:02 alamb

Thank you @alamb!

One question for this line:

File 2: col_b, col_B (col_b was added later)

Do you actually mean: File 2: col_a, col_b (col_b was added later)

appletreeisyellow avatar Feb 09 '24 19:02 appletreeisyellow

Do you actually mean:

Yes, I have update the ticket. Thank you

alamb avatar Feb 09 '24 20:02 alamb

Thanks for verifying. I plan to work on this issue

appletreeisyellow avatar Feb 09 '24 21:02 appletreeisyellow

One thought I had about this feature was to consider writing predicates like

y = 10

Instead of

y_min <= 10 AND y_max >=10

To something like

CASE 
  WHEN y_null_count == y_row_count THEN false
  ELSE y_min <= 10 AND y_max >=10
END

where y_row_count is a new column based on the value of PruningStatistics::row_counts

I think this would only be valid for top level expressions in the AND clause (not any arbitrary sub expression)

alamb avatar Feb 11 '24 10:02 alamb

It might also now be a good time to look into how we could rewrite pruning predicate to use the range analysis in https://docs.rs/datafusion/latest/datafusion/physical_expr/intervals/cp_solver/index.html 🤔

That would be the more elegant (but likely much more substantial) solution in my opinion

alamb avatar Feb 11 '24 10:02 alamb

I think this would only be valid for top level expressions in the AND clause (not any arbitrary sub expression)

Do you mean something like this?

For example, rewriting a complicated predicate like x < 5 AND x > 0

instead of

x_max < 5 AND 0 < x_min

To something like (a)

CASE
  WHEN x_null_count = x_row_count THEN false
  ELSE x_max < 5 AND 0 < x_min
END

I don't think we want it to be (b)

CASE
  WHEN x_null_count = x_row_count THEN false
  ELSE x_max < 5
END
AND
CASE
  WHEN x_null_count = x_row_count THEN false
  ELSE 0 < x_min
END

appletreeisyellow avatar Feb 13 '24 23:02 appletreeisyellow

I think case expression can be wrapped around the OR clause as well. For example, x = 8 OR x < 0 can be rewrite into

CASE
  WHEN x_null_count = x_row_count THEN false
  ELSE (x_min <= 8 AND 8 <= x_max) OR x_max < 0
END

If we know that a container has column x to be entirely NULL, this container can be skipped, regardless the result of OR clause is true or false.

appletreeisyellow avatar Feb 13 '24 23:02 appletreeisyellow

For a predicate with more than one columns, we might want to have a WHEN clause for each column.

For example, rewriting a predicate like x < 5 AND x > 0 AND y = 10 into

CASE
  WHEN x_null_count = x_row_count THEN false
  WHEN y_null_count = y_row_count THEN false
  ELSE <rewrite for x < 5 AND x > 0 AND y = 10>
END

appletreeisyellow avatar Feb 13 '24 23:02 appletreeisyellow

It might also now be a good time to look into how we could rewrite pruning predicate to use the range analysis in docs.rs/datafusion/latest/datafusion/physical_expr/intervals/cp_solver/index.html 🤔

That would be the more elegant (but likely much more substantial) solution in my opinion

I'm interested in rewriting pruning predicate to use the range analysis! For now, I'm still writing in PhysicalExpr just to see how the code works. It would be nice to refactor into range analysis after I sort out basic logics

appletreeisyellow avatar Feb 13 '24 23:02 appletreeisyellow

After discussion with @alamb, we plan to do the implementation in two phases (i.e. two PRs):

  1. Turn each sub expression into a case expression
  2. Simplify the case expression and make it easy to read

1. Turn each sub expression into a case expression

Each sub expression will be rewritten into a case expression instead of wrapping the entire expression into one case expression. Each sub expression has its own case expression will make sure the pruning predict rewrite logic is correct.

For example, x < 5 AND x > 0 OR y = 10

will be rewritten into

# x < 5
CASE
  WHEN x_null_count = x_row_count THEN false
  ELSE x_max < 5 
END
AND
#  x > 0
CASE
  WHEN x_null_count = x_row_count THEN false
  ELSE 0 < x_min
END
OR
# y = 10
CASE
  WHEN y_null_count = y_row_count THEN false
  ELSE y_min <= 10 AND 10 <= y_max
END

2. Simplify the case expression and make it easy to read

The above example is formatted in a way that's easy to read. In the actual pruning predicate string, there is no new lines and indentation. The above example looks like this in the query explain:

CASE WHEN x_null_count = x_row_count THEN false ELSE x_max < 5 END AND CASE WHEN x_null_count = x_row_count THEN false ELSE 0 < x_min END OR CASE WHEN y_null_count = y_row_count THEN false ELSE y_min <= 10 AND 10 <= y_max END

As you can see, the final pruning predict rewrite can be long and hard to read. Therefore, we need phase 2 to improve the readability.

Probably add format, like () and new lines to the expression string. I will have a better idea after phase 1 PR is done

appletreeisyellow avatar Feb 14 '24 16:02 appletreeisyellow

There is a draft PR: https://github.com/apache/arrow-datafusion/pull/9223. The major feature is implemented, but there are two things I plan to do before I open it for review (see details).

I plan to continue the work on the week of March 4, 2024.

appletreeisyellow avatar Feb 26 '24 15:02 appletreeisyellow

Here is an update of the plan

1. Turn each sub expression into a case expression

Implemented in https://github.com/apache/arrow-datafusion/pull/9223

2. Simplify the case expression and make it easy to read

To be implemented in a follow-up PR:

  • Combine repeated CASE expressions https://github.com/apache/arrow-datafusion/pull/9223#discussion_r1525213777
  • By default, show part of pruning predicate (i.e., truncated one) so users can know there is pruning predicate, add a config option to turn on the full display https://github.com/apache/arrow-datafusion/pull/9223#discussion_r1525215609
  • Consider to move wrap_case_expr into PruningExpressionBuilder https://github.com/apache/arrow-datafusion/pull/9223#discussion_r1525209135

appletreeisyellow avatar Mar 14 '24 21:03 appletreeisyellow

I think this feature is done now that https://github.com/apache/arrow-datafusion/pull/9223#discussion_r1525209135 is merged so resolving this ticket

@appletreeisyellow if you plan additional work, let's track them in follow on tickets

alamb avatar Mar 22 '24 13:03 alamb