Support "A column is known to be entirely NULL" in `PruningPredicate`

Open appletreeisyellow opened this issue 1 year ago • 0 comments

Part of #9171

Rationale for this change

What changes are included in this PR?

  1. Add a new method, PruningPredicate::row_counts(), that returns the total row count of each container.
  2. Use PruningPredicate::row_counts() together with PruningPredicate::null_counts() to identify containers in which a column is entirely NULL. This is done by wrapping the pruning predicate in a CASE expression:
    CASE
      WHEN x_null_count = x_row_count THEN false
      ELSE <current_pruning_predicate>
    END
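The CASE wrapper can be read as a small decision function. The following is a minimal sketch in plain Rust, not DataFusion code: `keep_container` is a hypothetical helper, `None` stands for SQL NULL, and treating an unknown result as "keep the container" is an assumption of this sketch.

```rust
/// Hypothetical helper (not part of DataFusion): returns true when a
/// container (for example a row group) must still be read.
///
/// `inner` is the result of the existing pruning predicate; `None`
/// models SQL NULL, i.e. "unknown".
fn keep_container(
    null_count: Option<u64>,
    row_count: Option<u64>,
    inner: Option<bool>,
) -> bool {
    match (null_count, row_count) {
        // WHEN x_null_count = x_row_count THEN false
        (Some(nulls), Some(rows)) if nulls == rows => false,
        // ELSE <current_pruning_predicate>
        // A missing count makes the WHEN comparison NULL, so the ELSE
        // branch applies. Assumption: an unknown result keeps the container.
        _ => inner.unwrap_or(true),
    }
}
```

If either count is unavailable, the wrapper changes nothing and the container is judged by the existing predicate alone.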
    

Example 1

If a query has a predicate like:

x = 10

then instead of rewriting it to

x_min <= 10 AND 10 <= x_max

the pruning predicate is rewritten to something like

CASE
	WHEN x_null_count = x_row_count THEN false
	ELSE x_min <= 10 AND 10 <= x_max
END
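To see why the extra branch matters, here is a hedged sketch of Example 1 in plain Rust. `ColumnStats` and the functions are hypothetical and do not use the DataFusion API. The sketch assumes that an all-NULL container reports NULL for its min and max, and that an unknown predicate result keeps the container.

```rust
/// Hypothetical per-container statistics for column `x`.
/// `None` models SQL NULL (statistic unavailable).
struct ColumnStats {
    min: Option<i64>,
    max: Option<i64>,
    null_count: u64,
    row_count: u64,
}

/// SQL three-valued AND.
fn and3(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        _ => None,
    }
}

/// Previous rewrite of `x = lit`: x_min <= lit AND lit <= x_max
fn old_rewrite(s: &ColumnStats, lit: i64) -> Option<bool> {
    and3(s.min.map(|min| min <= lit), s.max.map(|max| lit <= max))
}

/// New rewrite: the same predicate guarded by the all-NULL check.
fn new_rewrite(s: &ColumnStats, lit: i64) -> Option<bool> {
    if s.null_count == s.row_count {
        Some(false) // every value is NULL, so `x = lit` can never be true
    } else {
        old_rewrite(s, lit)
    }
}

/// Assumption of this sketch: an unknown (NULL) result keeps the container.
fn keep(result: Option<bool>) -> bool {
    result.unwrap_or(true)
}
```

Under these assumptions, a container with 50 rows and 50 NULLs is kept by `old_rewrite` (the min/max comparison is unknown) but pruned by `new_rewrite`.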

Example 2

A more complicated example:

x < 5 AND x > 0 OR y = 10

instead of rewriting it to

x_min < 5 AND 0 < x_max OR (y_min <= 10 AND 10 <= y_max)

the pruning predicate is rewritten to something like

-- x < 5
CASE
  WHEN x_null_count = x_row_count THEN false
  ELSE x_min < 5
END
AND
-- x > 0
CASE
  WHEN x_null_count = x_row_count THEN false
  ELSE 0 < x_max
END
OR
-- y = 10
CASE
  WHEN y_null_count = y_row_count THEN false
  ELSE y_min <= 10 AND 10 <= y_max
END
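The composition in Example 2 can be sketched the same way. This is an illustration in plain Rust with hypothetical names (`Stats`, `guard`, `may_match`), not the DataFusion implementation. It uses the bounds under which a container may hold a matching row (`x_min < 5` for `x < 5`, `0 < x_max` for `x > 0`) and assumes an unknown result keeps the container.

```rust
/// Hypothetical per-container statistics for one column.
struct Stats {
    min: Option<i64>,
    max: Option<i64>,
    null_count: u64,
    row_count: u64,
}

/// SQL three-valued AND and OR; `None` models NULL.
fn and3(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        _ => None,
    }
}

fn or3(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        (Some(true), _) | (_, Some(true)) => Some(true),
        (Some(false), Some(false)) => Some(false),
        _ => None,
    }
}

/// CASE WHEN null_count = row_count THEN false ELSE inner END
fn guard(s: &Stats, inner: Option<bool>) -> Option<bool> {
    if s.null_count == s.row_count { Some(false) } else { inner }
}

/// Rewritten form of `x < 5 AND x > 0 OR y = 10`.
fn may_match(x: &Stats, y: &Stats) -> bool {
    let x_lt_5 = guard(x, x.min.map(|min| min < 5));
    let x_gt_0 = guard(x, x.max.map(|max| 0 < max));
    let y_eq_10 = guard(
        y,
        and3(y.min.map(|min| min <= 10), y.max.map(|max| 10 <= max)),
    );
    // AND binds tighter than OR, as in SQL.
    // Assumption: an unknown result keeps the container.
    or3(and3(x_lt_5, x_gt_0), y_eq_10).unwrap_or(true)
}
```

With `x` entirely NULL and `y` ranging over 20..30, every branch evaluates to false and the container can be skipped; without the guards the `x` branches would be unknown and the container would be kept.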

Are these changes tested?

Yes. Existing tests were updated and more test coverage was added.

Are there any user-facing changes?

Yes, PruningPredicate gains a new API: PruningPredicate::row_counts()

appletreeisyellow • Feb 13 '24 22:02