Optimize filters to remove redundant IsNotNull checks
What is the problem the feature request solves?
I am comparing native query plans between Comet and Ballista for TPC-H q1 and noticed a significant difference between the filter expressions ~and performance~:
Comet (~total filter time 7.2 seconds~):
FilterExec: col_6@6 IS NOT NULL AND col_6@6 <= 1998-09-24
Ballista (~total filter time 3.3 seconds~):
FilterExec: l_shipdate@6 <= 10493
The differences are:
- Comet evaluates 3 expressions (And, IsNotNull, LtEq) compared to 1 expression in Ballista (LtEq)
- Comet compares the date to a date literal, Ballista compares to an integer literal
We can likely improve Comet performance by eliding the redundant IsNotNull and And. I am not sure if there is a difference with the date versus int literal, but we should check.
Describe the potential solution
No response
Additional context
No response
The Display implementation for ScalarValue changed between DataFusion 37 (the version that Ballista is using) and the version that Comet version. In the older version, Date32 is shown as an integer literal and now it is shown as a date.
I tested a prototype of optimizing this filter and saw a 7% improvement in filter time for this query. It seems worth implementing.
This might work ok for tpc-h but tpc-ds data has nulls and the null check is required perhaps? Does ballista know about the nullability of the data?
This might work ok for tpc-h but tpc-ds data has nulls and the null check is required perhaps? Does ballista know about the nullability of the data?
Yes, the TPC-H data in this case is known not to contain nulls, as shown in the Parquet schema below, so the IsNotNull check here is redundant. For TPC-DS where the schema allows nulls, we would still need the check.
$ bdt schema lineitem.parquet/
+-----------------+-------------------+-------------+
| column_name | data_type | is_nullable |
+-----------------+-------------------+-------------+
| l_orderkey | Int64 | NO |
| l_partkey | Int64 | NO |
| l_suppkey | Int64 | NO |
| l_linenumber | Int32 | NO |
| l_quantity | Decimal128(11, 2) | NO |
| l_extendedprice | Decimal128(11, 2) | NO |
| l_discount | Decimal128(11, 2) | NO |
| l_tax | Decimal128(11, 2) | NO |
| l_returnflag | Utf8 | NO |
| l_linestatus | Utf8 | NO |
| l_shipdate | Date32 | NO |
| l_commitdate | Date32 | NO |
| l_receiptdate | Date32 | NO |
| l_shipinstruct | Utf8 | NO |
| l_shipmode | Utf8 | NO |
| l_comment | Utf8 | NO |
+-----------------+-------------------+-------------+