datafusion-comet icon indicating copy to clipboard operation
datafusion-comet copied to clipboard

Optimize filters to remove redundant IsNotNull checks

Open andygrove opened this issue 1 year ago • 4 comments

What is the problem the feature request solves?

I am comparing native query plans between Comet and Ballista for TPC-H q1 and noticed a significant difference between the filter expressions ~and performance~:

Comet (~total filter time 7.2 seconds~):

FilterExec: col_6@6 IS NOT NULL AND col_6@6 <= 1998-09-24

Ballista (~total filter time 3.3 seconds~):

FilterExec: l_shipdate@6 <= 10493

The differences are:

  • Comet evaluates 3 expressions (And, IsNotNull, LtEq) compared to 1 expression in Ballista (LtEq)
  • Comet compares the date to a date literal, Ballista compares to an integer literal

We can likely improve Comet performance by eliding the redundant IsNotNull and And. I am not sure if there is a difference with the date versus int literal, but we should check.

Describe the potential solution

No response

Additional context

No response

andygrove avatar Sep 11 '24 16:09 andygrove

The Display implementation for ScalarValue changed between DataFusion 37 (the version that Ballista is using) and the version that Comet version. In the older version, Date32 is shown as an integer literal and now it is shown as a date.

andygrove avatar Sep 11 '24 18:09 andygrove

I tested a prototype of optimizing this filter and saw a 7% improvement in filter time for this query. It seems worth implementing.

andygrove avatar Sep 11 '24 23:09 andygrove

This might work ok for tpc-h but tpc-ds data has nulls and the null check is required perhaps? Does ballista know about the nullability of the data?

parthchandra avatar Sep 16 '24 22:09 parthchandra

This might work ok for tpc-h but tpc-ds data has nulls and the null check is required perhaps? Does ballista know about the nullability of the data?

Yes, the TPC-H data in this case is known not to contain nulls, as shown in the Parquet schema below, so the IsNotNull check here is redundant. For TPC-DS where the schema allows nulls, we would still need the check.

$ bdt schema lineitem.parquet/
+-----------------+-------------------+-------------+
| column_name     | data_type         | is_nullable |
+-----------------+-------------------+-------------+
| l_orderkey      | Int64             | NO          |
| l_partkey       | Int64             | NO          |
| l_suppkey       | Int64             | NO          |
| l_linenumber    | Int32             | NO          |
| l_quantity      | Decimal128(11, 2) | NO          |
| l_extendedprice | Decimal128(11, 2) | NO          |
| l_discount      | Decimal128(11, 2) | NO          |
| l_tax           | Decimal128(11, 2) | NO          |
| l_returnflag    | Utf8              | NO          |
| l_linestatus    | Utf8              | NO          |
| l_shipdate      | Date32            | NO          |
| l_commitdate    | Date32            | NO          |
| l_receiptdate   | Date32            | NO          |
| l_shipinstruct  | Utf8              | NO          |
| l_shipmode      | Utf8              | NO          |
| l_comment       | Utf8              | NO          |
+-----------------+-------------------+-------------+

andygrove avatar Sep 22 '24 16:09 andygrove