datafusion-comet icon indicating copy to clipboard operation
datafusion-comet copied to clipboard

perf: Apply DataFusion's projection pushdown rule

Open andygrove opened this issue 1 year ago • 3 comments

Which issue does this PR close?

Closes https://github.com/apache/datafusion-comet/issues/908

Rationale for this change

Improve performance (and reduce memory overhead) of HashJoinExec by pushing down projection into the join leveraging a rule that already exists in DataFusion.

Here is an example from TPC-DS q3.

Before

Note the two instances of ProjectionExec and no projection displayed in the HashJoinExec.

 AggregateExec: mode=Partial, gby=[col_0@0 as col_0, col_3@3 as col_1, col_2@2 as col_2], aggr=[sum]
  ProjectionExec: expr=[col_0@0 as col_0, col_2@2 as col_1, col_1@4 as col_2, col_2@5 as col_3]
    ProjectionExec: expr=[col_0@3 as col_0, col_1@4 as col_1, col_2@5 as col_2, col_0@0 as col_0, col_1@1 as col_1, col_2@2 as col_2]
      HashJoinExec: mode=Partitioned, join_type=Inner, on=[(col_0@0, col_1@1)]

After

Now there is only one ProjectionExec and a projection of 4 columns has been pushed into the HashJoinExec. I suspect that we could remove the remaining ProjectionExec with some additional work (in a separate PR).

 AggregateExec: mode=Partial, gby=[col_0@0 as col_0, col_3@3 as col_1, col_2@2 as col_2], aggr=[sum]
  ProjectionExec: expr=[col_0@2 as col_0, col_2@3 as col_1, col_1@0 as col_2, col_2@1 as col_3]
    HashJoinExec: mode=Partitioned, join_type=Inner, on=[(col_0@0, col_1@1)], projection=[col_1@1, col_2@2, col_0@3, col_2@5]

What changes are included in this PR?

Enable optimizer with projection pushdown rule, which can potentially fuse the projection with other operators.

How are these changes tested?

Existing tests.

andygrove avatar Sep 03 '24 23:09 andygrove

Thanks @andygrove Makes sense to me. Would you mind to share how much is the performance benefit?

I don't expect it to have much impact, but I am running some benchmarks now. I will post results later today.

andygrove avatar Sep 04 '24 15:09 andygrove

The benchmark results are not very exciting, and the improvements could just be noise. However, there is some correlation with the improvements in q18 and q21 that were noted in https://github.com/apache/datafusion/pull/9236#issuecomment-1950234292.

tpch_queries_speedup_abs

andygrove avatar Sep 04 '24 15:09 andygrove

The benchmark results are not very exciting, and the improvements could just be noise. However, there is some correlation with the improvements in q18 and q21 that were noted in apache/datafusion#9236 (comment).

It makes sense. As the projection happens after join in HashJoin operator, it looks more like an early projection from the upper Projection operator. I don't expect this could get some performance on the join.

viirya avatar Sep 10 '24 20:09 viirya