perf: Apply DataFusion's projection pushdown rule
Which issue does this PR close?
Closes https://github.com/apache/datafusion-comet/issues/908
Rationale for this change
Improve performance (and reduce memory overhead) of HashJoinExec by pushing down projection into the join leveraging a rule that already exists in DataFusion.
Here is an example from TPC-DS q3.
Before
Note the two instances of ProjectionExec and no projection displayed in the HashJoinExec.
AggregateExec: mode=Partial, gby=[col_0@0 as col_0, col_3@3 as col_1, col_2@2 as col_2], aggr=[sum]
ProjectionExec: expr=[col_0@0 as col_0, col_2@2 as col_1, col_1@4 as col_2, col_2@5 as col_3]
ProjectionExec: expr=[col_0@3 as col_0, col_1@4 as col_1, col_2@5 as col_2, col_0@0 as col_0, col_1@1 as col_1, col_2@2 as col_2]
HashJoinExec: mode=Partitioned, join_type=Inner, on=[(col_0@0, col_1@1)]
After
Now there is only one ProjectionExec and a projection of 4 columns has been pushed into the HashJoinExec. I suspect that we could remove the remaining ProjectionExec with some additional work (in a separate PR).
AggregateExec: mode=Partial, gby=[col_0@0 as col_0, col_3@3 as col_1, col_2@2 as col_2], aggr=[sum]
ProjectionExec: expr=[col_0@2 as col_0, col_2@3 as col_1, col_1@0 as col_2, col_2@1 as col_3]
HashJoinExec: mode=Partitioned, join_type=Inner, on=[(col_0@0, col_1@1)], projection=[col_1@1, col_2@2, col_0@3, col_2@5]
What changes are included in this PR?
Enable optimizer with projection pushdown rule, which can potentially fuse the projection with other operators.
How are these changes tested?
Existing tests.
Thanks @andygrove Makes sense to me. Would you mind to share how much is the performance benefit?
I don't expect it to have much impact, but I am running some benchmarks now. I will post results later today.
The benchmark results are not very exciting, and the improvements could just be noise. However, there is some correlation with the improvements in q18 and q21 that were noted in https://github.com/apache/datafusion/pull/9236#issuecomment-1950234292.
The benchmark results are not very exciting, and the improvements could just be noise. However, there is some correlation with the improvements in q18 and q21 that were noted in apache/datafusion#9236 (comment).
It makes sense. As the projection happens after join in HashJoin operator, it looks more like an early projection from the upper Projection operator. I don't expect this could get some performance on the join.