Implement Common Subexpression Elimination optimizer rule
What is the problem the feature request solves?
When running TPC-H q1 in Spark with Comet, the expression l_extendedprice#21 * (1 - l_discount#22) appears twice in the query and is currently evaluated twice. It could be factored out so that it is evaluated only once. I was able to confirm the benefit by manually rewriting the query.
Original Query
select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from
lineitem
where
l_shipdate <= date '1998-12-01' - interval '68 days'
group by
l_returnflag,
l_linestatus
order by
l_returnflag,
l_linestatus;
Optimized Query
select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(foo) as sum_disc_price,
sum(foo * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from (select
l_returnflag,
l_linestatus,
l_quantity,
l_extendedprice,
l_extendedprice * (1 - l_discount) as foo,
l_tax,
l_discount
from lineitem
where
l_shipdate <= date '1998-12-01' - interval '68 days')
group by
l_returnflag,
l_linestatus
order by
l_returnflag,
l_linestatus;
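The manual rewrite above is exactly what a common-subexpression-elimination pass would do mechanically: find a subexpression that appears in more than one projection or aggregate input, compute it once in a child projection, and reference the result by name. A minimal sketch of that idea over a toy expression tree (the tuple representation and all function names here are hypothetical, not DataFusion or Spark APIs):

```python
from collections import Counter

# Toy expression tree: ("col", name), ("lit", value), or (op, left, right).
def subexprs(e):
    """Yield every compound subexpression of e, including e itself (post-order)."""
    if e[0] in ("col", "lit"):
        return
    for child in e[1:]:
        yield from subexprs(child)
    yield e

def eliminate_common(exprs):
    """Factor subexpressions that occur more than once into a child projection.

    Returns (rewritten_exprs, child_projection), where child_projection maps a
    generated column name to the shared subexpression it computes.
    """
    counts = Counter(s for e in exprs for s in subexprs(e))
    dups = [e for e, n in counts.items() if n > 1]
    # Keep only maximal duplicates; a duplicate nested inside another duplicate
    # is already covered when its parent is factored out.
    shared = {e: f"__cse_{i}" for i, e in enumerate(
        e for e in dups
        if not any(o != e and e in set(subexprs(o)) for o in dups))}

    def rewrite(e):
        if e in shared:
            return ("col", shared[e])
        if e[0] in ("col", "lit"):
            return e
        return (e[0],) + tuple(rewrite(c) for c in e[1:])

    child = {name: e for e, name in shared.items()}
    return [rewrite(e) for e in exprs], child

# The two TPC-H q1 aggregate inputs that share a subexpression:
disc_price = ("*", ("col", "l_extendedprice"),
                   ("-", ("lit", 1), ("col", "l_discount")))
charge = ("*", disc_price, ("+", ("lit", 1), ("col", "l_tax")))
rewritten, child = eliminate_common([disc_price, charge])
```

Run on the two aggregate inputs, the pass replaces both occurrences of l_extendedprice * (1 - l_discount) with a reference to a single generated column, mirroring the `foo` column in the rewritten query.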
Timings (Original)
13.752424478530884,
11.648030281066895,
11.35965895652771,
11.335061311721802,
11.383593797683716,
11.291191101074219,
11.31091046333313,
11.351991653442383,
11.32134222984314,
11.374904155731201
Timings (Optimized)
12.734412908554077,
10.684742212295532,
10.157625198364258,
10.06518030166626,
10.043614149093628,
9.986022233963013,
9.939271688461304,
9.925782918930054,
10.024176120758057,
10.018519401550293
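For reference, averaging the ten runs in each list (units presumably seconds, as reported by the benchmark harness) puts the manual rewrite at roughly an 11% reduction in mean runtime:

```python
# Means of the timing lists reported above.
original = [13.752424478530884, 11.648030281066895, 11.35965895652771,
            11.335061311721802, 11.383593797683716, 11.291191101074219,
            11.31091046333313, 11.351991653442383, 11.32134222984314,
            11.374904155731201]
optimized = [12.734412908554077, 10.684742212295532, 10.157625198364258,
             10.06518030166626, 10.043614149093628, 9.986022233963013,
             9.939271688461304, 9.925782918930054, 10.024176120758057,
             10.018519401550293]

mean_orig = sum(original) / len(original)   # ~11.61
mean_opt = sum(optimized) / len(optimized)  # ~10.36
improvement = 1 - mean_opt / mean_orig      # ~0.108, i.e. ~11% lower mean runtime
```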
Spark UI (Original)
(screenshot not reproduced)
Spark UI (Optimized)
(screenshot not reproduced)
Good finding. I think this kind of optimization should be in the Spark optimizer instead.
I remember Spark SQL has a corresponding optimization rule, but I am not sure why it does not apply to this query.
Related to this, it would be nice if we could improve the metrics for CometHashAggregate to show the time for evaluating the aggregate input expressions. I am not sure how much work that would be though.
> Related to this, it would be nice if we could improve the metrics for CometHashAggregate to show the time for evaluating the aggregate input expressions. I am not sure how much work that would be though.
Sounds good. It should be added to the DataFusion hash aggregate operator.
> Good finding. I think this kind of optimization should be in the Spark optimizer instead.
It would make sense for Spark to add this, but it could also be beneficial for DataFusion to support it as a physical optimizer rule.
I filed https://github.com/apache/datafusion/issues/12599
Spark has it, but not at the plan level. Instead they do it as part of their code generation: https://github.com/apache/spark/blob/v3.5.3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L1064-L1098C1
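The codegen-level approach differs from a plan rewrite: rather than restructuring the plan, the generated evaluator computes each shared subexpression once per row and reuses the cached value. A rough interpreted analogue of that idea (illustrative only; the expression representation matches nothing in Spark's or DataFusion's actual code):

```python
# Per-row evaluation with a memo for compound subexpressions, so a shared
# subexpression like l_extendedprice * (1 - l_discount) is computed once per row.
def evaluate(e, row, memo):
    """Evaluate expression e against a row dict, caching compound results."""
    if e[0] == "col":
        return row[e[1]]
    if e[0] == "lit":
        return e[1]
    if e in memo:  # shared subexpression already computed for this row
        return memo[e]
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b}
    result = ops[e[0]](evaluate(e[1], row, memo), evaluate(e[2], row, memo))
    memo[e] = result
    return result

row = {"l_extendedprice": 100.0, "l_discount": 0.1, "l_tax": 0.05}
disc_price = ("*", ("col", "l_extendedprice"),
                   ("-", ("lit", 1), ("col", "l_discount")))
charge = ("*", disc_price, ("+", ("lit", 1), ("col", "l_tax")))
memo = {}                             # fresh memo for each row
v1 = evaluate(disc_price, row, memo)  # 100 * (1 - 0.1)
v2 = evaluate(charge, row, memo)      # reuses the cached disc_price value
```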
There is now a DataFusion PR to add this feature: https://github.com/apache/datafusion/pull/13046
The DataFusion PR https://github.com/apache/datafusion/pull/13046 is still waiting for a review. I am adding this issue back onto the 0.6 milestone as a reminder.