datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Make `alias_symbol` more human-readable

Open JasonLi-cn opened this issue 9 months ago • 3 comments

Is your feature request related to a problem or challenge?

DataFusion CLI v37.1.0
> select * from number;
+----+----+----+
| c0 | c1 | c2 |
+----+----+----+
| 1  | 2  | 3  |
+----+----+----+
1 row(s) fetched.
Elapsed 0.005 seconds.

> explain select c0 + 1, count(c0 + 1) from number group by c0 + 1;
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                              |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Aggregate: groupBy=[[CAST(number.c0 AS Int64) + Int64(1)Int64(1)CAST(number.c0 AS Int64)number.c0 AS number.c0 + Int64(1)]], aggr=[[COUNT(CAST(number.c0 AS Int64) + Int64(1)Int64(1)CAST(number.c0 AS Int64)number.c0 AS number.c0 + Int64(1))]] |
|               |   Projection: CAST(number.c0 AS Int64) + Int64(1) AS CAST(number.c0 AS Int64) + Int64(1)Int64(1)CAST(number.c0 AS Int64)number.c0                                                                                                                 |
|               |     TableScan: number projection=[c0]                                                                                                                                                                                                             |
| physical_plan | AggregateExec: mode=FinalPartitioned, gby=[number.c0 + Int64(1)@0 as number.c0 + Int64(1)], aggr=[COUNT(number.c0 + Int64(1))]                                                                                                                    |
|               |   CoalesceBatchesExec: target_batch_size=8192                                                                                                                                                                                                     |
|               |     RepartitionExec: partitioning=Hash([number.c0 + Int64(1)@0], 8), input_partitions=1                                                                                                                                                           |
|               |       AggregateExec: mode=Partial, gby=[CAST(number.c0 AS Int64) + Int64(1)Int64(1)CAST(number.c0 AS Int64)number.c0@0 as number.c0 + Int64(1)], aggr=[COUNT(number.c0 + Int64(1))]                                                               |
|               |         ProjectionExec: expr=[CAST(c0@0 AS Int64) + 1 as CAST(number.c0 AS Int64) + Int64(1)Int64(1)CAST(number.c0 AS Int64)number.c0]                                                                                                            |
|               |           MemoryExec: partitions=1, partition_sizes=[1]                                                                                                                                                                                           |
|               |                                                                                                                                                                                                                                                   |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.015 seconds.
ProjectionExec: expr=[CAST(c0@0 AS Int64) + 1 as CAST(number.c0 AS Int64) + Int64(1)Int64(1)CAST(number.c0 AS Int64)number.c0]                                                                                                            |

The expr of ProjectionExec is confusing.

The related code: https://github.com/apache/datafusion/blob/b41ef20c5dad7bdd674e3cc5f35a9c99efae676c/datafusion/optimizer/src/common_subexpr_eliminate.rs#L676-L687

Maybe we should make it more human-readable.

Describe the solution you'd like

ProjectionExec: expr=[CAST(c0@0 AS Int64) + 1 as CAST(number.c0 AS Int64) + Int64(1)]                                                                                                            

or

ProjectionExec: expr=[CAST(c0@0 AS Int64) + 1 as CAST(number.c0 AS Int64) + Int64(1)<$SUB_IDEN...>]                                                                                              

Describe alternatives you've considered

No response

Additional context

No response

JasonLi-cn avatar Apr 29 '24 08:04 JasonLi-cn

Hey, this optimizer rule is used by so many test cases, so I wanted to collect opinions before proceeding with implementing the change.

My suggestion

Using let alias_symbol = format!("${{{curr_expr_id}}}"); to mark common expressions. So the query above would be:

 AggregateExec: mode=Partial, gby=[${CAST(number.c0 AS Int64) + Int64(1)}@0 as number.c0 + Int64(1)], aggr=[COUNT(number.c0 + Int64(1))] 
    ProjectionExec: expr=[CAST(c0@0 AS Int64) + 1 as ${CAST(number.c0 AS Int64) + Int64(1)}] 

Or to make plans shorter, $0, $1, etc...

Some naming alternatives I've considered

  • Using @0, @1, etc ... However, the @ symbol is already used in the physical plan (check the query above), which means that you may have columns like @0@0 which makes no sense.

  • let alias_symbol = format!("{curr_expr_id}").replace(" ", "_");. Sound like a good idea until you see it.

AggregateExec: mode=Partial, gby=[CAST(number.c0_AS_Int64)_+_Int64(1)@0 as number.c0 + Int64(1)], aggr=[COUNT(number.c0 + Int64(1))] 
    ProjectionExec: expr=[CAST(c0@0 AS Int64) + 1 as CAST(number.c0_AS_Int64)_+_Int64(1)

It's not immediately obvious + what happens if the column doesn't have a space?

Please feel free to share other alternatives.

MohamedAbdeen21 avatar May 03 '24 20:05 MohamedAbdeen21

https://github.com/apache/datafusion/pull/10333 is a really nice PR 💯

alamb avatar May 06 '24 17:05 alamb

We unfortunately found issues in https://github.com/apache/datafusion/pull/10333 so we are going to revert it in https://github.com/apache/datafusion/pull/10436

See discussion here: https://github.com/apache/datafusion/pull/10436#pullrequestreview-2048448513

alamb avatar May 09 '24 17:05 alamb