datafusion
datafusion copied to clipboard
Make `alias_symbol` more human-readable
Is your feature request related to a problem or challenge?
DataFusion CLI v37.1.0
> select * from number;
+----+----+----+
| c0 | c1 | c2 |
+----+----+----+
| 1 | 2 | 3 |
+----+----+----+
1 row(s) fetched.
Elapsed 0.005 seconds.
> explain select c0 + 1, count(c0 + 1) from number group by c0 + 1;
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Aggregate: groupBy=[[CAST(number.c0 AS Int64) + Int64(1)Int64(1)CAST(number.c0 AS Int64)number.c0 AS number.c0 + Int64(1)]], aggr=[[COUNT(CAST(number.c0 AS Int64) + Int64(1)Int64(1)CAST(number.c0 AS Int64)number.c0 AS number.c0 + Int64(1))]] |
| | Projection: CAST(number.c0 AS Int64) + Int64(1) AS CAST(number.c0 AS Int64) + Int64(1)Int64(1)CAST(number.c0 AS Int64)number.c0 |
| | TableScan: number projection=[c0] |
| physical_plan | AggregateExec: mode=FinalPartitioned, gby=[number.c0 + Int64(1)@0 as number.c0 + Int64(1)], aggr=[COUNT(number.c0 + Int64(1))] |
| | CoalesceBatchesExec: target_batch_size=8192 |
| | RepartitionExec: partitioning=Hash([number.c0 + Int64(1)@0], 8), input_partitions=1 |
| | AggregateExec: mode=Partial, gby=[CAST(number.c0 AS Int64) + Int64(1)Int64(1)CAST(number.c0 AS Int64)number.c0@0 as number.c0 + Int64(1)], aggr=[COUNT(number.c0 + Int64(1))] |
| | ProjectionExec: expr=[CAST(c0@0 AS Int64) + 1 as CAST(number.c0 AS Int64) + Int64(1)Int64(1)CAST(number.c0 AS Int64)number.c0] |
| | MemoryExec: partitions=1, partition_sizes=[1] |
| | |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.015 seconds.
ProjectionExec: expr=[CAST(c0@0 AS Int64) + 1 as CAST(number.c0 AS Int64) + Int64(1)Int64(1)CAST(number.c0 AS Int64)number.c0] |
The expr
of ProjectionExec
is confusing.
The related code: https://github.com/apache/datafusion/blob/b41ef20c5dad7bdd674e3cc5f35a9c99efae676c/datafusion/optimizer/src/common_subexpr_eliminate.rs#L676-L687
Maybe we should make it more human-readable.
Describe the solution you'd like
ProjectionExec: expr=[CAST(c0@0 AS Int64) + 1 as CAST(number.c0 AS Int64) + Int64(1)]
or
ProjectionExec: expr=[CAST(c0@0 AS Int64) + 1 as CAST(number.c0 AS Int64) + Int64(1)<$SUB_IDEN...>]
Describe alternatives you've considered
No response
Additional context
No response
Hey, this optimizer rule is used by so many test cases, so I wanted to collect opinions before proceeding with implementing the change.
My suggestion
Using let alias_symbol = format!("${{{curr_expr_id}}}");
to mark common expressions. So the query above would be:
AggregateExec: mode=Partial, gby=[${CAST(number.c0 AS Int64) + Int64(1)}@0 as number.c0 + Int64(1)], aggr=[COUNT(number.c0 + Int64(1))]
ProjectionExec: expr=[CAST(c0@0 AS Int64) + 1 as ${CAST(number.c0 AS Int64) + Int64(1)}]
Or to make plans shorter, $0, $1, etc...
Some naming alternatives I've considered
-
Using
@0
,@1
, etc ... However, the @ symbol is already used in the physical plan (check the query above), which means that you may have columns like@0@0
which makes no sense. -
let alias_symbol = format!("{curr_expr_id}").replace(" ", "_");
. Sound like a good idea until you see it.
AggregateExec: mode=Partial, gby=[CAST(number.c0_AS_Int64)_+_Int64(1)@0 as number.c0 + Int64(1)], aggr=[COUNT(number.c0 + Int64(1))]
ProjectionExec: expr=[CAST(c0@0 AS Int64) + 1 as CAST(number.c0_AS_Int64)_+_Int64(1)
It's not immediately obvious + what happens if the column doesn't have a space?
Please feel free to share other alternatives.
https://github.com/apache/datafusion/pull/10333 is a really nice PR 💯
We unfortunately found issues in https://github.com/apache/datafusion/pull/10333 so we are going to revert it in https://github.com/apache/datafusion/pull/10436
See discussion here: https://github.com/apache/datafusion/pull/10436#pullrequestreview-2048448513