datafusion
datafusion copied to clipboard
Incorrect results with expression resolution
Describe the bug
DataFusion will sometimes resolve expressions incorrectly when the alias shadows an expression
To Reproduce
DataFusion CLI v37.1.0
> select a + b from (select 1 as a, 2 as b, 1 as "a + b");
+-------+
| a + b |
+-------+
| 1 | <- Should be 3
+-------+
1 row(s) fetched.
Elapsed 0.009 seconds.
Expected behavior
Result should be 3
Additional context
This was found by @peter-toth in https://github.com/apache/datafusion/pull/10396
As @peter-toth mentions in https://github.com/apache/datafusion/pull/10396#issuecomment-2099072200 this appears to be related to name resolution (not CSE). I checked with EXPLAIN VERBOSE and indeed the initial plan appears to be incorret
DataFusion CLI v37.1.0
> explain verbose select a + b from (select 1 as a, 2 as b, 1 as "a + b");
+------------------------------------------------------------+--------------------------------------------------------------------------------------------+
| plan_type | plan |
+------------------------------------------------------------+--------------------------------------------------------------------------------------------+
| initial_logical_plan | Projection: a + b |
| | Projection: Int64(1) AS a, Int64(2) AS b, Int64(1) AS a + b |
| | EmptyRelation |
| logical_plan after apply_function_rewrites | SAME TEXT AS ABOVE |
| logical_plan after inline_table_scan | SAME TEXT AS ABOVE |
| logical_plan after type_coercion | SAME TEXT AS ABOVE |
| logical_plan after count_wildcard_rule | SAME TEXT AS ABOVE |
| analyzed_logical_plan | SAME TEXT AS ABOVE |
| logical_plan after eliminate_nested_union | SAME TEXT AS ABOVE |
| logical_plan after simplify_expressions | SAME TEXT AS ABOVE |
| logical_plan after unwrap_cast_in_comparison | SAME TEXT AS ABOVE |
| logical_plan after replace_distinct_aggregate | SAME TEXT AS ABOVE |
| logical_plan after eliminate_join | SAME TEXT AS ABOVE |
| logical_plan after decorrelate_predicate_subquery | SAME TEXT AS ABOVE |
| logical_plan after scalar_subquery_to_join | SAME TEXT AS ABOVE |
| logical_plan after extract_equijoin_predicate | SAME TEXT AS ABOVE |
| logical_plan after simplify_expressions | SAME TEXT AS ABOVE |
| logical_plan after rewrite_disjunctive_predicate | SAME TEXT AS ABOVE |
| logical_plan after eliminate_duplicated_expr | SAME TEXT AS ABOVE |
| logical_plan after eliminate_filter | SAME TEXT AS ABOVE |
| logical_plan after eliminate_cross_join | SAME TEXT AS ABOVE |
| logical_plan after common_sub_expression_eliminate | SAME TEXT AS ABOVE |
| logical_plan after eliminate_limit | SAME TEXT AS ABOVE |
| logical_plan after propagate_empty_relation | SAME TEXT AS ABOVE |
| logical_plan after eliminate_one_union | SAME TEXT AS ABOVE |
| logical_plan after filter_null_join_keys | SAME TEXT AS ABOVE |
| logical_plan after eliminate_outer_join | SAME TEXT AS ABOVE |
| logical_plan after push_down_limit | SAME TEXT AS ABOVE |
| logical_plan after push_down_filter | SAME TEXT AS ABOVE |
| logical_plan after single_distinct_aggregation_to_group_by | SAME TEXT AS ABOVE |
| logical_plan after simplify_expressions | SAME TEXT AS ABOVE |
| logical_plan after unwrap_cast_in_comparison | SAME TEXT AS ABOVE |
| logical_plan after common_sub_expression_eliminate | SAME TEXT AS ABOVE |
| logical_plan after optimize_projections | Projection: Int64(1) AS a + b |
| | EmptyRelation |
| logical_plan after eliminate_nested_union | SAME TEXT AS ABOVE |
| logical_plan after simplify_expressions | SAME TEXT AS ABOVE |
| logical_plan after unwrap_cast_in_comparison | SAME TEXT AS ABOVE |
| logical_plan after replace_distinct_aggregate | SAME TEXT AS ABOVE |
| logical_plan after eliminate_join | SAME TEXT AS ABOVE |
| logical_plan after decorrelate_predicate_subquery | SAME TEXT AS ABOVE |
| logical_plan after scalar_subquery_to_join | SAME TEXT AS ABOVE |
| logical_plan after extract_equijoin_predicate | SAME TEXT AS ABOVE |
| logical_plan after simplify_expressions | SAME TEXT AS ABOVE |
| logical_plan after rewrite_disjunctive_predicate | SAME TEXT AS ABOVE |
| logical_plan after eliminate_duplicated_expr | SAME TEXT AS ABOVE |
| logical_plan after eliminate_filter | SAME TEXT AS ABOVE |
| logical_plan after eliminate_cross_join | SAME TEXT AS ABOVE |
| logical_plan after common_sub_expression_eliminate | SAME TEXT AS ABOVE |
| logical_plan after eliminate_limit | SAME TEXT AS ABOVE |
| logical_plan after propagate_empty_relation | SAME TEXT AS ABOVE |
| logical_plan after eliminate_one_union | SAME TEXT AS ABOVE |
| logical_plan after filter_null_join_keys | SAME TEXT AS ABOVE |
| logical_plan after eliminate_outer_join | SAME TEXT AS ABOVE |
| logical_plan after push_down_limit | SAME TEXT AS ABOVE |
| logical_plan after push_down_filter | SAME TEXT AS ABOVE |
| logical_plan after single_distinct_aggregation_to_group_by | SAME TEXT AS ABOVE |
| logical_plan after simplify_expressions | SAME TEXT AS ABOVE |
| logical_plan after unwrap_cast_in_comparison | SAME TEXT AS ABOVE |
| logical_plan after common_sub_expression_eliminate | SAME TEXT AS ABOVE |
| logical_plan after optimize_projections | SAME TEXT AS ABOVE |
| logical_plan | Projection: Int64(1) AS a + b |
| | EmptyRelation |
| initial_physical_plan | ProjectionExec: expr=[1 as a + b] |
| | PlaceholderRowExec |
| | |
| initial_physical_plan_with_stats | ProjectionExec: expr=[1 as a + b], statistics=[Rows=Exact(1), Bytes=Exact(8), [(Col[0]:)]] |
| | PlaceholderRowExec, statistics=[Rows=Exact(1), Bytes=Exact(0), []] |
| | |
| physical_plan after OutputRequirements | OutputRequirementExec |
| | ProjectionExec: expr=[1 as a + b] |
| | PlaceholderRowExec |
| | |
| physical_plan after aggregate_statistics | SAME TEXT AS ABOVE |
| physical_plan after join_selection | SAME TEXT AS ABOVE |
| physical_plan after LimitedDistinctAggregation | SAME TEXT AS ABOVE |
| physical_plan after EnforceDistribution | SAME TEXT AS ABOVE |
| physical_plan after CombinePartialFinalAggregate | SAME TEXT AS ABOVE |
| physical_plan after EnforceSorting | SAME TEXT AS ABOVE |
| physical_plan after OptimizeAggregateOrder | SAME TEXT AS ABOVE |
| physical_plan after ProjectionPushdown | SAME TEXT AS ABOVE |
| physical_plan after coalesce_batches | SAME TEXT AS ABOVE |
| physical_plan after OutputRequirements | ProjectionExec: expr=[1 as a + b] |
| | PlaceholderRowExec |
| | |
| physical_plan after PipelineChecker | SAME TEXT AS ABOVE |
| physical_plan after LimitAggregation | SAME TEXT AS ABOVE |
| physical_plan after ProjectionPushdown | SAME TEXT AS ABOVE |
| physical_plan | ProjectionExec: expr=[1 as a + b] |
| | PlaceholderRowExec |
| | |
| physical_plan_with_stats | ProjectionExec: expr=[1 as a + b], statistics=[Rows=Exact(1), Bytes=Exact(8), [(Col[0]:)]] |
| | PlaceholderRowExec, statistics=[Rows=Exact(1), Bytes=Exact(0), []] |
| | |
+------------------------------------------------------------+--------------------------------------------------------------------------------------------+
77 row(s) fetched.
Elapsed 0.032 seconds.
It is caused by the columnize_expr function. Based on the comments, I guess columnize_expr is for special needs related to aggregation, and it was introduced in PR #55.
Maybe we can reimplement columnize_expr by directly comparing the aggregated output expressions instead of their display names. Let me try to fix this issue.
take
Wow https://github.com/apache/datafusion/pull/55 is a very old one. Thank you @jonahgao -- this will be a great fix