Remove `Alias` from `Expr`
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently a named expression / alias can be encoded by wrapping an existing Expr in an Expr::Alias.
However, this has some side effects:
-
We can encode expressions that don't make sense, such as
(1 + 1 AS x) AS yor(1 AS X) + (1 AS Y)(in SQL it's disallowed by the parser, but it's just to show it is possible to do this in the DataFrame API or the raw API. In dataframes we could do things likeexpr.alias("x").alias("y"). -
Code dealing with
Expralways have to deal with theAliascase, even when it doesn't care. This could lead to more complex code or even subtle bugs.
Describe the solution you'd like
- Create a new struct,
NamedExprthat is used to refer to a named expression
struct NamedExpr {
/// Alias or generated name
name: String,
/// The expression
expr: Expr,
}
-
The
NamedExprnow can be used inside projections, aggregates, etc. -
The function
aliasonDataFrameshould return anNamedExpr -
Add an
impl From<Expr> for NamedExprthat generates a name. -
Some functions can accept Into<NamedExpr> to keep being ergonomic to use.
Describe alternatives you've considered
Additional context
FYI @alamb @houqp
I think NamedExpr sounds like a great idea, personally. 👍
Make sense to me @Dandandan
It's not entirely clear to me how we are going to propagate it through out the code base. But if we can pull it off, that would make the code more readable and maintainable for sure :+1:
I think it is valuable, there are some case code just for handling the Alias::Expr, I'm willing to do this task.
I will push a draft PR for this issue this weekend.
@Dandandan . Hi, I'm a little confused about the details.
Specifically how to unify NamedExpr and Expr. I think option
Now alias in Expr:
Expr is enum as
- Alias
- Column
- ScalarVariable
- ....
After use NamedExpr,The structure is like this
- Alias
- OtherExpr
- Column
- ScalarVariable
- ....
Can we use this struct to present it?
struct NamedExpr {
/// Alias or generated name
name: Option<String>,
/// The expression
expr: Expr,
}
Hi @jackwener .
The idea is that any expression within DataFusion receives a name, so the nodes in a LogicalPlan use the NamedExpr type. The name in this type is either the alias for the expression if any or the generated name, so it doesn't need to be optional.
😅I found that it isn't easy problem, I need to think more about it. Expr is used in many places.
Aliases in SQL is just used in Table and Column in the fromClause and selectClause. So, like Aggregate, aggr_expr is NamedExpr, but group_expr isn't.
In addition, I find some Filter Expr use Alias after rewrite which make logicplan can be connected after rewrited.
It's related with #1319 #1316
It mayn't a good way by using alias. Because we must add alias after write Expr.
BTW, projection is same. we must add alias after rewrite plan
I think it is a future ticket.
I plan on working on this since I think it will solve some problems I am running into when working with expressions and schemas.
I spent some time on this and I am no longer sure that this change makes sense. We have a lot of optimizer rules that recurse through expression trees rewriting them and if we move the Alias variant out of the Expr enum it complicates things because we still need to recurse into the expression contained within the alias. It doesn't feel like the effort here justifies the benefit, IMO.
There are other Expr variants that only make sense in certain contexts as well. We can't have an Expr::Sort in a projection, for example. Maybe we just need some validation rules to ensure plans make sense and better utility methods for dealing with alias in cases where it is not valid?
I run into this recently. When constructing VALUES via API, the Alias expression can be thought to allow aliasing the VALUES columns, but this didn't work, alias got (silently) ignored. From such "unpleasant surprise" perspective, I would consider this as a bug.
To me, alias is not an expression at all. It's a feature of a select clause and would be best modeled as such.
There are other Expr variants that only make sense in certain contexts as well. We can't have an Expr::Sort in a projection, for example.
That's a good point and good example.
To me, in ORDER BY <expr> [ASC/DESC] [NULLS FIRST/LAST], the <expr> part is an expression (any expression), and the other attributes (asc/desc, nulls first/last) are attributes of the sorting. They don't have to be modeled as "an expression".
I run into this recently. When constructing VALUES via API, the Alias expression can be thought to allow aliasing the VALUES columns, but this didn't work, alias got (silently) ignored. From such "unpleasant surprise" perspective, I would consider this as a bug.
I agree it sounds like a bug to me
To me, alias is not an expression at all. It's a feature of a select clause and would be best modeled as such.
Theoretically I think this makes sense
One thing that is different about DataFusion than other systems is that the schemas are based on name (rather than column order) and the names of the columns are derived from the expressions
So the point is that an expression like a + 5 + 1 will result in a column named approximately "a + 5 + 1" -- so when the optimizer passes rewrite expressions they need to preserve the output name, and they do so using Expr::Alias
So in this case
a + 5 + 1
Becomes
a + 6 as "a + 5 + 1"
That rewrite can happen in multiple parts of the plan, not just the select list
There are likely other ways to handle this than Expr::Alias (a separate list on Projection, for example) but I do think alias is used widely
There are other Expr variants that only make sense in certain contexts as well. We can't have an Expr::Sort in a projection, for example.
That's a good point and good example. To me, in
ORDER BY <expr> [ASC/DESC] [NULLS FIRST/LAST], the<expr>part is an expression (any expression), and the other attributes (asc/desc, nulls first/last) are attributes of the sorting. They don't have to be modeled as "an expression".
I think Sort would be an easier thing to remove / fix -- Expr::Sort as an expression is also bad as it means the signatures of fn order_by(...) are in terms of Expr, meaning the compiler can't ensure you are actually passing Expr::Sort when needed
I think Sort would be an easier thing to remove / fix --
Expr::Sortas an expression is also bad as it means the signatures offn order_by(...)are in terms ofExpr, meaning the compiler can't ensure you are actually passingExpr::Sortwhen needed
Indeed. There are a few places where sort expression needs to be special cased.
There are likely other ways to handle this than Expr::Alias (a separate list on Projection, for example) but I do think alias is used widely
I see the point. I would be tempted to go into direction where names are explicit and expressions are just expressions.
Ie to model projects as map<String /* name to assign */, Expr /* value to calculate */>
one thing is how easy it is to get there (i get that likely pretty hard per "alias is used widely") and the other is whether we would want to get there at all (is this the right direction)
one thing is how easy it is to get there (i get that likely pretty hard per "alias is used widely")
I agree
and the other is whether we would want to get there at all (is this the right direction)
I don't have any strong feeling in this regard -- in general Expr::Alias seems to work ok, though there are some places where having to handle it (in filters, etc) have caused some issues in the past (which is why there is Expr::unalias and various other methods)
Besides fixing some problems (described in this issue + https://github.com/apache/datafusion/issues/1468#issuecomment-2308275476), separating "expression" and "named expression" could result in lower mental overhead involved when dealing with the code, given that in many contexts expressions don't have & don't need names (eg VALUES, ORDER BY, function inputs).
I think Sort would be an easier thing to remove / fix --
Expr::Sortas an expression is also bad as it means the signatures offn order_by(...)are in terms ofExpr, meaning the compiler can't ensure you are actually passingExpr::Sortwhen needed
filed https://github.com/apache/datafusion/issues/12193 for this
and created a PR removing Expr::Sort (https://github.com/apache/datafusion/pull/12177)
I'm confused about the relation in Alias. What is it for? Could we remove it? While digging into #12184, I think we don't need to keep relation in Alias, instead get the relation from expr (Specifically Expr::Column) if needed.
I'm confused about the
relationinAlias. What is it for? Could we remove it? While digging into #12184, I think we don't need to keeprelationinAlias, instead get the relation fromexpr(Specifically Expr::Column) if needed.
It may be historical -- Expr::Alias is quite old code I think
Schema / DFSchema has dual purpose and this is related to Expr::Alias existence and handling.
One necessary purpose is understanding source tables, and identifier resolution in the SQL queries - a necessary job to be done once SQL syntax tree (AST) is produced by the parser. The other purpose is for column/symbol/variable resolution between the logical plan components. Some engines (like Trino) use separate abstraction for that purpose. Reusing schema here makes it very hard to solve https://github.com/apache/datafusion/issues/13476, https://github.com/apache/datafusion/issues/6543 (see WIP https://github.com/apache/datafusion/pull/13489#issuecomment-2488695180)
I know addressing this is hard, but in 5 year from now perspective, would we prefer these issues to be still with us? Or would we prefer to address them sooner or later?
I think we should refactor LogicalPlans not to use aliases and column references at all. This can fall under https://github.com/apache/datafusion/issues/12723. We should find a way to do this incrementally, and I believe there is a way to do so.
I know addressing this is hard, but in 5 year from now perspective, would we prefer these issues to be still with us? Or would we prefer to address them sooner or later?
Well I am always for improving the situation!
I think we should refactor LogicalPlans not to use aliases and column references at all. This can fall under #12723. We should find a way to do this incrementally, and I believe there is a way to do so.
I agree -- doing it incrementally is the way to go and the only challenge will be to gather enough people / time to help make it happen.