single_distinct_to_groupby panic when group by expr is a binaryExpr
Describe the bug
single_distinct_to_groupby panic when group by expr is a binaryExpr, with error message like this:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: SchemaError(FieldNotFound { qualifier: Some("users"), name: "id", valid_fields: Some(["users.id + Int64(1)", "alias1"]) })', datafusion/optimizer/src/single_distinct_to_groupby.rs:117:72
To Reproduce Based on https://github.com/apache/arrow-datafusion/blob/e47c4eb1ac8a5c3c07a0ed2bad74ecd6509bae7b/datafusion-examples/examples/memtable.rs#L39,
Change sql to following will reproduce this error.
SELECT id+1 as id, count(distinct(bank_account))
FROM users
group by id+1
order by id
Expected behavior No error
Additional context Add any other context about the problem here.
Thanks for your reproducer @jiacai2050. Here are some investigations:
Looks like the single_distinct_to_groupby optimizer doesn't consider the original GROUP BY contains expression like a+1 or b%2. It copies GROUP BYs to the inner subquery, which may cause two problems:
- Schema doesn't match. The schema of subquery will become something like
#table.a + Int64(1). And thus the outerGROUP BYcannot get a columnafrom it. - Duplicated evaluation. Expressions in
GROUP BYcannot be calculated twice.
To fix this, we need to alias the "original" GROUP BY in the subquery and refer to that aliased column in outer. E.g.:
- Before optimize
SELECT id+1, count(distinct(bank_account)) FROM users GROUP BY id+1; - After optimize
SELECT group_by_alias1 as "id+1", count(bank_account) FROM ( SELECT id+1 as group_by_alias1, bank_account FROM users GROUP BY id+1, bank_account ) GROUP BY group_by_alias1;
I'll submit a fix around this or next week.
alias is one potential solution, when I debug this issue, I find something strange:
modified datafusion/expr/src/expr_schema.rs
@@ -223,9 +223,9 @@ impl ExprSchemable for Expr {
)),
_ => Ok(DFField::new(
None,
- &self.name(input_schema)?,
- self.get_type(input_schema)?,
- self.nullable(input_schema)?,
+ &self.name(input_schema).unwrap(),
+ self.get_type(input_schema).unwrap(),
+ self.nullable(input_schema).unwrap(),
)),
}
datafusion panic at self.get_type() not self.name(), if the rewritten schema is incomplete, why would self.name success?
It seems a little strange that self.name and self.get_type have different behavior.
It seems a little strange that self.name and self.get_type have different behavior.
Expr::name() is nothing more than a string concat, it won't find whether this column exists in schema. In other words, it won't return an error due to the wrong schema or column:
fn create_name(e: &Expr, input_schema: &DFSchema) -> Result<String> {
match e {
Expr::Alias(_, name) => Ok(name.clone()),
Expr::Column(c) => Ok(c.flat_name()),
Expr::Literal(value) => Ok(format!("{:?}", value)),
Expr::BinaryExpr { left, op, right } => {
let left = create_name(left, input_schema)?;
let right = create_name(right, input_schema)?;
Ok(format!("{} {} {}", left, op, right))
}
.. => {}
}
But get_type and get_nullable can't. The type and nullable info are stored in schema.
BTW, It looks like the input_schema is not used in this function. Maybe we can remove it.
- https://github.com/apache/arrow-datafusion/blob/92e98df2e412632eba1e188a9ee2ec3bf31f20e6/datafusion/expr/src/expr.rs#L312 The comment saying that the name is based on schema, but this code is introduced at 2020/10, maybe @andygrove @alamb could share some ideas whether the schema is needed here.
Anyway, I think adding alias to group by exprs is the way to fix the bug, so you can fix it first.
I didn't follow this entire conversation in detail but " adding alias to group by exprs is the way to fix the bug, so you can fix it first." sounds right to me as well
I can't recall the nuances of Expr::name but it may be from before the time we had reasonable relation support (aka there could well be a bug there)