datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

single_distinct_to_groupby panic when group by expr is a binaryExpr

Open jiacai2050 opened this issue 3 years ago • 5 comments

Describe the bug

single_distinct_to_groupby panic when group by expr is a binaryExpr, with error message like this:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: SchemaError(FieldNotFound { qualifier: Some("users"), name: "id", valid_fields: Some(["users.id + Int64(1)", "alias1"]) })', datafusion/optimizer/src/single_distinct_to_groupby.rs:117:72

To Reproduce Based on https://github.com/apache/arrow-datafusion/blob/e47c4eb1ac8a5c3c07a0ed2bad74ecd6509bae7b/datafusion-examples/examples/memtable.rs#L39,

Change sql to following will reproduce this error.

SELECT id+1 as id, count(distinct(bank_account)) 
FROM users
group by id+1
order by id

Expected behavior No error

Additional context Add any other context about the problem here.

jiacai2050 avatar Aug 01 '22 08:08 jiacai2050

Thanks for your reproducer @jiacai2050. Here are some investigations:

Looks like the single_distinct_to_groupby optimizer doesn't consider the original GROUP BY contains expression like a+1 or b%2. It copies GROUP BYs to the inner subquery, which may cause two problems:

  • Schema doesn't match. The schema of subquery will become something like #table.a + Int64(1). And thus the outer GROUP BY cannot get a column a from it.
  • Duplicated evaluation. Expressions in GROUP BY cannot be calculated twice.

To fix this, we need to alias the "original" GROUP BY in the subquery and refer to that aliased column in outer. E.g.:

  • Before optimize
    SELECT id+1, count(distinct(bank_account))
    FROM users
    GROUP BY id+1;
    
  • After optimize
    SELECT group_by_alias1 as "id+1", count(bank_account)
    FROM (
      SELECT id+1 as group_by_alias1, bank_account
      FROM users 
      GROUP BY id+1, bank_account
    )
    GROUP BY group_by_alias1;
    

I'll submit a fix around this or next week.

waynexia avatar Aug 02 '22 09:08 waynexia

alias is one potential solution, when I debug this issue, I find something strange:

modified   datafusion/expr/src/expr_schema.rs
@@ -223,9 +223,9 @@ impl ExprSchemable for Expr {
             )),
             _ => Ok(DFField::new(
                 None,
-                &self.name(input_schema)?,
-                self.get_type(input_schema)?,
-                self.nullable(input_schema)?,
+                &self.name(input_schema).unwrap(),
+                self.get_type(input_schema).unwrap(),
+                self.nullable(input_schema).unwrap(),
             )),
         }

datafusion panic at self.get_type() not self.name(), if the rewritten schema is incomplete, why would self.name success?

It seems a little strange that self.name and self.get_type have different behavior.

jiacai2050 avatar Aug 02 '22 13:08 jiacai2050

It seems a little strange that self.name and self.get_type have different behavior.

Expr::name() is nothing more than a string concat, it won't find whether this column exists in schema. In other words, it won't return an error due to the wrong schema or column:

fn create_name(e: &Expr, input_schema: &DFSchema) -> Result<String> {
    match e {
        Expr::Alias(_, name) => Ok(name.clone()),
        Expr::Column(c) => Ok(c.flat_name()),
        Expr::Literal(value) => Ok(format!("{:?}", value)),
        Expr::BinaryExpr { left, op, right } => {
            let left = create_name(left, input_schema)?;
            let right = create_name(right, input_schema)?;
            Ok(format!("{} {} {}", left, op, right))
        }
        .. => {}
}

But get_type and get_nullable can't. The type and nullable info are stored in schema.

BTW, It looks like the input_schema is not used in this function. Maybe we can remove it.

waynexia avatar Aug 03 '22 02:08 waynexia

  • https://github.com/apache/arrow-datafusion/blob/92e98df2e412632eba1e188a9ee2ec3bf31f20e6/datafusion/expr/src/expr.rs#L312 The comment saying that the name is based on schema, but this code is introduced at 2020/10, maybe @andygrove @alamb could share some ideas whether the schema is needed here.

Anyway, I think adding alias to group by exprs is the way to fix the bug, so you can fix it first.

jiacai2050 avatar Aug 03 '22 06:08 jiacai2050

I didn't follow this entire conversation in detail but " adding alias to group by exprs is the way to fix the bug, so you can fix it first." sounds right to me as well

I can't recall the nuances of Expr::name but it may be from before the time we had reasonable relation support (aka there could well be a bug there)

alamb avatar Aug 03 '22 21:08 alamb