datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

fix(optimizer): optimize_projections rule bug

Open JasonLi-cn opened this issue 1 year ago • 9 comments

Which issue does this PR close?

Closes https://github.com/apache/datafusion/issues/12183

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

JasonLi-cn avatar Aug 27 '24 01:08 JasonLi-cn

The plan looks like aliasing to the same name after optimize_projections

    Projection: test.a AS test.a
      TableScan: test projection=[a]

I wonder could we even remove the alias here, since the name is the same

jayzhan211 avatar Aug 27 '24 03:08 jayzhan211

The plan looks like aliasing to the same name after optimize_projections

    Projection: test.a AS test.a
      TableScan: test projection=[a]

I wonder could we even remove the alias here, since the name is the same

Removing the Alias will cause a schema change. 🤔

JasonLi-cn avatar Aug 27 '24 04:08 JasonLi-cn

The plan looks like aliasing to the same name after optimize_projections

    Projection: test.a AS test.a
      TableScan: test projection=[a]

I wonder could we even remove the alias here, since the name is the same

Removing the Alias will cause a schema change. 🤔

That is why I think the root cause might be the incorrect schema we have.

SELECT "test.a" FROM (SELECT a AS "test.a" FROM test)

The expected schema should be something like.

DFSchema { inner: Schema { fields: [Field { name: "test.a", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [Some(Bare { table: "test" })], functional_dependencies: FunctionalDependencies { deps: [] } }.

a is aliased to "test.a". The column should be test."test.a"

Upd:

query I
SELECT a FROM (SELECT a FROM test)
----
1

This is the schema for the query without alias

schema: DFSchema { inner: Schema { fields: [Field { name: "a", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [Some(Bare { table: "test" })], functional_dependencies: FunctionalDependencies { deps: [] } }
query I
SELECT b FROM (SELECT a as b FROM test)
----
1

This is the schema with alias

schema: DFSchema { inner: Schema { fields: [Field { name: "b", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [None], functional_dependencies: FunctionalDependencies { deps: [] } }

field_qualifiers is empty for alias case. Does that mean we didn't preserve field_qualifiers for alias. 🤔

field_qualifiers: [Some(Bare { table: "test" })]

jayzhan211 avatar Aug 27 '24 05:08 jayzhan211

The plan looks like aliasing to the same name after optimize_projections

    Projection: test.a AS test.a
      TableScan: test projection=[a]

I wonder could we even remove the alias here, since the name is the same

Removing the Alias will cause a schema change. 🤔

That is why I think the root cause might be the incorrect schema we have.

SELECT "test.a" FROM (SELECT a AS "test.a" FROM test)

The expected schema should be something like.

DFSchema { inner: Schema { fields: [Field { name: "test.a", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [Some(Bare { table: "test" })], functional_dependencies: FunctionalDependencies { deps: [] } }.

a is aliased to "test.a". The column should be test."test.a"

Upd:

query I
SELECT a FROM (SELECT a FROM test)
----
1

This is the schema for the query without alias

schema: DFSchema { inner: Schema { fields: [Field { name: "a", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [Some(Bare { table: "test" })], functional_dependencies: FunctionalDependencies { deps: [] } }
query I
SELECT b FROM (SELECT a as b FROM test)
----
1

This is the schema with alias

schema: DFSchema { inner: Schema { fields: [Field { name: "b", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [None], functional_dependencies: FunctionalDependencies { deps: [] } }

field_qualifiers is empty for alias case. Does that mean we didn't preserve field_qualifiers for alias. 🤔

field_qualifiers: [Some(Bare { table: "test" })]

Yes. Possibly because the b field itself is not part of the test table? 🤔 If we make the following changes to the SQL, the b field has field_qualifiers: [Some(Bare { table: "t0" })]

query I
SELECT b FROM (SELECT a as b FROM test) t0
----
1
schema: DFSchema { inner: Schema { fields: [Field { name: "b", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [Some(Bare { table: "t0" })], functional_dependencies: FunctionalDependencies { deps: [] } }

JasonLi-cn avatar Aug 27 '24 06:08 JasonLi-cn

I wonder could we even remove the alias here, since the name is the same

due to my subjective recency bias, https://github.com/apache/datafusion/issues/1468 seems related. would it be easier if output schema (column names) were explicit?

findepi avatar Aug 27 '24 08:08 findepi

I wonder could we even remove the alias here, since the name is the same

due to my subjective recency bias, #1468 seems related. would it be easier if output schema (column names) were explicit?

Not pretty sure, I think they are unrelated 🤔

I think the issue here is that we lost the qualifier for alias in subquery case

jayzhan211 avatar Aug 27 '24 11:08 jayzhan211

It looks like we need to save the qualifiers in the NamePreserver 🤔, otherwise, we won't be able to distinguish between the unqualified column "test.a" and the qualified column test.a. Their schema_names are the same.

jonahgao avatar Aug 27 '24 15:08 jonahgao

Should we get Column's relation for Alias in to_field?

            Expr::Alias(Alias { expr, name, .. }) => {
                let relation = match expr.as_ref() {
                    Expr::Column(c) => c.relation.to_owned(),
                    _ => None
                };

                let (data_type, nullable) = self.data_type_and_nullable(input_schema)?;
                Ok((
                    relation,
                    Field::new(name, data_type, nullable)
                        .with_metadata(self.metadata(input_schema)?)
                        .into(),
                ))
            }

jayzhan211 avatar Aug 28 '24 01:08 jayzhan211

Should we get Column's relation for Alias in to_field?

I think we should directly use the relation from the Alias itself, as that is the purpose of an Alias expression. After an expression is rewritten, by adding an Alias we can ensure that the result of to_field remains the same, including the qualifier and name, thus maintaining the schema unchanged.

jonahgao avatar Aug 28 '24 02:08 jonahgao

Here is an alternate implementation proposed by @jonahgao : https://github.com/apache/datafusion/pull/12341

alamb avatar Sep 05 '24 20:09 alamb

Here is an alternate implementation proposed by @jonahgao : #12341

thanks @jonahgao @jayzhan211 @alamb

JasonLi-cn avatar Sep 06 '24 09:09 JasonLi-cn