datafusion
datafusion copied to clipboard
fix(optimizer): optimize_projections rule bug
Which issue does this PR close?
Closes https://github.com/apache/datafusion/issues/12183
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?
The plan looks like aliasing to the same name after optimize_projections
Projection: test.a AS test.a
TableScan: test projection=[a]
I wonder could we even remove the alias here, since the name is the same
The plan looks like aliasing to the same name after
optimize_projectionsProjection: test.a AS test.a TableScan: test projection=[a]I wonder could we even remove the alias here, since the name is the same
Removing the Alias will cause a schema change. 🤔
The plan looks like aliasing to the same name after
optimize_projectionsProjection: test.a AS test.a TableScan: test projection=[a]I wonder could we even remove the alias here, since the name is the same
Removing the Alias will cause a schema change. 🤔
That is why I think the root cause might be the incorrect schema we have.
SELECT "test.a" FROM (SELECT a AS "test.a" FROM test)
The expected schema should be something like.
DFSchema { inner: Schema { fields: [Field { name: "test.a", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [Some(Bare { table: "test" })], functional_dependencies: FunctionalDependencies { deps: [] } }.
a is aliased to "test.a". The column should be test."test.a"
Upd:
query I
SELECT a FROM (SELECT a FROM test)
----
1
This is the schema for the query without alias
schema: DFSchema { inner: Schema { fields: [Field { name: "a", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [Some(Bare { table: "test" })], functional_dependencies: FunctionalDependencies { deps: [] } }
query I
SELECT b FROM (SELECT a as b FROM test)
----
1
This is the schema with alias
schema: DFSchema { inner: Schema { fields: [Field { name: "b", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [None], functional_dependencies: FunctionalDependencies { deps: [] } }
field_qualifiers is empty for alias case. Does that mean we didn't preserve field_qualifiers for alias. 🤔
field_qualifiers: [Some(Bare { table: "test" })]
The plan looks like aliasing to the same name after
optimize_projectionsProjection: test.a AS test.a TableScan: test projection=[a]I wonder could we even remove the alias here, since the name is the same
Removing the Alias will cause a schema change. 🤔
That is why I think the root cause might be the incorrect schema we have.
SELECT "test.a" FROM (SELECT a AS "test.a" FROM test)The expected schema should be something like.
DFSchema { inner: Schema { fields: [Field { name: "test.a", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [Some(Bare { table: "test" })], functional_dependencies: FunctionalDependencies { deps: [] } }.a is aliased to "test.a". The column should be
test."test.a"Upd:
query I SELECT a FROM (SELECT a FROM test) ---- 1This is the schema for the query without alias
schema: DFSchema { inner: Schema { fields: [Field { name: "a", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [Some(Bare { table: "test" })], functional_dependencies: FunctionalDependencies { deps: [] } }query I SELECT b FROM (SELECT a as b FROM test) ---- 1This is the schema with alias
schema: DFSchema { inner: Schema { fields: [Field { name: "b", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [None], functional_dependencies: FunctionalDependencies { deps: [] } }
field_qualifiersis empty for alias case. Does that mean we didn't preservefield_qualifiersfor alias. 🤔field_qualifiers: [Some(Bare { table: "test" })]
Yes. Possibly because the b field itself is not part of the test table? 🤔
If we make the following changes to the SQL, the b field has field_qualifiers: [Some(Bare { table: "t0" })]
query I
SELECT b FROM (SELECT a as b FROM test) t0
----
1
schema: DFSchema { inner: Schema { fields: [Field { name: "b", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [Some(Bare { table: "t0" })], functional_dependencies: FunctionalDependencies { deps: [] } }
I wonder could we even remove the alias here, since the name is the same
due to my subjective recency bias, https://github.com/apache/datafusion/issues/1468 seems related. would it be easier if output schema (column names) were explicit?
I wonder could we even remove the alias here, since the name is the same
due to my subjective recency bias, #1468 seems related. would it be easier if output schema (column names) were explicit?
Not pretty sure, I think they are unrelated 🤔
I think the issue here is that we lost the qualifier for alias in subquery case
It looks like we need to save the qualifiers in the NamePreserver 🤔, otherwise, we won't be able to distinguish between the unqualified column "test.a" and the qualified column test.a. Their schema_names are the same.
Should we get Column's relation for Alias in to_field?
Expr::Alias(Alias { expr, name, .. }) => {
let relation = match expr.as_ref() {
Expr::Column(c) => c.relation.to_owned(),
_ => None
};
let (data_type, nullable) = self.data_type_and_nullable(input_schema)?;
Ok((
relation,
Field::new(name, data_type, nullable)
.with_metadata(self.metadata(input_schema)?)
.into(),
))
}
Should we get Column's relation for Alias in
to_field?
I think we should directly use the relation from the Alias itself, as that is the purpose of an Alias expression. After an expression is rewritten, by adding an Alias we can ensure that the result of to_field remains the same, including the qualifier and name, thus maintaining the schema unchanged.
Here is an alternate implementation proposed by @jonahgao : https://github.com/apache/datafusion/pull/12341
Here is an alternate implementation proposed by @jonahgao : #12341
thanks @jonahgao @jayzhan211 @alamb