Infer data type from schema for `Values` and add struct coercion to `coalesce`
Which issue does this PR close?
Closes #5046 .
Follow up from #12839
- Infer Values from schema if exists
- Array and Coalesce has the similar logic, applies the struct coercion logic to coalesce, while values are quite different, rewrite it's own logic.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?
If this change is too large, I can try to split this to several one
Hi @jayzhan211, what is the current status of this PR? I have time to review it if it is ready
Hi @jayzhan211, what is the current status of this PR? I have time to review it if it is ready
This is ready for review, I forgot to turn on it.
Conflict again
I merged up from main and resolved a clippy failure
This should fail to build the initial plan. Instead it fails at some later stages:
> EXPLAIN CREATE TABLE t(a int) AS VALUES (a + a);
+-----------+------+
| plan_type | plan |
+-----------+------+
+-----------+------+
It seems it'sa regression
This is invalid query, but it succeeds
> CREATE TABLE t(a int) AS SELECT x FROM (VALUES (a)) t(x) WHERE false;
0 row(s) fetched.
Elapsed 0.014 seconds.
It seems it'sa regression
@findepi what is your suggested path forward?
From my perspective this PR improves several cases (queries that should run, but currently do not on main) and thus while perhaps not perfect it seems like an improvement even though there are additional areas potentially to improve.
So my personal suggestion is
- address comments as best as possible
- file ticket(s) to track additional issues that were identified during review
- merge this PR
- work on the additional issues as follow on PRs
I am not convinced that that code we eventually should have will be natural evolution of the code here, so it's hard for me to judge whether this is a step in the right direction. The problem is with how table-being-created's schema is passed via a "global variable" to some VALUES being planned, without checking whether those values are direct input to the table being created. They may or may not be.
How should this be solved? I think this is a question to you @alamb more than me. But I can try to guess. For example, when planning CREATE TABLE we could check whether it's input is VALUES and bypass generic sql_values_to_plan in such case, calling a new function with this new functionality.
What are your thoughts?
I am not convinced that that code we eventually should have will be natural evolution of the code here, so it's hard for me to judge whether this is a step in the right direction.
In my mind the additional test cases that pass with this code but not on main represent the step forward
The internal implementation (aka the code in this PR) may well change over time / need a different structure than what is in this PR, but the end interface (aka what SQL can run / what the plans are) should be the same.
I think as long as there is no regression we should move on and file the ticket for the remaining issue. I guess what @findepi metioned is something like insert into values() that is not direct input to the table 🤔 But that is not the problem I tend to solve, so I don't think there is any reason to block this change on other issue
This remains correct only for VALUES that are direct input to CREATE TABLE.
The problem is with how table-being-created's schema is passed via a "global variable" to some VALUES being planned, without checking whether those values are direct input to the table being created. They may or may not be.
I would be nice if there is an example for this. I think it is something like insert into values()?
Thanks @alamb and @findepi
I think as long as there is no regression we should move on and file the ticket for the remaining issue.
Sorry for not following earlier, was on a full-day event yesterday.
There are regressions. I kind of felt it's obvious from the way it works, sorry for not providing good examples earlier.
Before the change
> CREATE OR REPLACE TABLE t(a int) AS SELECT length(a) FROM (VALUES ('+123')) t(a); SELECT * FROM t;
0 row(s) fetched.
Elapsed 0.005 seconds.
+---+
| a |
+---+
| 4 |
+---+
> CREATE OR REPLACE TABLE t(a int) AS SELECT length(a) FROM (VALUES ('abcd')) t(a); SELECT * FROM t;
0 row(s) fetched.
Elapsed 0.078 seconds.
+---+
| a |
+---+
| 4 |
+---+
on current main
Wrong result:
> CREATE OR REPLACE TABLE t(a int) AS SELECT length(a) FROM (VALUES ('+123')) t(a); SELECT * FROM t;
0 row(s) fetched.
Elapsed 0.071 seconds.
+---+
| a |
+---+
| 3 |
failure
> CREATE OR REPLACE TABLE t(a int) AS SELECT length(a) FROM (VALUES ('abcd')) t(a); SELECT * FROM t;
Arrow error: Cast error: Cannot cast string 'abcd' to value of Int32 type
It seems the query is accidentally correct in before this change, because we don't know the result of the function when we build up Values plan. After this change, incorrect values ('abcd') is cast to column a instead of the result of the function.
The ideally solution is to find the result type of the function and check whether it matches the column type.
btw, I wonder is this query valid in postgres or elsewhere?
Valid query in postgres
CREATE TABLE t AS SELECT length(a)::int AS a FROM (VALUES ('+123')) t(a);
Invalid query in postgres
CREATE TABLE t (a int) AS SELECT length(a)::int AS a FROM (VALUES ('+123')) t(a);
ERROR: syntax error at or near "AS"
LINE 1: CREATE TABLE t(a int) AS SELECT length(a)::int AS a FROM (VA...
It seems there is no way to insert value together with table if the column type is defined 🤔
It seems the query is accidentally correct in before this change, because we don't know the result of the function when we build up Values plan.
Values plan is build for VALUES ('abcd') part of the query. The type is known to be Utf8.
The the surrounding query is planned, the length(Utf8) function is known to return the length as a number.
What was accidental about this?
It seems the query is accidentally correct in before this change, because we don't know the result of the function when we build up Values plan.
Values plan is build for
VALUES ('abcd')part of the query. The type is known to be Utf8. The the surrounding query is planned, thelength(Utf8)function is known to return the length as a number.What was accidental about this?
I think mistakenly output the values on the wrong branch.
Let's find out such a valid query in postgres to make sure we need to support this kind of query in datafusion. And add the test to ensure the coverage.
I have filed https://github.com/apache/datafusion/issues/13124 to track the issues raised above explicitly
thank you @alamb