datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Infer data type from schema for `Values` and add struct coercion to `coalesce`

Open jayzhan211 opened this issue 1 year ago • 1 comments

Which issue does this PR close?

Closes #5046 .

Follow up from #12839

  1. Infer Values from schema if exists
  2. Array and Coalesce has the similar logic, applies the struct coercion logic to coalesce, while values are quite different, rewrite it's own logic.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

jayzhan211 avatar Oct 11 '24 10:10 jayzhan211

If this change is too large, I can try to split this to several one

jayzhan211 avatar Oct 18 '24 06:10 jayzhan211

Hi @jayzhan211, what is the current status of this PR? I have time to review it if it is ready

berkaysynnada avatar Oct 21 '24 07:10 berkaysynnada

Hi @jayzhan211, what is the current status of this PR? I have time to review it if it is ready

This is ready for review, I forgot to turn on it.

jayzhan211 avatar Oct 21 '24 08:10 jayzhan211

Conflict again

jayzhan211 avatar Oct 21 '24 08:10 jayzhan211

I merged up from main and resolved a clippy failure

alamb avatar Oct 22 '24 18:10 alamb

This should fail to build the initial plan. Instead it fails at some later stages:

> EXPLAIN CREATE TABLE t(a int) AS VALUES (a + a);
+-----------+------+
| plan_type | plan |
+-----------+------+
+-----------+------+

It seems it'sa regression

findepi avatar Oct 23 '24 06:10 findepi

This is invalid query, but it succeeds

> CREATE TABLE t(a int) AS SELECT x FROM (VALUES (a)) t(x) WHERE false;
0 row(s) fetched.
Elapsed 0.014 seconds.

It seems it'sa regression

findepi avatar Oct 23 '24 07:10 findepi

@findepi what is your suggested path forward?

From my perspective this PR improves several cases (queries that should run, but currently do not on main) and thus while perhaps not perfect it seems like an improvement even though there are additional areas potentially to improve.

So my personal suggestion is

  1. address comments as best as possible
  2. file ticket(s) to track additional issues that were identified during review
  3. merge this PR
  4. work on the additional issues as follow on PRs

alamb avatar Oct 23 '24 10:10 alamb

I am not convinced that that code we eventually should have will be natural evolution of the code here, so it's hard for me to judge whether this is a step in the right direction. The problem is with how table-being-created's schema is passed via a "global variable" to some VALUES being planned, without checking whether those values are direct input to the table being created. They may or may not be.

How should this be solved? I think this is a question to you @alamb more than me. But I can try to guess. For example, when planning CREATE TABLE we could check whether it's input is VALUES and bypass generic sql_values_to_plan in such case, calling a new function with this new functionality.

What are your thoughts?

findepi avatar Oct 23 '24 10:10 findepi

I am not convinced that that code we eventually should have will be natural evolution of the code here, so it's hard for me to judge whether this is a step in the right direction.

In my mind the additional test cases that pass with this code but not on main represent the step forward

The internal implementation (aka the code in this PR) may well change over time / need a different structure than what is in this PR, but the end interface (aka what SQL can run / what the plans are) should be the same.

alamb avatar Oct 23 '24 11:10 alamb

I think as long as there is no regression we should move on and file the ticket for the remaining issue. I guess what @findepi metioned is something like insert into values() that is not direct input to the table 🤔 But that is not the problem I tend to solve, so I don't think there is any reason to block this change on other issue

This remains correct only for VALUES that are direct input to CREATE TABLE.

The problem is with how table-being-created's schema is passed via a "global variable" to some VALUES being planned, without checking whether those values are direct input to the table being created. They may or may not be.

I would be nice if there is an example for this. I think it is something like insert into values()?

jayzhan211 avatar Oct 23 '24 13:10 jayzhan211

Thanks @alamb and @findepi

jayzhan211 avatar Oct 24 '24 00:10 jayzhan211

I think as long as there is no regression we should move on and file the ticket for the remaining issue.

Sorry for not following earlier, was on a full-day event yesterday.

There are regressions. I kind of felt it's obvious from the way it works, sorry for not providing good examples earlier.

Before the change

> CREATE OR REPLACE TABLE t(a int) AS SELECT length(a) FROM (VALUES ('+123')) t(a); SELECT * FROM t;
0 row(s) fetched.
Elapsed 0.005 seconds.

+---+
| a |
+---+
| 4 |
+---+
> CREATE OR REPLACE TABLE t(a int) AS SELECT length(a) FROM (VALUES ('abcd')) t(a); SELECT * FROM t;
0 row(s) fetched.
Elapsed 0.078 seconds.

+---+
| a |
+---+
| 4 |
+---+

on current main

Wrong result:

> CREATE OR REPLACE TABLE t(a int) AS SELECT length(a) FROM (VALUES ('+123')) t(a); SELECT * FROM t;
0 row(s) fetched.
Elapsed 0.071 seconds.

+---+
| a |
+---+
| 3 |

failure

> CREATE OR REPLACE TABLE t(a int) AS SELECT length(a) FROM (VALUES ('abcd')) t(a); SELECT * FROM t;
Arrow error: Cast error: Cannot cast string 'abcd' to value of Int32 type

findepi avatar Oct 25 '24 07:10 findepi

It seems the query is accidentally correct in before this change, because we don't know the result of the function when we build up Values plan. After this change, incorrect values ('abcd') is cast to column a instead of the result of the function.

The ideally solution is to find the result type of the function and check whether it matches the column type.

btw, I wonder is this query valid in postgres or elsewhere?

Valid query in postgres CREATE TABLE t AS SELECT length(a)::int AS a FROM (VALUES ('+123')) t(a);

Invalid query in postgres CREATE TABLE t (a int) AS SELECT length(a)::int AS a FROM (VALUES ('+123')) t(a);

ERROR:  syntax error at or near "AS"
LINE 1: CREATE TABLE t(a int) AS SELECT length(a)::int AS a FROM (VA...

It seems there is no way to insert value together with table if the column type is defined 🤔

jayzhan211 avatar Oct 25 '24 08:10 jayzhan211

It seems the query is accidentally correct in before this change, because we don't know the result of the function when we build up Values plan.

Values plan is build for VALUES ('abcd') part of the query. The type is known to be Utf8. The the surrounding query is planned, the length(Utf8) function is known to return the length as a number.

What was accidental about this?

findepi avatar Oct 25 '24 08:10 findepi

It seems the query is accidentally correct in before this change, because we don't know the result of the function when we build up Values plan.

Values plan is build for VALUES ('abcd') part of the query. The type is known to be Utf8. The the surrounding query is planned, the length(Utf8) function is known to return the length as a number.

What was accidental about this?

I think mistakenly output the values on the wrong branch.

Let's find out such a valid query in postgres to make sure we need to support this kind of query in datafusion. And add the test to ensure the coverage.

jayzhan211 avatar Oct 25 '24 10:10 jayzhan211

I have filed https://github.com/apache/datafusion/issues/13124 to track the issues raised above explicitly

alamb avatar Oct 26 '24 12:10 alamb

thank you @alamb

findepi avatar Oct 26 '24 17:10 findepi