presto icon indicating copy to clipboard operation
presto copied to clipboard

Fix inserting values for single column of RowType

Open hantangwangd opened this issue 1 year ago • 6 comments

Description

Currently, when we try to execute insert values with single column of RowType, like row(a int, b varchar) or row(r row(a int, b varchar)), it would fail because of incorrect unfolding for the row. This PR fix the problem.

Test Plan

  • Make sure the fix do not affect other test cases
  • Newly added test case in AbstractTestDistributedQueries.testInsert()

Contributor checklist

  • [x] Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • [x] PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • [x] Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • [x] If release notes are required, they follow the release notes guidelines.
  • [x] Adequate tests were added if applicable.
  • [x] CI passed.

Release Notes

== NO RELEASE NOTE ==

hantangwangd avatar Nov 12 '23 16:11 hantangwangd

Lgtm, although introducing a new field in the scope isn't ideal. However, I couldn't identify a better alternative to address this issue

jaystarshot avatar Nov 15 '23 06:11 jaystarshot

Without this change, I see the below behavior -

presto:simple> describe concrete;
 Column |             Type              | Extra | Comment 
--------+-------------------------------+-------+---------
 x      | row("a" integer, "b" varchar) |       |         
(1 row)


--Works
insert into concrete values ROW(ROW(1,'a'));
--Fails
insert into concrete values (ROW(1,'a'));
insert into concrete values ((1,'a'));
insert into concrete values (1,'a');

The difference between the success and failure cases is that the parsed com.facebook.presto.sql.tree.Query has the com.facebook.presto.sql.tree.Values mapped to a ROW(1,'a') instead of a ROW(ROW(1,'a')) I'm not sure if this is correct behavior for all of these failed cases, e.g when using (ROW(1,'a')), IMO, we are explicitly calling out that the input is a row of a row The entry point for the parser is here: https://github.com/prestodb/presto/blob/a89f3e16227338216915405c63fe0e6a697d664c/presto-analyzer/src/main/java/com/facebook/presto/sql/analyzer/BuiltInQueryPreparer.java#L68-L70

IMO, we should investigate if the Analyzer is correct w.r.t to SQL grammar rules in interpreting the types passed for ROW types, I'm not familiar with these rules

aaneja avatar Nov 21 '23 12:11 aaneja

Without this change, I see the below behavior -

presto:simple> describe concrete;
 Column |             Type              | Extra | Comment 
--------+-------------------------------+-------+---------
 x      | row("a" integer, "b" varchar) |       |         
(1 row)


--Works
insert into concrete values ROW(ROW(1,'a'));
--Fails
insert into concrete values (ROW(1,'a'));
insert into concrete values ((1,'a'));
insert into concrete values (1,'a');

The difference between the success and failure cases is that the parsed com.facebook.presto.sql.tree.Query has the com.facebook.presto.sql.tree.Values mapped to a ROW(1,'a') instead of a ROW(ROW(1,'a')) I'm not sure if this is correct behavior for all of these failed cases, e.g when using (ROW(1,'a')), IMO, we are explicitly calling out that the input is a row of a row The entry point for the parser is here:

https://github.com/prestodb/presto/blob/a89f3e16227338216915405c63fe0e6a697d664c/presto-analyzer/src/main/java/com/facebook/presto/sql/analyzer/BuiltInQueryPreparer.java#L68-L70

IMO, we should investigate if the Analyzer is correct w.r.t to SQL grammar rules in interpreting the types passed for ROW types, I'm not familiar with these rules

Thanks for you example. We can just figure out the grammar in your example. The sql grammar for row value is defined here: https://jakewheat.github.io/sql-overview/sql-2011-foundation-grammar.html#row-value-special-case.

Following the grammar, when there is only one column, we can omit the bracket for a row. But when we explicitly use Row(...), we should not simply treat it as a bracket and omit it. So, as I understand, when we execute statement insert into concrete values Row(Row(1, 'a')), we are trying to insert a row with single column data whose type is row(row(int, varchar)). It has the same semantics as insert into concrete values (Row(Row(1, 'a'))), so it's a wrong behavior if it works.

On the contrary, the three failure statements should not fail, because they all trying to insert a row with single column data whose type is row(int, varchar)

hantangwangd avatar Nov 21 '23 15:11 hantangwangd

Following the grammar, when there is only one column, we can omit the bracket for a row. But when we explicitly use Row(...), we should not simply treat it as a bracket and omit it.

I'm not sure what interpretation of the parenthesis by the parser is correct. I will tag in a few other reviewers to comment on this. Let's try to arrive at the correct parser implementation if we can

ANTLR references for how VALUES is resolved for different systems :

https://github.com/prestodb/presto/blob/8242a40e7132946cc2de1645bef5dfdaadcc6a71/presto-parser/src/main/antlr4/com/facebook/presto/sql/parser/SqlBase.g4#L239

https://github.com/antlr/grammars-v4/blob/master/sql/postgresql/PostgreSQLParser.g4#L3123

values_clause
   : VALUES OPEN_PAREN expr_list CLOSE_PAREN (COMMA OPEN_PAREN expr_list CLOSE_PAREN)*
   ;

https://github.com/antlr/grammars-v4/blob/master/sql/snowflake/SnowflakeParser.g4#L79

values_list
    : VALUES '(' value_item (COMMA value_item)* ')'
    ;

https://github.com/antlr/grammars-v4/blob/master/sql/tsql/TSqlParser.g4#L4788

table_value_constructor
    : VALUES '(' exps+=expression_list_ ')' (',' '(' exps+=expression_list_ ')')*
    ;

Presto allows for skipping parenthesis while most other systems do not My gut feeling here is that if we can solve the parenthesis matching that generates a ROW expression vs. one that matches the VALUES list, we could possibly have a more elegant solution

aaneja avatar Nov 22 '23 05:11 aaneja

Presto allows for skipping parenthesis while most other systems do not My gut feeling here is that if we can solve the parenthesis matching that generates a ROW expression vs. one that matches the VALUES list, we could possibly have a more elegant solution

Thanks a lot for your great job of listing the ANTLR references. And I agree with you, if we can explicitly figure out a ROW expression and a line of value list in the parser, this problem could be resolved more elegantly. That means some grammar constraint may be introduced, as postgreSQL or snowflake does. And thanks for inviting other reviewers, they may have better thoughts.

hantangwangd avatar Nov 22 '23 09:11 hantangwangd

Hi @aaneja @tdcmeehan , it's been a long time. Could we decide how to resolve this problem? Changing the grammar constraint of presto could be a solution, but in my opinion not to do so may be better, as it does indeed follow the SQL grammar spec and a lot of use cases and test cases are working based on it.

If not changing the grammar is ok, then I have refactored Scope to make it not so problem customized, and support a mechanism which enable transmit messages from top to bottom when analyzing a statement tree.

What's your opinion? Any thoughts please let me know, thanks!

hantangwangd avatar Jan 25 '24 04:01 hantangwangd