pinot Improve parse error handling and error messages.

Work in progress

Currently if we use any reserved keyword as part of query or as a column name in Pinot, we get back a very cryptic(non-useful) Query Compilation error, e.g.

WITH grouping AS (SELECT 1) select * from grouping;

triggers: org.apache.pinot.sql.parsers.SqlCompilationException: Caught exception while parsing query: WITH grouping AS (SELECT 1) select * from grouping Caused by: org.apache.pinot.sql.parsers.parser.ParseException: Encountered "" at line 1, column 1. Was expecting one of: (empty)

In most cases error's position, token and list of expected tokens are wrong.

This PR improves errors generated by parser so that the sql above triggers : Encountered \" \"GROUPING\" \"grouping \"\" at line 1, column 6.\n" + "Was expecting one of:\n" + " <BRACKET_QUOTED_IDENTIFIER> ...

The biggest change in the PR is switch from global lookahead value 2 to 1. Most of the changes in Parser.jj file are fixes for clashes caused by it.

Oct 15 '24 10:10 bziobrowski

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 63.77%. Comparing base (59551e4) to head (55e15ba). Report is 1192 commits behind head on master.

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #14238      +/-   ##
============================================
+ Coverage     61.75%   63.77%   +2.02%     
- Complexity      207     1535    +1328     
============================================
  Files          2436     2624     +188     
  Lines        133233   144611   +11378     
  Branches      20636    22134    +1498     
============================================
+ Hits          82274    92227    +9953     
- Misses        44911    45569     +658     
- Partials       6048     6815     +767

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (+99.99%)`	:arrow_up:
integration	`100.00% <ø> (+99.99%)`	:arrow_up:
integration1	`100.00% <ø> (+99.99%)`	:arrow_up:
integration2	`0.00% <ø> (ø)`
java-11	`63.76% <ø> (+2.05%)`	:arrow_up:
java-21	`63.64% <ø> (+2.02%)`	:arrow_up:
skip-bytebuffers-false	`63.77% <ø> (+2.02%)`	:arrow_up:
skip-bytebuffers-true	`63.63% <ø> (+35.90%)`	:arrow_up:
temurin	`63.77% <ø> (+2.02%)`	:arrow_up:
unittests	`63.77% <ø> (+2.02%)`	:arrow_up:
unittests1	`55.43% <ø> (+8.54%)`	:arrow_up:
unittests2	`34.32% <ø> (+6.59%)`	:arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

Oct 15 '24 11:10 codecov-commenter

We should avoid touching the code copy-pasted from Calcite unless absolutely necessary. Modifying the copy pasted code means extra management overhead going forward, which IMO is not worth it. Do you see an easier way to fix these parsing errors? If not, I guess we can instead improve our documentation to help users

Oct 15 '24 19:10 Jackie-Jiang

If babel is the root cause, we can consider using a different config to reduce the non-preserved keyword

Oct 15 '24 19:10 Jackie-Jiang

I don't think this problem can be addressed with better documentation because errors returned by parsers have: A. wrong position B. point to wrong token C. suggest empty list of tokens or wrong tokens That means user can't look at the error position and guess that it's caused by use of reserved keyword (see example in issue description).

Switching to lookahead=1 reveals a number of keyword duplicates in config.fmpp and default_config.fmpp (as compilation errors in parser) and a number of clashes, mostly between productions and nonReservedKeywords. Quick test shows that we could minimize the number of added lookaheads by reducing the number of non-reserved or reserved keywords (the list is definitely too big) .

Oct 16 '24 10:10 bziobrowski

Given we use a slightly modified version of Apache Calcite parser... is it safe to assume the standard Apache Calcite parser has this issue as well? If that is the case I think we should ask in their email list/open an issue there. What do you think?

Oct 21 '24 09:10 gortiz

I've just verified this is an issue for Calcite as well. I've executed sqlline as specified in https://calcite.apache.org/docs/tutorial.html and then I've queried SELECT * FROM emps as current and the returned error is:

> SELECT * FROM emps as current;
Error: Error while executing SQL "SELECT * FROM emps as current": parse failed: Encountered "current" at line 1, column 23.
Was expecting one of:
    <BRACKET_QUOTED_IDENTIFIER> ...
    <QUOTED_IDENTIFIER> ...
    <BACK_QUOTED_IDENTIFIER> ...
    <BIG_QUERY_BACK_QUOTED_IDENTIFIER> ...
    <HYPHENATED_IDENTIFIER> ...
    <IDENTIFIER> ...
    <UNICODE_QUOTED_IDENTIFIER> ... (state=,code=0)

@bziobrowski can you ask in Apache Calcite about that?

Oct 21 '24 09:10 gortiz

Yes, calcite's babel parser has the same issue (core parser is better). It comes mainly from using nonReservedKeywords and fixing it with lookahead=2 setting, not from babel sql conformance . With current parser grammar Lookahead=2 greatly expands the list of 'expected' tokens and breaks error position and token.

Oct 21 '24 15:10 bziobrowski

pinot pinot copied to clipboard

Improve parse error handling and error messages.

Codecov Report

pinot
pinot copied to clipboard