pinot
pinot copied to clipboard
Improve parse error handling and error messages.
Work in progress
Currently if we use any reserved keyword as part of query or as a column name in Pinot, we get back a very cryptic(non-useful) Query Compilation error, e.g.
WITH grouping AS (SELECT 1) select * from grouping;
triggers:
org.apache.pinot.sql.parsers.SqlCompilationException: Caught exception while parsing query: WITH grouping AS (SELECT 1) select * from grouping Caused by: org.apache.pinot.sql.parsers.parser.ParseException: Encountered "" at line 1, column 1. Was expecting one of: (empty)
In most cases error's position, token and list of expected tokens are wrong.
This PR improves errors generated by parser so that the sql above triggers :
Encountered \" \"GROUPING\" \"grouping \"\" at line 1, column 6.\n" + "Was expecting one of:\n" + " <BRACKET_QUOTED_IDENTIFIER> ...
The biggest change in the PR is switch from global lookahead value 2 to 1. Most of the changes in Parser.jj file are fixes for clashes caused by it.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 63.77%. Comparing base (
59551e4) to head (55e15ba). Report is 1192 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #14238 +/- ##
============================================
+ Coverage 61.75% 63.77% +2.02%
- Complexity 207 1535 +1328
============================================
Files 2436 2624 +188
Lines 133233 144611 +11378
Branches 20636 22134 +1498
============================================
+ Hits 82274 92227 +9953
- Misses 44911 45569 +658
- Partials 6048 6815 +767
| Flag | Coverage Δ | |
|---|---|---|
| custom-integration1 | 100.00% <ø> (+99.99%) |
:arrow_up: |
| integration | 100.00% <ø> (+99.99%) |
:arrow_up: |
| integration1 | 100.00% <ø> (+99.99%) |
:arrow_up: |
| integration2 | 0.00% <ø> (ø) |
|
| java-11 | 63.76% <ø> (+2.05%) |
:arrow_up: |
| java-21 | 63.64% <ø> (+2.02%) |
:arrow_up: |
| skip-bytebuffers-false | 63.77% <ø> (+2.02%) |
:arrow_up: |
| skip-bytebuffers-true | 63.63% <ø> (+35.90%) |
:arrow_up: |
| temurin | 63.77% <ø> (+2.02%) |
:arrow_up: |
| unittests | 63.77% <ø> (+2.02%) |
:arrow_up: |
| unittests1 | 55.43% <ø> (+8.54%) |
:arrow_up: |
| unittests2 | 34.32% <ø> (+6.59%) |
:arrow_up: |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
We should avoid touching the code copy-pasted from Calcite unless absolutely necessary. Modifying the copy pasted code means extra management overhead going forward, which IMO is not worth it. Do you see an easier way to fix these parsing errors? If not, I guess we can instead improve our documentation to help users
If babel is the root cause, we can consider using a different config to reduce the non-preserved keyword
I don't think this problem can be addressed with better documentation because errors returned by parsers have: A. wrong position B. point to wrong token C. suggest empty list of tokens or wrong tokens That means user can't look at the error position and guess that it's caused by use of reserved keyword (see example in issue description).
Switching to lookahead=1 reveals a number of keyword duplicates in config.fmpp and default_config.fmpp (as compilation errors in parser) and a number of clashes, mostly between productions and nonReservedKeywords. Quick test shows that we could minimize the number of added lookaheads by reducing the number of non-reserved or reserved keywords (the list is definitely too big) .
Given we use a slightly modified version of Apache Calcite parser... is it safe to assume the standard Apache Calcite parser has this issue as well? If that is the case I think we should ask in their email list/open an issue there. What do you think?
I've just verified this is an issue for Calcite as well. I've executed sqlline as specified in https://calcite.apache.org/docs/tutorial.html and then I've queried SELECT * FROM emps as current and the returned error is:
> SELECT * FROM emps as current;
Error: Error while executing SQL "SELECT * FROM emps as current": parse failed: Encountered "current" at line 1, column 23.
Was expecting one of:
<BRACKET_QUOTED_IDENTIFIER> ...
<QUOTED_IDENTIFIER> ...
<BACK_QUOTED_IDENTIFIER> ...
<BIG_QUERY_BACK_QUOTED_IDENTIFIER> ...
<HYPHENATED_IDENTIFIER> ...
<IDENTIFIER> ...
<UNICODE_QUOTED_IDENTIFIER> ... (state=,code=0)
@bziobrowski can you ask in Apache Calcite about that?
Yes, calcite's babel parser has the same issue (core parser is better). It comes mainly from using nonReservedKeywords and fixing it with lookahead=2 setting, not from babel sql conformance . With current parser grammar Lookahead=2 greatly expands the list of 'expected' tokens and breaks error position and token.