Enhanced Keywords
Rewrite the Reserved Keywords Logic:
- Automatically derive all Keyword Tokens from the Parser.
- Explicitly add Reserved Keywords to the Grammar and classify why/when each is reserved.
- Auto-generate the list of Allowed Keywords as the difference of 1) and 2) above.
- Provide a Gradle task to semi-automatically generate RelObjectNameWithoutValue() based on 3).
- Parametrized Tests for Keywords, testing all keywords.
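For illustration, step 3) boils down to a plain set difference. A minimal sketch in Java (class and method names are illustrative, not the actual PR API):

```java
import java.util.Set;
import java.util.TreeSet;

// Sketch of step 3): Allowed Keywords = All Keyword Tokens - Reserved Keywords.
// "allKeywords" would come from scanning the Grammar (step 1),
// "reservedKeywords" from the explicit declarations (step 2).
public final class KeywordSets {
    public static Set<String> allowedKeywords(Set<String> allKeywords, Set<String> reservedKeywords) {
        Set<String> allowed = new TreeSet<>(allKeywords); // sorted, for stable generation
        allowed.removeAll(reservedKeywords);
        return allowed;
    }
}
```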
Advantages: a) We now have clear documentation of which Keywords are reserved and for what reason, with proper tests. b) Allowed Keywords are generated (semi-)automatically, especially when more Tokens are added to the Grammar. New Tokens won't break SQL Statements that parsed fine before.
To-Dos:
c) Composite Tokens do not work well and still need to be added manually to ALL_KEYWORDS (we would need to refactor the Grammar in order to avoid such composite tokens)
d) @wumpz: Clarify the meaning/purpose of the various RelObjectNamexxxx() Productions, so we can auto-generate them too.
e) Use the Gradle task to fully inject the RelObjectNamexxxx() Productions (instead of manually updating the code)
Resolves one more Special Oracle Test. Fixes #1148 Fixes #1450 Fixes #1443 Fixes #1462 Fixes #1508 Fixes #1538
@Wumpz: Please check out the great work on PR #1254. It adds a new token QUICK, which of course should not become a Reserved Keyword.
Of course the author @OlivierCavadenti has no way of knowing that he would need to add QUICK to the RelObjectNamexxxx() productions in order to enable this token as an Object Identifier. (And how could he? I figured that out only six months after my first contribution.)
This PR would solve this challenge reliably, preventing new tokens from breaking valid statements and easing entry for new contributors.
The naming of these RelObjectName ... methods is surely an issue that should be addressed in further improvements. So sorry for that: it's grown ... ;)
I am not a fan of injecting those keywords, since not every keyword can be introduced this way; some need other parts of the grammar to be modified. However, building this keyword testing using JUnit 5 is cool.
The naming of these RelObjectName ... methods is surely an issue that should be addressed in further improvements. So sorry for that: it's grown ... ;)
You would do me a great favor by documenting the purpose of those 5 methods verbosely. I tried my best to figure it out by trial and error, but I am still not exactly sure why we ended up with 5 methods. (And no worries about grown code.)
I am not a fan of injecting those keywords, since not every keyword can be introduced this way; some need other parts of the grammar to be modified. However, building this keyword testing using JUnit 5 is cool.
Honestly, I am not a big fan of that PR either -- but after spending many days on this matter trying various things, it was the best approach I found, and in my opinion it is still much better than the current state:
- At the moment, Keywords are just a mess and we get a lot of issues about keywords alone. Are you able to document right now what keywords are reserved and for what reason? I do not think so, and the PR would solve this documentation problem.
- Today, even when you have an opinion on keywords, we have no parametrized tests. You kindly admit that these tests are kind of cool, and I appreciate your feedback. Those parametrized tests, however, depend on a well-defined list of keywords, which the PR provides. (Although we can argue, of course, how/where exactly such a list should be defined.) A sketch of such a test follows below.
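A minimal sketch of such a parametrized test, assuming the generated list of allowed keywords is exposed by a helper (the ParserKeywordsSketch holder is hypothetical; CCJSqlParserUtil.parse() is the regular parsing entry point):

```java
import static org.junit.jupiter.api.Assertions.assertDoesNotThrow;

import java.util.stream.Stream;
import net.sf.jsqlparser.parser.CCJSqlParserUtil;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.MethodSource;

class AllowedKeywordsTest {

    // hypothetical supplier of the generated allowed-keyword list
    static Stream<String> allowedKeywords() {
        return ParserKeywordsSketch.getAllowedKeywords().stream();
    }

    @ParameterizedTest
    @MethodSource("allowedKeywords")
    void keywordWorksAsIdentifier(String keyword) {
        // every allowed keyword must be usable as a plain table name
        assertDoesNotThrow(() -> CCJSqlParserUtil.parse("SELECT * FROM " + keyword));
    }
}
```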
not every keyword can be introduced this way; some need other parts of the grammar to be modified
- While that is true for the PR, it is already true today. Also, we could easily enforce a policy of "Keywords are defined by simple Tokens" (not allowing complex Tokens for Keywords).
The workflow is to: a) define the Tokens and Productions in the Grammar first, and then b) document any new RESERVED Keywords in the List (emphasis on RESERVED).
The PR has one big advantage: by defining RESERVED keywords only, it will automatically determine the allowed Keywords and modify the Grammar semi-automatically.
Allow me to ask please:
- What is your preferred way to manage and document RESERVED Keywords?
- What is your preferred way to maintain the allowed Tokens/Keywords in RelObjectName()?
- What is your preferred way to test all the possible Keywords (automatically)?
And can we have a Skype chat about this :-)
The reserved keywords are introduced for syntactic rules. I do not get what you mean by documenting the keywords. The documentation or meaning of a keyword comes from its usage in syntactic rules. The introduction into RelObjectName or the other methods came by request.
The split of this method came from the need to avoid LOOKAHEADs. Processing those changes the parsing order. Introducing them automatically or in large numbers changes the parse order in an unpredictable way. So I avoided introducing new LOOKAHEADs wherever possible.
That's why I am so reluctant to introduce those automatically.
The reserved keywords are introduced for syntactic rules. I do not get what you mean by documenting the keywords.
How exactly does a user know what keywords are reserved and which ones he has to quote? In H2, he gets the list from the website. In Oracle, he gets the list from the website. And in JSqlParser?
How do we set up a consistent Use Case Test which automatically tests all keywords? And how do we ensure that it is maintained properly after every change and every new keyword?
So what I have done is to extract all tokens and allow them, as long as they are not reserved. This reliably answers both problems mentioned above.
The documentation or meaning of a keyword comes from its usage in syntactic rules. The introduction into RelObjectName or the other methods came by request. The split of this method came from the need to avoid LOOKAHEADs. Processing those changes the parsing order. Introducing them automatically or in large numbers changes the parse order in an unpredictable way. So I avoided introducing new LOOKAHEADs wherever possible.
That's why I am so reluctant to introduce those automatically.
We agree that LOOKAHEADs are bad and need to be avoided. However, we are arguing about whether avoiding LOOKAHEADs justifies false parser errors (from keywords artificially not allowed).
The current approach of "enabling on request" is half-baked and "neither fish nor fowl"; it just dodges the discussion. Over the last 18 months, most of the issues reported were about "keyword not supported, why?".
So, if you do not like the suggested solution, then please decide how exactly Keywords are to be maintained and documented.
The reserved keywords are introduced for syntactic rules. I do not get what you mean by documenting the keywords.
How exactly does a user know what keywords are reserved and which ones he has to quote? In H2, he gets the list from the website. In Oracle, he gets the list from the website. And in JSqlParser?
JSqlParser does not provide it. That's right, but users of JSqlParser often cannot change the underlying SQL. Hence the feature requests to allow this or that keyword as an identifier. But since we do not stick to one DBMS and cannot build a "clean" grammar, we have to go the other way around. Wouldn't it hurt more to publish a list of reserved keywords, since each entry could be disputed and eventually allowed?
How do we set up a consistent Use Case Test which automatically tests all keywords? And how do we ensure that it is maintained properly after every change and every new keyword?
Since reserved keywords are transformed into allowed keywords by feature request, until now test cases were the result, and those are still and will remain part of JSqlParser. So you are right that this could be forgotten. Maybe I did not get what you are trying to achieve. I will read on ... :)
So what I have done is to extract all tokens and allow them, as long as they are not reserved. This reliably answers both problems mentioned above.
The documentation or meaning of a keyword comes from its usage in syntactic rules. The introduction into RelObjectName or the other methods came by request. The split of this method came from the need to avoid LOOKAHEADs. Processing those changes the parsing order. Introducing them automatically or in large numbers changes the parse order in an unpredictable way. So I avoided introducing new LOOKAHEADs wherever possible. That's why I am so reluctant to introduce those automatically.
We agree that LOOKAHEADs are bad and need to be avoided. However, we are arguing about whether avoiding LOOKAHEADs justifies false parser errors (from keywords artificially not allowed).
What do you mean by false parse errors? Those errors occur if a reserved keyword is used. JSqlParser needs to disallow some keywords to allow a multi-database grammar. If by false parse errors you mean that JSqlParser fails to parse a statement a specific database could process, then yes, those exist.
The current approach of "enabling on request" is half-baked and "neither fish nor fowl"; it just dodges the discussion. Over the last 18 months, most of the issues reported were about "keyword not supported, why?".
So, if you do not like the suggested solution, then please decide how exactly Keywords are to be maintained and documented.
Please hold off on your response; I will recheck your implementation. Still, I think I missed something.
So if I understand this right, this is the workflow:
- build the parser using javacc
- compile it
- run it to get the derived RelObjectNames rule (and the others):

```java
System.out.println( buildGrammarForRelObjectNameWithoutValue() );
System.out.println( buildGrammarForRelObjectName() );
```

- I assume you copy and paste those into the grammar file
- build the parser using javacc again
- compile it
- use it
However, my misunderstanding was about where you are deriving those keywords from. So the list of keywords is somehow fixed (at the moment in CCJSqlParser.getDefinedKeywords). I follow your idea to have a fixed list of those keywords, including their, let's call it, visibility type, and to build a complete test for those.
What I dislike is this roundtrip. Why do we need to process JavaCC to be able to process JavaCC? So the list of keywords should be externalized from the grammar, so you could do something like:
- process a list of keywords (text file, ...)
- change the specific parts in the grammar: CCJSqlParser.getDefinedKeywords and the production rules (this should be done in Maven/Gradle alone, without a compile)
- build the parser
- run tests using CCJSqlParser.getDefinedKeywords
Thank you so much for this discussion; I do appreciate looking into this topic with you.
So if I understand this right, this is the workflow: 1. build the parser using javacc 2. compile it 3. run it to get the derived RelObjectNames rule (and the others) 4. I assume you copy and paste those into the grammar file. 5. build the parser using javacc 6. compile it 7. use it
Yes, this is correct, although we both agree that it can't stay like that.
As soon as we reach an agreement on what to do, I would like to implement a Gradle task fully automating this. (Gradle has the great advantage that you can write Java/Groovy code inside the build script.)
However, my misunderstanding was about where you are deriving those keywords from. So the list of keywords is somehow fixed (at the moment in CCJSqlParser.getDefinedKeywords). I follow your idea to have a fixed list of those keywords, including their, let's call it, visibility type, and to build a complete test for those.
Yes, the Keywords have to be defined explicitly, including the reason for being a keyword (either SQL Standard or JSQLParser Grammar). From this list we can derive:
- The list of keywords for the documentation
- The list of allowed tokens as the difference between All Tokens and Keyword Tokens
- The parametrized tests for allowed tokens
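For illustration, such an explicit definition could look like this (a hypothetical shape; the names and classification values are illustrative, not the actual PR code):

```java
import java.util.Map;

// Every reserved keyword is declared together with the reason why it is
// reserved, so both the documentation and the tests can be generated
// from this single source.
public final class ReservedKeywords {
    public enum Reason { SQL_STANDARD, JSQLPARSER_GRAMMAR }

    public static final Map<String, Reason> RESERVED = Map.of(
            "SELECT", Reason.SQL_STANDARD,
            "FROM", Reason.SQL_STANDARD,
            "VALUE", Reason.JSQLPARSER_GRAMMAR);
}
```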
What I dislike is this roundtrip.
I do not like it either, but even after thinking about it I did not find any better solution. Any ideas are most welcome, of course.
Why do we need to process JavaCC to be able to process JavaCC?
Because we need to get ALL TOKENS first, and those are available only after JJTree has run over the grammar. Only when we have ALL TOKENS can we subtract the Keywords and define RelObjectNames and friends.
(The only alternative was to write/use a parser for the Grammar itself, but then you have just replaced running JJTree over the grammar with running that other parser over the grammar.)
So the list of keywords should be externalized from the grammar, so you could do something like:
1. process a list of keywords (text file, ...) 2. change the specific parts in the grammar: **CCJSqlParser.getDefinedKeywords** and the production rules (this should be done in Maven/Gradle alone, without a compile) 3. build the parser 4. run tests using **CCJSqlParser.getDefinedKeywords**
This would not be impossible; however, I see some challenges:
- How do you want to get ALL TOKENS?
- Maintaining the Keywords outside the Grammar again means a disconnect between Grammar and Configuration. How will this work when another Contributor adds new Tokens?!
- It is still a kind of roundtrip; you just hid it by saying "should be done in maven/gradle". We fully agree that the roundtrip needs to be automated, and I am committed to implementing that -- but only after we agree on the general solution (because it's a lot of work).
BTW, rebuilding RelObjectNames and friends is only necessary when new Tokens have been added.
The situation I have in mind is: a user submits a PR introducing new Tokens, but is not aware that those must be allowed explicitly, or else they become restricted keywords. (That happened to me when starting work on JSqlParser: I broke quite a few working statements during 4.2 and 4.3.)
Nevertheless, I will wrap it into proper Gradle tasks of course.
The Gradle task updateKeywords now updates and replaces RelObjectNameWithoutValue() automatically, without manual interaction.
It is still an optional call, so a full build would look like:

```
gradle updateKeywords
gradle build
```
I took some days to think about this again. The goal is to automate all allowed-keyword tests, right? And maybe the documentation.
So this externalization of keywords in getReservedKeywords is a helper method to build parameterized tests for all this stuff. But this results in this weird workflow of regenerating the grammar.
Wouldn't it be better to parse the grammar file and simply extract the lists that getReservedKeywords provides? After that, all automated tests could be done like you propose, and no regeneration of the grammar is needed.
By the way, renaming all RelObjectNames methods is a must.
I took some days to think about this again. The goal is to automate all allowed-keyword tests, right? And maybe the documentation.
Yes, and also: to automatically allow all Tokens that are not reserved keywords (especially after adding new tokens).
So this externalization of keywords in getReservedKeywords is a helper method to build parameterized tests for all this stuff. But this results in this weird workflow of regenerating the grammar.
No. getReservedKeywords is needed to know which tokens to allow in RelObjectName() and friends.
Wouldn't it be better to parse the grammar file and simply extract the lists that getReservedKeywords provides? After that, all automated tests could be done like you propose, and no regeneration of the grammar is needed.
I think you still miss the main point: Tests and Documentation are nice side products. The main goal, though, is to allow the permitted Tokens (as the difference between All Tokens and the Reserved Keywords) whenever new Tokens are added.
By the way, renaming all RelObjectNames methods is a must. Fully agreed.
Can we have a call on this topic maybe? I feel we could clarify the approach within a few minutes when we just talk about it.
So this externalization of keywords in getReservedKeywords is a helper method to build parameterized tests for all this stuff. But this results in this weird workflow of regenerating the grammar.
Actually, not much has changed: in 99% of all cases, you just build as before. Only when a new Token has been defined would you run gradle buildKeywords first, and even then it is optional (at the cost of outdated Allowed Tokens and Documentation).
You are right, but if we read those keywords directly from the grammar file, we never need to run this buildKeywords, because all tests always use the actual list. As always, the documentation should be much improved, as well as the source code documentation (if I may make this distinction).
Greetings.
I have modified the approach and now read the Tokens directly from the Grammar file without invoking JJTree. Thus all the Keywords have been moved out of the Grammar into ParserKeywordUtils, and we spare the redundant run.
It also handles the Composite Tokens.
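A simplified sketch of that RegEx-based scan (the real ParserKeywordUtils logic is more thorough, e.g. for the Composite Tokens; the class name and regex here are illustrative only):

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Scan the .jj grammar text for simple token definitions like
//     < K_SELECT: "SELECT" >
// and collect the quoted images, without relying on the K_/S_ name prefixes.
public final class GrammarTokenScan {
    private static final Pattern TOKEN_DEF =
            Pattern.compile("<\\s*\\w+\\s*:\\s*\"([^\"]+)\"\\s*>");

    public static Set<String> tokensOf(String grammarText) {
        Set<String> images = new TreeSet<>();
        Matcher matcher = TOKEN_DEF.matcher(grammarText);
        while (matcher.find()) {
            images.add(matcher.group(1));
        }
        return images;
    }
}
```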
Cheers.
I did not look into your modifications, but I think I managed to parse the grammar using JavaCC's classes. (https://github.com/javacc/javacc/issues/221)
So if I understand your modifications right, you are scanning for all tokens defined in the token part. These different RelObjectName productions are used for different purposes. So depending on which one a token is used in, the tests should be different, right?
I did not look into your modifications, but I think I managed to parse the grammar using JavaCC's classes. (javacc/javacc#221)
Yes, I saw that. Thank you for your effort. I am not sure if this is better or worse than parsing the Grammar per RegEx, because again you engage the JavaCC Parser (now only from inside Java instead of from the Gradle/Maven plugin as before).
At the same time, I really do not care much as long as we get the content of all tokens reliably (which seems to work both ways).
So if I understand your modifications right, you are scanning for all tokens defined in the token part.
Yes, but per RegEx (because I do not want to rely on the K_ or S_ prefixes and also want to get the keywords of the compound tokens).
These different RelObjectName productions are used for different purposes. So depending on which one a token is used in, the tests should be different, right?
Yes, that's why there are two tests so far.
So these two tests are enough? So we could merge some of those RelObjectName productions?
How we extract the keywords does not matter that much. What matters to me is the direct use of keywords without rebuilding a list, regenerating the grammar, rebuilding the parser, and so on.
However, to read out all keywords correctly one has to rebuild this part of JavaCC's grammar parser, so why not reuse the original and avoid trouble with incomplete regular expressions?
So these two tests are enough? So we could merge some of those RelObjectName productions?
I do see this PR as a first step to move us in a better direction. Certainly more fine-tuning will need to follow, although I feel this PR is quite massive already and I would like to see a merge first before touching anything else.
This particular development step combines the best of both worlds:
- it demonstrates the idea and the benefits
- it is still fully compatible with what has been there before and does not break anything
Please understand that maintaining multiple PRs in parallel is quite a nightmare for any contributor, and a lot of effort goes into keeping up with origin/master, depending on your merge order.
How we extract the keywords does not matter that much. What matters to me is the direct use of keywords without rebuilding a list, regenerating the grammar, rebuilding the parser, and so on.
However, to read out all keywords correctly one has to rebuild this part of JavaCC's grammar parser, so why not reuse the original and avoid trouble with incomplete regular expressions?
Fully agreed. I would provide a similar method using JavaCC instead of the RegEx (so we could switch later on demand). JavaCC would be the default, but we could still use the RegEx whenever we hit a snag.
I implemented the getAllKeywords() method based on JJTree and got it working.
Although there are pros and cons:
- We really get ALL keywords, even ones not declared as a Token -- that's awesome!
- The implementation is quite complex and verbose
- We now carry JavaCC as a dependency, although we needed it only for compilation -- I do not like that at all, but have no good idea what to do about it.
I still do not get why you want to hold this keyword list.
We want to:
- Create/Rewrite RelObject... automatically during the build process (at least whenever new tokens/productions have been added to the Grammar)
We will use the difference between ALL keywords and RESERVED keywords for this rewrite: ALL keywords parsed from the Grammar, RESERVED keywords from the manually defined list.
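For illustration, that rewrite boils down to emitting one production whose alternatives cover every allowed keyword. A rough sketch (the real generator is buildGrammarForRelObjectNameWithoutValue(); the production shape shown here is simplified):

```java
import java.util.Set;
import java.util.stream.Collectors;

// Emit a (simplified) JavaCC production that whitelists each allowed
// keyword, so it may be used as an object identifier.
public final class ProductionWriter {
    public static String relObjectNameWithoutValue(Set<String> allowedKeywords) {
        String alternatives = allowedKeywords.stream()
                .map(keyword -> "tk=<K_" + keyword + ">")
                .collect(Collectors.joining(" | "));
        return "String RelObjectNameWithoutValue() :\n"
                + "{ Token tk = null; }\n"
                + "{\n"
                + "    ( " + alternatives + " )\n"
                + "    { return tk.image; }\n"
                + "}";
    }
}
```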
Since we could extract all keywords, like you mentioned, and read all RelObj .. methods, we have all lists we need to get the reserved keyword list directly from the grammar without rewriting it.
It is the other way around: we get the ALLOWED keywords and rewrite RelObjectNames... accordingly.
Sure, as I already mentioned, the RelObj... methods need to be renamed to better reflect their meaning and use.
Certainly. I have asked you above for an explanation of what you had in mind when introducing the 4 methods. (The first 2 are clear to me, but I struggle to understand the methodology behind ...Ext and ...Ext2.)
To revive this.
I still do not get why you want to hold this keyword list.
We want to:
- Create/Rewrite RelObject... automatically during the build process (at least whenever new tokens/productions have been added to the Grammar)
Again I don't get why you want to create the RelObject productions during the build process. Why shouldn't the grammar be the only point of defining keywords and reserved keywords?
We will use the difference between ALL keywords and RESERVED keywords for this rewrite: ALL keywords parsed from the Grammar, RESERVED keywords from the manually defined list.
Since we could extract all keywords, like you mentioned, and read all RelObj .. methods, we have all lists we need to get the reserved keyword list directly from the grammar without rewriting it.
It is the other way around: we get the ALLOWED keywords and rewrite RelObjectNames... accordingly.
Again look at the first point. Why should it be better to go this way? My way of thinking is that the grammar is carved in stone and all needed data should be derived from it.
Sure, as I already mentioned, the RelObj... methods need to be renamed to better reflect their meaning and use.
Certainly. I have asked you above for an explanation of what you had in mind when introducing the 4 methods. (The first 2 are clear to me, but I struggle to understand the methodology behind ...Ext and ...Ext2.)
+5
You know I am very reluctant to go your way. To clarify: my big problem is modifying the grammar during the build process instead of reading the grammar using JavaCC and deriving all needed data from it. The build process is driven from the running source code and therefore introduces another level of maintenance for Maven and Gradle builds.
I have no problem with, in fact I am very much in favour of, including automatic JUnit tests for all possible RelObject rule tokens like you did. The RelObject rule naming must be clarified and their number decreased.
You know I am very reluctant to go your way. To clarify: my big problem is modifying the grammar during the build process
We do not and will not modify the Grammar during every build process. We provide one special task, to be invoked only on demand and only when the Token Definitions have changed (e.g. after someone provides a PR introducing new tokens).
You can compare that with your Maven License Task, which inserts missing license headers. Now we have a Keyword Task which whitelists Tokens when possible.
instead of reading the grammar using JavaCC and deriving all needed data from it.
We read the Grammar and we use JavaCC to extract the Tokens.
Oversimplified: Whitelisted Keywords = ALL Keywords - RESTRICTED Keywords
Please tell me how to maintain that equation automatically and in JavaCC only.
I have no problem with, in fact I am very much in favour of, including automatic JUnit tests for all possible RelObject rule tokens like you did. The RelObject rule naming must be clarified and their number decreased.
The tests are only a marginal side product of the approach. Somehow I still fail to convey the major problem we are trying to solve: automatic management of the Token Whitelisting inside RelObject and friends.
Again, I think the best way to reach a decision would be a call, to directly explain and respond to questions. Please PM me when possible. Thank you!
I gave it some extra thought, and I think the automatic step to patch the grammar when keywords are whitelisted makes sense. However, I would be a bit reluctant to include tokens that are composed, like CURRENT DATE_TIME. I think they should be handled in a specific parsing rule for (in this case) dates. That would eliminate all the LOOKAHEAD(2) as well, I think?
However, I would be a bit reluctant to include tokens that are composed
You can easily classify it as a Restricted Keyword. The oversimplified equation is: Allowed Keywords = All Keywords - Restricted Keywords.
All Keywords will be determined by the Grammar. Restricted Keywords will have to be declared manually.
I do not see a particular reason to NOT allow CURRENT_DATE or CURRENT_TIMESTAMP -- but that's not the hill I want to die on.
I meant CURRENT TIMESTAMP (without the underscore) ... But I agree on the hill :)