grammars-v4
[rexx] Separate dialects from ANSI standard Rexx
The current Rexx grammar supports several non-standard syntaxes. These are peculiar to one dialect of Rexx, and should be moved to a separate grammar that makes it clear which dialect it is. There are also quite a few ANSI Standard Rexx elements that are not included in the current grammar. These should be cleaned up.
I propose to:
- Add all the missing ANSI elements to the existing Rexx*.g4 files, and add a comment stating that they explicitly implement the ANSI standard syntax.
- Move the non-standard elements out to dialect-specific grammar files, named and commented to clearly identify the dialect. The dialect files will import the ANSI standard files, so there will be very little duplication - just the tokens and productions that are modified by the dialect.
- At a later date, I may create dialect files for some of the major dialects (e.g., Regina, IBM CMS, IBM Rexx Compiler). In each case, the variations are quite limited - every known dialect hews pretty close to the standard.
The non-standard dialect files would be named appropriately, and would be in the same directory as the standard ones.
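As a rough illustration of the import approach described above, a dialect parser might look something like the following. This is only a sketch: the file name matches the naming idea in this thread, but `some_instruction`, `ansi_form`, and `tso_only_form` are placeholder names, not rules from the real grammar, and the assumption that a dialect lexer called `IbmTsoERexxLexer` exists is mine.

```antlr
// IbmTsoERexxParser.g4 -- hypothetical sketch of a dialect parser grammar.
parser grammar IbmTsoERexxParser;

options {
    // Assumed dialect lexer, which would itself import the ANSI lexer.
    tokenVocab = IbmTsoERexxLexer;
    // Shared superclass discussed in this thread.
    superClass = RexxParserBase;
}

// Pull in all rules from the ANSI parser grammar (RexxParser.g4).
import RexxParser;

// A rule defined in the importing grammar overrides the imported rule
// with the same name. All names below are placeholders that would need
// to be replaced with real rule names before this compiles.
some_instruction
    : ansi_form
    | tso_only_form
    ;
```

As far as I understand ANTLR's import mechanism, options in an imported grammar are not inherited, which is why the sketch repeats `tokenVocab` and `superClass` in the dialect file.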
@bknaysi - you may be interested in this issue. I don't know how your folks use the current Rexx grammar, but I expect the impact to them would be simply some filename changes and a few extra files to add to their projects.
It would only affect us if the lexer tokens or parser rule definitions changed - which we are okay with - as we acknowledge that dialect differences exist. I agree that it would be optimal if the ANSI standard tokens were defined in a separate lexer file as it would allow for dialect specific parser files without introducing too much duplication (at least on the lexer side).
I'm not contemplating any changes to token names or rule names as part of this change. The differences between the current Rexx*.g4 files and the ANSI standard are numerous, but small. The original author of the parser clearly attempted to transliterate the ANSI standard into ANTLR code. The only major place they didn't do that is in the DO instruction, and I expect to leave the current rule names unchanged.
"The non-standard dialect files would be named appropriately, and would be in the same directory as the standard ones."
@kaby76 Would having multiple grammars in the same directory cause your "trash" tools (or their use in the unit tests) any grief? I'm thinking that I'd have a structure like this:
- grammars-v4/rexx
  - RexxLexer.g4, RexxParser.g4 - the ANSI standard language, no imports
  - IbmTsoERexxLexer.g4, IbmTsoERexxParser.g4 - the "TSO/E" dialect, importing RexxParser.g4 and adding a few overrides
  - ReginaRexxLexer.g4, ReginaRexxParser.g4 - (eventually) the "Regina" dialect, importing RexxParser.g4 and adding a few overrides
  - RexxParserBase.java - the Java parser superclass for all dialects (at least so far)
  - RexxParserBase.py - the Python parser superclass for all dialects (at least so far)
Or would it be better to create a tree of folders, something like this:
- grammars-v4/rexx
  - ANSI
    - Rexx*.g4
  - IbmTsoE
    - IbmTsoERexx*.g4 - importing ../ANSI/Rexx*.g4
  - Regina
    - ReginaRexx*.g4 - importing ../ANSI/Rexx*.g4
  - RexxParserBase.java
  - RexxParserBase.py
I don't expect there will ever be a need for different superclass files for different dialects, so I'd leave those at the top level.
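One detail worth keeping in mind for the folder-tree variant: an ANTLR `import` names a grammar, not a file path, so the dialect files could not literally say `../ANSI/...`. A minimal sketch of how a dialect lexer in a subfolder might look, assuming the tool is pointed at the ANSI folder with its `-lib` option (the file name and folder layout are the hypothetical ones from the list above):

```antlr
// IbmTsoE/IbmTsoERexxLexer.g4 -- hypothetical file in the tree layout.
lexer grammar IbmTsoERexxLexer;

// "import" takes a grammar name only. To resolve RexxLexer from the
// sibling ANSI folder, the ANTLR tool would be run with -lib, e.g.:
//   antlr4 -lib ../ANSI IbmTsoERexxLexer.g4
// (the "antlr4" command name is an assumption about the local setup)
import RexxLexer;

// Dialect-only tokens would go here. Lexer rules defined in the
// importing grammar take precedence over the imported ones.
```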
trgen should be able to handle either for most targets except Go. For Go, it cannot follow either of these directory structures because the language doesn't allow scoping that doesn't mirror the directory structure. That's why the Antlr Go runtime is completely flat--it's just a terrible language.
"I'm not contemplating any changes to token names or rule names as part of this change."
Almost, but not quite.
- The current `Caret_Return_` lexer fragment will be renamed to `Carriage_Return_`, which is the correct term for `\r`. This shouldn't matter to anyone, because lexer fragments aren't visible outside the lexer itself. They can't be used in the parser at all. And they aren't tokens in their own right, so they can't appear in a parse tree.
- The current `LINE_COMMENT` token will be deleted, because neither the ANSI standard nor the TSO/E dialect reference supports to-end-of-line comments. These tokens are currently put on the `HIDDEN` channel, so this shouldn't be visible in the parse tree, but if there is a dialect I'm not aware of that supports such comments, this will be a breaking change for its users.
- The current `BLOCK_COMMENT` token will be renamed to `COMMENT`, as it will be the only kind of comment. These tokens are currently put on the `HIDDEN` channel, so this shouldn't be visible in the parse tree.
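For reference, the affected lexer rules might end up looking roughly like this after the renames. This is a sketch of the idea only, not the actual rule bodies from the current grammar; in particular, the recursive form of `COMMENT` is my assumption about how nested Rexx comments would be handled.

```antlr
// Renamed fragment: usable only inside other lexer rules, never a token.
fragment Carriage_Return_
    : '\r'
    ;

// The only remaining comment token, kept off the default channel so it
// does not appear among the tokens the parser sees.
// The recursive reference is an assumed way to allow nested comments.
COMMENT
    : '/*' ( COMMENT | . )*? '*/' -> channel(HIDDEN)
    ;
```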