chevrotain
Ability to export grammar as EBNF text.
Flow:
- call getGastProductions() on a parser instance.
- build a walker to print out the Gast (grammar representation) as an EBNF text.
Advanced features:
- Generate HTML with advanced features:
  - anchors (links)
  - find usages.
  - find definitions.
  - ?
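A minimal sketch of the flow described above: take a grammar tree and walk it to print W3C-style EBNF. The `GrammarNode` type and the `toEbnf` function are hypothetical, simplified stand-ins for the structure returned by getGastProductions(); they are not Chevrotain's actual classes or API.

```typescript
// Hypothetical, simplified stand-ins for the grammar representation (Gast) nodes.
type GrammarNode =
  | { kind: "Rule"; name: string; definition: GrammarNode[] }
  | { kind: "NonTerminal"; name: string }
  | { kind: "Terminal"; name: string }
  | { kind: "Option"; definition: GrammarNode[] }              // zero or one
  | { kind: "Repetition"; definition: GrammarNode[] }          // zero or more
  | { kind: "RepetitionMandatory"; definition: GrammarNode[] } // one or more
  | { kind: "Alternation"; alternatives: GrammarNode[][] };

// Prints a sequence of nodes separated by spaces.
function printSequence(nodes: GrammarNode[]): string {
  return nodes.map(toEbnf).join(" ");
}

// Adds a suffix operator; parentheses are skipped when the body is a single symbol.
function printGroup(nodes: GrammarNode[], suffix: string): string {
  const isSingleSymbol =
    nodes.length === 1 &&
    nodes.every((n) => n.kind === "NonTerminal" || n.kind === "Terminal");
  const body = printSequence(nodes);
  return isSingleSymbol ? `${body}${suffix}` : `( ${body} )${suffix}`;
}

// The walker: one case per node kind.
function toEbnf(node: GrammarNode): string {
  switch (node.kind) {
    case "Rule":
      return `${node.name} ::= ${printSequence(node.definition)}`;
    case "NonTerminal":
    case "Terminal":
      return node.name;
    case "Option":
      return printGroup(node.definition, "?");
    case "Repetition":
      return printGroup(node.definition, "*");
    case "RepetitionMandatory":
      return printGroup(node.definition, "+");
    case "Alternation":
      return `( ${node.alternatives.map(printSequence).join(" | ")} )`;
  }
}

// Example: the "row" rule of a CSV grammar.
const rowRule: GrammarNode = {
  kind: "Rule",
  name: "row",
  definition: [
    { kind: "NonTerminal", name: "field" },
    {
      kind: "Repetition",
      definition: [
        { kind: "Terminal", name: "Comma" },
        { kind: "NonTerminal", name: "field" },
      ],
    },
    { kind: "Terminal", name: "NewLine" },
  ],
};
console.log(toEbnf(rowRule)); // row ::= field ( Comma field )* NewLine
```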
I wouldn't mind taking a shot at this feature if it's still up for grabs. It'd be useful.
This is up for grabs. Generating a simple EBNF text should be straightforward, as the grammar structure is already well defined. Making it pretty with advanced features may be more complicated; I am uncertain how much, as I was never good at HTML/CSS stuff 😄
For the "find usages" and "find definitions" features, what are you envisioning?
Have a look at the playground:
- https://sap.github.io/chevrotain/playground/
Hover on the syntax diagrams and see what lights up. Also try to click on terminals/non-terminals and see that it navigates to the code.
Of course the EBNF output would not be linked to the source code. But it could be possible to scroll down/up to a non-terminal definition, or highlight usages of a production which is hovered on.
But these are the advanced features of generating the EBNF in HTML. The most basic feature would be to generate a plain text file that could be included in some language's documentation...
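As a rough illustration of the "scroll to a non-terminal definition" idea, one option is to give each rule an `id` and turn every reference to a known rule into a link to that `id`. The helper names `escapeHtml` and `ruleToHtml` and the chosen markup are assumptions made up for this sketch; nothing here is part of Chevrotain.

```typescript
// Escapes the characters that are special in HTML text.
function escapeHtml(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Hypothetical helper: renders one rule as an HTML line.
// Symbols that name a known rule become links to that rule's anchor.
function ruleToHtml(ruleName: string, bodySymbols: string[], knownRules: string[]): string {
  const body = bodySymbols
    .map((sym) =>
      knownRules.indexOf(sym) >= 0 ? `<a href="#${sym}">${sym}</a>` : escapeHtml(sym)
    )
    .join(" ");
  return `<div id="${ruleName}">${ruleName} ::= ${body}</div>`;
}

const hdrHtml = ruleToHtml("hdr", ["row"], ["csvFile", "hdr", "row", "field"]);
console.log(hdrHtml); // <div id="hdr">hdr ::= <a href="#row">row</a></div>
```

Highlighting usages on hover would need some script or CSS on top of this, which the sketch leaves out.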
@bd82 Sorry, I unfortunately haven't had a whole lot of time to work on this lately. I did finally get to do an initial pass through today, though; you can see the work in progress at the above commit.
Currently, running the example CSV grammar through it results in:
csvFile ::= hdr row+
hdr ::= row
row ::= field ( Comma field )* NewLine
field ::= ( Text | String | EMPTY_ALT )
Comma ::= ,
NewLine ::= \r?\n
Text ::= [^,\n\r"]+
String ::= "(?:""|[^"])*"
Things could be cleaned up a bit at various places, but this seems like a reasonable(-ish) first step. I'll keep working as time allows. Any thoughts as it currently stands?
This looks good.
- We can go for MVP as the first step.
- I am not sure how we should represent the terminals, e.g: "(?:""|[^"])" vs /(?:""|[^"])/
- Maybe we can use regExpToAst library to identify simple vs complex regExps and display those differently.
- Maybe a newline between each terminal rule?
- We can go for MVP as the first step.
Cool. That seems to me to be the right way to go, too.
- I am not sure how we should represent the terminals, e.g: "(?:""|[^"])" vs /(?:""|[^"])/
Me neither! I kept switching back and forth between those two formats and couldn't really decide between them. Ultimately, I looked around at some other EBNFs and found at least a couple where RegExp patterns are seemingly presented "raw" (without the / delimiters). For example:
- https://www.w3.org/TR/sparql11-query/#rEXPONENT
- https://www.w3.org/TR/2008/REC-xml-20081126/#NT-EncName
So, I ultimately left them that way. I think it just looks particularly weird in the CSV example case, especially because, for example, the Comma terminal is defined with a RegExp pattern rather than with a string (so, you get just ,). Typically, I think, that terminal would be defined as a string, in which case the EBNF walker would output:
Comma ::= ','
which is at least a little less weird looking.
- Maybe we can use regExpToAst library to identify simple vs complex regExps and display those differently.
Good idea; I'll look into this.
- Maybe a newline between each terminal rule?
👍
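To make the "simple vs complex RegExp" suggestion concrete, here is a small heuristic sketch. It deliberately does not use the regExpToAst library mentioned above; it only checks for flags and metacharacters, which is a cruder, swapped-in technique. The function name `isSimpleLiteral` is hypothetical.

```typescript
// Heuristic stand-in (not an AST-based analysis): a pattern counts as "simple"
// when it has no flags and its source contains no RegExp metacharacters,
// i.e. it can only match one fixed string.
function isSimpleLiteral(pattern: RegExp): boolean {
  // Flags such as `i` change what is matched, so flagged patterns are never treated as plain literals.
  if (pattern.flags !== "") {
    return false;
  }
  // Any metacharacter, including a backslash escape, makes the pattern "complex" in this sketch.
  return !/[\\^$.*+?()[\]{}|]/.test(pattern.source);
}

console.log(isSimpleLiteral(/,/));       // true
console.log(isSimpleLiteral(/\r?\n/));   // false
console.log(isSimpleLiteral(/foo|bar/)); // false
```

A pattern that passes this check could be shown as a quoted string, and everything else as a full RegExp.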
What if we have a regExp:
/foo|bar/
and we also have NonTerminals called "foo" and "bar". Won't it create ambiguous EBNF?
// Is this a Non-Terminal with two alternatives?
// Or a Terminal which can match the literal string "foo" or the literal string "bar"?
TerminalX ::= foo|bar
ECMAScript regExps also have constructs that do not exist in EBNF or, worse yet, may mean different things in EBNF.
e.g. in EBNF, B{2} is a 'B' char followed by zero or more repetitions of the digit 2:
- B2222
- B2
- B22222222222
- B (0 repetitions)
In ECMAScript regexps, however, the same expression B{2} would mean 'B' repeated exactly twice.
- BB
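The two readings of B{2} can be checked directly. In this sketch the first RegExp is the expression as ECMAScript interprets it, and the second is an ECMAScript spelling of the EBNF reading described above (a 'B' followed by zero or more 2s). The variable names are made up for the example.

```typescript
// ECMAScript reading: 'B' repeated exactly twice.
const ecmaReading = /^B{2}$/;
// EBNF reading of B{2}, rewritten as an ECMAScript RegExp: 'B' then zero or more '2'.
const ebnfReading = /^B2*$/;

console.log(ecmaReading.test("BB"));    // true
console.log(ecmaReading.test("B2222")); // false
console.log(ebnfReading.test("B2222")); // true
console.log(ebnfReading.test("B"));     // true
console.log(ebnfReading.test("BB"));    // false
```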
Won't it create ambiguous EBNF?
Yes, you're right. That's enough to convince me. I also realized that there are other problematic cases, such as when the RegExp pattern is case-insensitive. Rather than show just the source (e.g., 'ab') in that case -- or, worse, try to reconstruct an equivalent representation (e.g., '[aA][bB]') -- it would be best to show the full pattern with flags (e.g., '/ab/i'). So, delimiters it is!
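A minimal sketch of that decision: string patterns are printed as quoted literals, and RegExp patterns are printed with their delimiters and flags, which is what `RegExp.prototype.toString()` already returns. The function name `formatTerminalPattern` is hypothetical and only illustrates the idea.

```typescript
// Hypothetical formatter for the right-hand side of a terminal rule.
function formatTerminalPattern(pattern: string | RegExp): string {
  if (typeof pattern === "string") {
    // Use double quotes only when the literal itself contains a single quote.
    return pattern.indexOf("'") >= 0 ? `"${pattern}"` : `'${pattern}'`;
  }
  // RegExp.prototype.toString() yields "/source/flags", so flags such as `i` are preserved.
  return pattern.toString();
}

console.log(`Comma ::= ${formatTerminalPattern(",")}`);           // Comma ::= ','
console.log(`TerminalX ::= ${formatTerminalPattern(/foo|bar/)}`); // TerminalX ::= /foo|bar/
console.log(formatTerminalPattern(/ab/i));                        // /ab/i
```

With the delimiters in place, /foo|bar/ can no longer be mistaken for an alternation of the non-terminals foo and bar.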
in EBNF: B{2} is a repetition of the 2 digit up to infinite times preceded by a 'B' char.
Note that I have not been using this notation in my construction of the EBNF, though. I have instead been using the W3C's EBNF notation. This is partly for the reasons given here, but also because I suspect that the W3C style is likely to be more familiar (given its similarity to standard RegExp) and because it is the style that I know best. Would you prefer that I switch to the ISO-14977 style, or maybe add an argument that allows the user to specify the preferred style?
Another question: what do we do with tokens that have custom token patterns?
Would you prefer that I switch to the ISO-14977 style?
No, I think the W3C style is easier to understand as well.
maybe add an argument that allows the user to specify the preferred style?
Only if you have time, probably outside MVP scope.
Another question: what do we do with tokens that have custom token patterns?
We can't really define those. I guess we should just add a comment to their definition saying "Custom Token Pattern" and document this behavior in the guide/examples.
Note that when generating diagrams we use the Token Labels when possible. We should probably do so when generating the EBNF too.
- https://sap.github.io/chevrotain/documentation/4_3_3/interfaces/itokenconfig.html#label
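A sketch of how these two points could combine: prefer the token's label over its name, and emit a "Custom Token Pattern" comment when the pattern is neither a string nor a RegExp. The `TokenInfo` interface is a hypothetical, simplified view of a token definition, and `describeTerminal` is a made-up helper; the `/* ... */` comment form follows the W3C notation.

```typescript
// Hypothetical, simplified view of a token definition (not Chevrotain's actual type).
interface TokenInfo {
  name: string;
  label?: string;
  // A function here stands in for any custom token pattern.
  pattern?: string | RegExp | ((text: string, offset: number) => unknown);
}

// Prefers the label, falling back to the token name.
function displayName(tok: TokenInfo): string {
  return tok.label !== undefined ? tok.label : tok.name;
}

// Prints one terminal rule; custom patterns cannot be shown, so they get a comment instead.
function describeTerminal(tok: TokenInfo): string {
  const shown = displayName(tok);
  if (typeof tok.pattern === "string") {
    return `${shown} ::= '${tok.pattern}'`;
  }
  if (tok.pattern instanceof RegExp) {
    return `${shown} ::= ${tok.pattern.toString()}`;
  }
  return `${shown} ::= /* Custom Token Pattern */`;
}

console.log(describeTerminal({ name: "Comma", pattern: "," }));
// Comma ::= ','
console.log(describeTerminal({ name: "Ident", label: "identifier", pattern: /[a-z]+/ }));
// identifier ::= /[a-z]+/
console.log(describeTerminal({ name: "Indent", pattern: () => null }));
// Indent ::= /* Custom Token Pattern */
```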