
Ability to export grammar as EBNF text.

Open bd82 opened this issue 7 years ago • 12 comments

Flow:

  • call getGastProductions() on a parser instance.
  • build a walker to print out the Gast (grammar representation) as an EBNF text.
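A minimal sketch of this flow in TypeScript. The node shapes (Node, Rule) are simplified, hypothetical stand-ins for the productions that getGastProductions() returns, not Chevrotain's actual classes, and the W3C-style output notation is an assumption.

```typescript
// Simplified, hypothetical grammar-tree shapes; Chevrotain's real GAST classes differ.
type Node =
  | { kind: "terminal"; name: string }
  | { kind: "nonTerminal"; name: string }
  | { kind: "sequence"; items: Node[] }
  | { kind: "alternation"; alternatives: Node[] }
  | { kind: "option"; body: Node }
  | { kind: "many"; body: Node }
  | { kind: "atLeastOne"; body: Node };

interface Rule {
  name: string;
  body: Node;
}

// Wrap composite nodes in parentheses so a postfix operator applies to the whole group.
function group(node: Node): string {
  const text = printNode(node);
  return node.kind === "sequence" || node.kind === "alternation" ? `( ${text} )` : text;
}

// Recursively print one node of the tree as EBNF text.
function printNode(node: Node): string {
  switch (node.kind) {
    case "terminal":
    case "nonTerminal":
      return node.name;
    case "sequence":
      // An alternation nested in a sequence needs parentheses to keep its scope.
      return node.items
        .map(item => (item.kind === "alternation" ? group(item) : printNode(item)))
        .join(" ");
    case "alternation":
      return node.alternatives.map(printNode).join(" | ");
    case "option":
      return `${group(node.body)}?`;
    case "many":
      return `${group(node.body)}*`;
    case "atLeastOne":
      return `${group(node.body)}+`;
  }
}

// One "name ::= body" line per rule.
function toEbnf(rules: Rule[]): string {
  return rules.map(rule => `${rule.name} ::= ${printNode(rule.body)}`).join("\n");
}

// Two rules of a CSV-like grammar, written by hand for illustration.
const csvRules: Rule[] = [
  {
    name: "csvFile",
    body: {
      kind: "sequence",
      items: [
        { kind: "nonTerminal", name: "hdr" },
        { kind: "atLeastOne", body: { kind: "nonTerminal", name: "row" } },
      ],
    },
  },
  {
    name: "row",
    body: {
      kind: "sequence",
      items: [
        { kind: "nonTerminal", name: "field" },
        {
          kind: "many",
          body: {
            kind: "sequence",
            items: [
              { kind: "terminal", name: "Comma" },
              { kind: "nonTerminal", name: "field" },
            ],
          },
        },
        { kind: "terminal", name: "NewLine" },
      ],
    },
  },
];

const ebnfText = toEbnf(csvRules);
// csvFile ::= hdr row+
// row ::= field ( Comma field )* NewLine
```

A real walker would map Chevrotain's own production types onto the same cases; the recursion and the parenthesization rule stay the same.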

Advanced features:

  • Generate HTML with advanced features:
    • anchors (links)
    • find usages.
    • find definitions.
    • ?

bd82 avatar Oct 26 '16 18:10 bd82

I wouldn't mind taking a shot at this feature, if it's still up for grabs. It'd be useful.

jmrog avatar Apr 04 '19 01:04 jmrog

This is up for grabs. Generating a simple EBNF text should be straightforward, as the grammar structure is already well defined. Making it pretty with advanced features may be more complicated; I am not sure how much, as I was never good at HTML/CSS stuff 😄

bd82 avatar Apr 04 '19 10:04 bd82

For the "find usages" and "find definitions" features, what are you envisioning?

jmrog avatar Apr 05 '19 01:04 jmrog

Have a look at the playground:

  • https://sap.github.io/chevrotain/playground/

Hover over the syntax diagrams and see what lights up. Also try clicking on terminals/non-terminals to see that it navigates to the code.

Of course the EBNF output would not be linked to the source code. But it could be possible to scroll down/up to a non-terminal definition, or highlight usages of a production which is hovered on.
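One way to back the "find usages" and "find definitions" ideas is an index built from the grammar before any HTML is rendered. The sketch below assumes a hypothetical, flattened rule shape (RuleRefs) in which each rule already lists the non-terminals it references; the anchor id scheme and helper names are likewise assumptions.

```typescript
// Hypothetical flattened rule shape: the rule's name plus the non-terminals its body references.
interface RuleRefs {
  name: string;
  references: string[];
}

// Map each non-terminal to the names of the rules that use it ("find usages").
function buildUsageIndex(rules: RuleRefs[]): Map<string, string[]> {
  const usages = new Map<string, string[]>();
  // Register every defined rule first, so unused rules get an empty list.
  for (const rule of rules) {
    if (!usages.has(rule.name)) usages.set(rule.name, []);
  }
  for (const rule of rules) {
    for (const ref of rule.references) {
      const users = usages.get(ref) ?? [];
      if (users.indexOf(rule.name) === -1) users.push(rule.name);
      usages.set(ref, users);
    }
  }
  return usages;
}

// Anchor id for a rule definition, so a reference can link to it ("find definition").
function definitionAnchor(ruleName: string): string {
  return `rule-${ruleName}`;
}

// Render a reference to a non-terminal as a link to its definition.
function renderReference(ruleName: string): string {
  return `<a href="#${definitionAnchor(ruleName)}">${ruleName}</a>`;
}

const csvRuleRefs: RuleRefs[] = [
  { name: "csvFile", references: ["hdr", "row"] },
  { name: "hdr", references: ["row"] },
  { name: "row", references: ["field"] },
  { name: "field", references: [] },
];

const usageIndex = buildUsageIndex(csvRuleRefs);
// usageIndex.get("row") lists "csvFile" and "hdr"; usageIndex.get("csvFile") is empty.
```

Highlighting on hover could then look up the hovered name in the index and mark every listed rule.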

But these are the advanced features of generating the EBNF in HTML. The most basic feature would be to generate a plain text file that could be included in some language's documentation...

bd82 avatar Apr 05 '19 09:04 bd82

@bd82 Sorry, I unfortunately haven't had a whole lot of time to work on this lately. I did finally get to do an initial pass through today, though; you can see the work in progress at the above commit.

Currently, running the example CSV grammar through it results in:

csvFile ::= hdr row+
hdr ::= row
row ::= field ( Comma field )* NewLine
field ::= ( Text | String | EMPTY_ALT )
Comma ::= ,
NewLine ::= \r?\n
Text ::= [^,\n\r"]+
String ::= "(?:""|[^"])*"

Things could be cleaned up a bit at various places, but this seems like a reasonable(-ish) first step. I'll keep working as time allows. Any thoughts as it currently stands?

jmrog avatar Apr 11 '19 22:04 jmrog

This looks good.

  • We can go for MVP as the first step.
  • I am not sure how we should represent the terminals, e.g: "(?:""|[^"])" vs /(?:""|[^"])/
  • Maybe we can use regExpToAst library to identify simple vs complex regExps and display those differently.
  • Maybe a newline between each terminal rule?
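One possible way to tell simple from complex patterns, sketched here without the regExpToAst library: treat a RegExp as "simple" when it has no flags and its source contains no regular-expression metacharacters, so it can only match one literal string. This check is a swapped-in heuristic for illustration (it conservatively treats any escaped character as complex), and the helper names are hypothetical; an implementation based on the parsed AST could be more precise.

```typescript
// Characters with special meaning in a RegExp source; a backslash also marks the pattern as complex.
const META_CHARS = /[\\^$.*+?()[\]{}|\/]/;

// True when the pattern can only match its own source text.
function isSimpleLiteral(pattern: RegExp): boolean {
  return pattern.flags === "" && !META_CHARS.test(pattern.source);
}

// Simple patterns are shown as quoted strings, complex ones as full RegExp literals.
// (Quote characters inside a simple pattern are not escaped in this sketch.)
function displayPattern(pattern: RegExp): string {
  return isSimpleLiteral(pattern) ? `'${pattern.source}'` : pattern.toString();
}

const commaText = displayPattern(/,/);          // shown as a quoted string
const newLineText = displayPattern(/\r?\n/);    // shown as a RegExp literal
const textText = displayPattern(/[^,\n\r"]+/);  // shown as a RegExp literal
```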

bd82 avatar Apr 12 '19 00:04 bd82

  • We can go for MVP as the first step.

Cool. That seems to me to be the right way to go, too.

  • I am not sure how we should represent the terminals, e.g: "(?:""|[^"])" vs /(?:""|[^"])/

Me either! I kept switching back and forth between those two formats and couldn't really decide between them. Ultimately, I looked around at some other EBNFs and found at least a couple where RegExp patterns are seemingly presented "raw" (without the / delimiters). For example:

  • https://www.w3.org/TR/sparql11-query/#rEXPONENT
  • https://www.w3.org/TR/2008/REC-xml-20081126/#NT-EncName

So, I ultimately left them that way. I think it just looks particularly weird in the CSV example case, especially because, for example, the Comma terminal is defined with a RegExp pattern rather than with a string (so, you get just ,). Typically, I think, that terminal would be defined as a string, in which case the EBNF walker would output:

Comma ::= ','

which is at least a little less weird looking.

  • Maybe we can use regExpToAst library to identify simple vs complex regExps and display those differently.

Good idea; I'll look into this.

  • Maybe a newline between each terminal rule?

👍

jmrog avatar Apr 12 '19 13:04 jmrog

What if we have a regExp:

/foo|bar/

and we also have NonTerminals called "foo" and "bar". Won't it create ambiguous EBNF?

// Is this a Non-Terminal with two alternatives?
// Or a Terminal which can match the literal string "foo" or the literal string "bar"?
TerminalX ::= foo|bar

bd82 avatar Apr 12 '19 23:04 bd82

ECMAScript regExps also have constructs that do not exist in EBNF or, worse yet, may mean different things in EBNF.

E.g., in EBNF, B{2} is a 'B' char followed by the digit 2 repeated zero or more times:

  • B2222
  • B2
  • B22222222222
  • B (0 repetitions)

In ECMAScript regexps, however, the same expression B{2} means 'B' repeated exactly twice:

  • BB
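The ECMAScript half of this comparison is easy to check directly. In the sketch below, ebnfReading is an assumed regexp translation of the EBNF interpretation described above ('B' followed by zero or more 2s), added only for contrast.

```typescript
const ecmaScriptReading = /^B{2}$/; // 'B' repeated exactly twice
const ebnfReading = /^B2*$/;        // assumed translation: 'B' followed by zero or more '2'

const samples = ["BB", "B", "B2", "B2222"];

// For each sample, record which of the two readings accepts it.
const readings = samples.map(input => ({
  input,
  ecmaScript: ecmaScriptReading.test(input),
  ebnf: ebnfReading.test(input),
}));
// Only "BB" is accepted by the ECMAScript reading;
// "B", "B2" and "B2222" are accepted only by the EBNF reading.
```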

bd82 avatar Apr 12 '19 23:04 bd82

Won't it create ambiguous EBNF?

Yes, you're right. That's enough to convince me. I also realized that there are other problematic cases, such as when the RegExp pattern is case-insensitive. Rather than show just the source (e.g., 'ab') in that case -- or, worse, try to reconstruct an equivalent representation (e.g., '[aA][bB]') -- it would be best to show the full pattern with flags (e.g., '/ab/i'). So, delimiters it is!
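The rule settled on here can be sketched as follows: string patterns are quoted, while RegExp patterns keep their / delimiters and flags. formatPattern and the Keyword terminal are hypothetical names used for illustration, not part of Chevrotain.

```typescript
// Quote string patterns; print RegExp patterns as full literals, including flags.
function formatPattern(pattern: string | RegExp): string {
  if (typeof pattern === "string") {
    return `'${pattern}'`;
  }
  // RegExp.prototype.toString() yields "/source/flags", e.g. "/ab/i".
  return pattern.toString();
}

const terminalXLine = `TerminalX ::= ${formatPattern(/foo|bar/)}`;
const keywordLine = `Keyword ::= ${formatPattern(/ab/i)}`; // hypothetical terminal
const commaLine = `Comma ::= ${formatPattern(",")}`;
// TerminalX ::= /foo|bar/
// Keyword ::= /ab/i
// Comma ::= ','
```

With the delimiters in place, /foo|bar/ can no longer be mistaken for an alternation between non-terminals named foo and bar.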

in EBNF, B{2} is a 'B' char followed by the digit 2 repeated zero or more times

Note that I have not been using this notation in my construction of the EBNF, though. I have instead been using the W3C's EBNF notation. This is partly for the reasons given here, but also because I suspect that the W3C style is likely to be more familiar (given its similarity to standard RegExp) and because it is the style that I know best. Would you prefer that I switch to the ISO-14977 style, or maybe add an argument that allows the user to specify the preferred style?

jmrog avatar Apr 13 '19 00:04 jmrog

Another question: what do we do with tokens that have custom token patterns?

jmrog avatar Apr 13 '19 03:04 jmrog

Would you prefer that I switch to the ISO-14977 style?

No, I think the W3C style is easier to understand as well.

maybe add an argument that allows the user to specify the preferred style?

Only if you have time, probably outside MVP scope.

Another question: what do we do with tokens that have custom token patterns?

We can't really define those. I guess we should just add a comment to their definition saying "Custom Token Pattern" and document this behavior in the guide/examples.

Note that when generating diagrams we use the Token Labels when possible. We should probably do so when generating the EBNF too.

  • https://sap.github.io/chevrotain/documentation/4_3_3/interfaces/itokenconfig.html#label
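A sketch of how these two suggestions might combine. The TokenInfo shape is a simplified assumption modelled on the token config (name, pattern, label), not Chevrotain's actual TokenType; the helper names, the choice to quote labels, and the exact comment wording are illustrative only.

```typescript
// Simplified, hypothetical token description; a function pattern stands for a custom token pattern.
interface TokenInfo {
  name: string;
  label?: string;
  pattern: string | RegExp | ((text: string, offset: number) => unknown);
}

// How a terminal appears inside a rule body: the label when one exists, else the token name.
function referenceText(token: TokenInfo): string {
  return token.label !== undefined ? `'${token.label}'` : token.name;
}

// The terminal's own definition line.
function printTerminalDefinition(token: TokenInfo): string {
  const pattern = token.pattern;
  if (typeof pattern === "string") return `${token.name} ::= '${pattern}'`;
  if (pattern instanceof RegExp) return `${token.name} ::= ${pattern.toString()}`;
  // Custom patterns cannot be expressed in EBNF, so emit a W3C-style comment instead.
  return `${token.name} ::= /* Custom Token Pattern */`;
}

const sampleTokens: TokenInfo[] = [
  { name: "LCurly", label: "{", pattern: "{" },
  { name: "Text", pattern: /[^,\n\r"]+/ },
  { name: "Indent", pattern: () => null }, // hypothetical custom token pattern
];

const definitionLines = sampleTokens.map(printTerminalDefinition);
const referenceTexts = sampleTokens.map(referenceText);
```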

bd82 avatar Apr 13 '19 10:04 bd82