
Allow regexp definitions directly in BNF grammar

Open DmitrySoshnikov opened this issue 8 years ago • 2 comments

Currently it's possible to use raw (wrapped in quotes) literal tokens in BNF.

%%

E : N;

N : D N | /* empty */ ;

D : "1" | "2" | "3" ... ;

We should also allow embedding regexp rules from the lexical grammar directly in the BNF:

%%

E : N;

N : D N | /* empty */ ;

D : [0-9]+ ;

DmitrySoshnikov, Jan 28 '17 10:01

Just simple character classes, or full-blown regexes? The latter would have very strange behaviour, because you can end up with overlapping regex matches in the CFG...
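To make the overlap concern concrete, here is a minimal sketch. The two patterns and the token names `DEC` and `HEX` are made up for illustration and are not part of the tool. It shows that a single input can be matched by more than one inline pattern, so the token type would depend on rule order rather than on the grammar:

```javascript
// Two hypothetical inline patterns whose languages overlap.
const overlappingRules = [
  [/^[0-9]+/, 'DEC'],
  [/^[0-9a-f]+/, 'HEX'],
];

// Returns every token type whose pattern matches a prefix of the input.
function candidateTokens(input) {
  return overlappingRules
    .filter(([pattern]) => pattern.test(input))
    .map(([, type]) => type);
}

console.log(candidateTokens('12')); // DEC and HEX both match
console.log(candidateTokens('ff')); // only HEX matches
```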

tjvr, Feb 28 '17 17:02

Good point, this might collide with the BNF parser itself.

We can probably start with simple use cases, like character classes. The basic idea is the following (the same as for simple literal tokens in BNF):

For simple terminal tokens, a lex rule is currently created automatically for each of them. E.g. when you have:

D : "1" | "2" | "3";

The Grammar class understands that these are terminal tokens, and creates the following lexical rules for them:

rules: [
  [`1`, `return "1"`],
  [`2`, `return "2"`],
  [`3`, `return "3"`],
]
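As a rough sketch of that step: the function name `lexRulesForLiterals` and the object-of-arrays grammar shape below are assumptions for illustration, not the actual Grammar class API. It collects quoted literals from the right-hand sides and emits one lex rule per distinct literal:

```javascript
// Hypothetical helper, not the real Grammar class: `productions` maps each
// LHS to a list of alternatives, each alternative being a list of symbols.
function lexRulesForLiterals(productions) {
  const seen = new Set();
  const rules = [];
  for (const alternatives of Object.values(productions)) {
    for (const alternative of alternatives) {
      for (const symbol of alternative) {
        // Only symbols wrapped in double quotes are treated as literals.
        const match = /^"(.*)"$/.exec(symbol);
        if (match === null || seen.has(symbol)) continue;
        seen.add(symbol);
        // Escape regexp metacharacters so that e.g. "+" matches literally.
        const pattern = match[1].replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
        rules.push([pattern, `return ${JSON.stringify(match[1])}`]);
      }
    }
  }
  return rules;
}

const literalRules = lexRulesForLiterals({
  E: [['N']],
  N: [['D', 'N'], []],
  D: [['"1"'], ['"2"'], ['"3"']],
});
console.log(literalRules);
```

For the grammar above this yields the same three rules as listed, with the token type equal to the matched text.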

I.e. the token type corresponds to the matched input, so likely not full-blown regexes (though full regexes might even work if the grammar is defined in JSON-like notation rather than Yacc/Bison notation, since then we don't need to parse it).

Potentially, having [0-9]+ on the RHS of a production shouldn't differ much:

D: [0-9]+;

The Grammar class will understand that it's not a non-terminal, so it must be either a "literal" (in quotes) or a token, and will create a lex rule for it:

rules: [
  [`[0-9]+`, `return "_AUTO_LEX_RULE_TOKEN_1"`],
]

(Note, in this case it'll be recognized as a token, not as a literal, since it's not in quotes. I.e. it'll be under getTokens(), not in getTerminals()).

Thus in the BNF grammar the entry can be replaced with:

D: _AUTO_LEX_RULE_TOKEN_1;
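A minimal sketch of that rewrite, under some assumptions: the function name `liftInlineCharClasses` and the grammar shape are hypothetical, and only a single character class with an optional quantifier is recognized, in line with starting from simple use cases:

```javascript
// Hypothetical sketch: symbols that look like a character class are lifted
// into generated lex rules and replaced by the generated token name.
function liftInlineCharClasses(productions) {
  const tokenNames = new Map(); // pattern -> generated token name
  const rules = [];
  const rewritten = {};
  for (const [lhs, alternatives] of Object.entries(productions)) {
    rewritten[lhs] = alternatives.map(alternative =>
      alternative.map(symbol => {
        // Assumption: accept only `[...]` followed by an optional *, + or ?.
        if (!/^\[[^\]]+\][*+?]?$/.test(symbol)) return symbol;
        if (!tokenNames.has(symbol)) {
          const name = `_AUTO_LEX_RULE_TOKEN_${tokenNames.size + 1}`;
          tokenNames.set(symbol, name);
          rules.push([symbol, `return "${name}"`]);
        }
        // The same pattern used twice reuses the same generated token.
        return tokenNames.get(symbol);
      })
    );
  }
  return { rules, productions: rewritten };
}

const lifted = liftInlineCharClasses({
  E: [['N']],
  N: [['D', 'N'], []],
  D: [['[0-9]+']],
});
console.log(lifted.rules);
console.log(lifted.productions.D);
```

Keying the generated names by pattern means a character class repeated in several productions still produces a single lex rule.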

This is just a convenient "nice-to-have" feature, so that simple cases don't need an explicit lexical grammar. In practice, though, this can be solved with just a lexical grammar rule.

DmitrySoshnikov, Feb 28 '17 21:02