Allow regexp definitions directly in BNF grammar
Currently it's possible to use raw (wrapped in quotes) literal tokens in BNF.
%%
E : N;
N : D N | /* empty */ ;
D : "1" | "2" | "3" ... ;
We should also allow embedding regexp rules from the lexical grammar directly in the BNF.
%%
E : N;
N : D N | /* empty */ ;
D : [0-9]+ ;
Just simple character classes, or full-blown regexes? The latter would have very strange behaviour, because you can end up with overlapping regex matches in the CFG...
Good point, this might collide with the BNF parser itself.
We can probably start with simple use cases, like character classes. But the basic idea is the following (same as for simple literal tokens in BNF):
The way it's done now for simple terminal tokens is that a lex rule is automatically created for each of them. E.g. when you have:
D : "1" | "2" | "3";
The Grammar class understands that these are terminal tokens, and creates the following lexical rules for them:
rules: [
[`1`, `return "1"`],
[`2`, `return "2"`],
[`3`, `return "3"`],
]
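The mapping from quoted literals to lex rules can be sketched roughly as follows. This is only an illustration, not the actual Grammar class code: the function names `escapeRegExp` and `lexRulesForLiterals`, and the `{lhs, rhs}` production shape, are assumptions made for this example.

```javascript
// Hypothetical helper: escape regexp metacharacters so that a literal
// such as "+" is matched verbatim by the generated lex rule.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical sketch: collect every quoted symbol on the RHS of the
// productions and emit one [pattern, handler] lex rule per distinct literal.
function lexRulesForLiterals(productions) {
  const seen = new Set();
  const rules = [];
  for (const {rhs} of productions) {
    for (const symbol of rhs) {
      const match = /^"(.+)"$/.exec(symbol);
      if (!match || seen.has(symbol)) continue;
      seen.add(symbol);
      // The token type is the literal itself, quotes included.
      rules.push([escapeRegExp(match[1]), `return ${symbol}`]);
    }
  }
  return rules;
}

const literalRules = lexRulesForLiterals([
  {lhs: 'D', rhs: ['"1"']},
  {lhs: 'D', rhs: ['"2"']},
  {lhs: 'D', rhs: ['"3"']},
]);
console.log(literalRules);
```

Escaping the literal matters once literals such as "+" or "(" appear, since the generated pattern is later interpreted as a regexp.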
I.e. the token type corresponds to the matched input, so full-blown regexes are likely out (though full regexes might still work if the grammar is defined in JSON-like notation rather than Yacc/Bison notation, since then we don't need to parse it).
Potentially, having [0-9]+ on RHS of a production shouldn't differ much.
D: [0-9]+;
The Grammar class will understand that it's not a non-terminal, so it must be either a "literal" (in quotes) or a token, and will create a lex rule for it:
rules: [
[`[0-9]+`, `return "_AUTO_LEX_RULE_TOKEN_1"`],
]
(Note: in this case it'll be recognized as a token, not as a literal, since it's not in quotes. I.e. it'll be under getTokens(), not in getTerminals()).
Thus, in the BNF grammar the entry can be replaced with:
D: _AUTO_LEX_RULE_TOKEN_1;
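A minimal sketch of that rewrite step, restricted to simple character classes as discussed above. The function name `extractInlineCharClasses`, the `{lhs, rhs}` production shape, and the detection regexp are assumptions for illustration; the real Grammar class may organize this differently.

```javascript
// Hypothetical sketch: replace character-class symbols on the RHS with
// auto-generated token names and emit the matching lex rules.
function extractInlineCharClasses(productions) {
  const tokenByPattern = new Map();
  const lexRules = [];

  // Only a single character class with an optional quantifier is
  // recognized, e.g. [0-9]+ or [a-z]; anything else is left untouched.
  const isCharClass = symbol => /^\[[^\]]+\][*+?]?$/.test(symbol);

  const rewritten = productions.map(({lhs, rhs}) => ({
    lhs,
    rhs: rhs.map(symbol => {
      if (!isCharClass(symbol)) return symbol;
      if (!tokenByPattern.has(symbol)) {
        // Reuse one token per distinct pattern.
        const token = `_AUTO_LEX_RULE_TOKEN_${tokenByPattern.size + 1}`;
        tokenByPattern.set(symbol, token);
        lexRules.push([symbol, `return "${token}"`]);
      }
      return tokenByPattern.get(symbol);
    }),
  }));

  return {productions: rewritten, lexRules};
}

const result = extractInlineCharClasses([
  {lhs: 'E', rhs: ['N']},
  {lhs: 'N', rhs: ['D', 'N']},
  {lhs: 'N', rhs: []},
  {lhs: 'D', rhs: ['[0-9]+']},
]);
console.log(result.lexRules);
console.log(result.productions[3]);
```

Keying the generated tokens by pattern means the same character class used in several productions maps to a single lex rule, which avoids creating duplicate rules that would shadow each other.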
This is just a convenient "nice-to-have" feature, so that simple cases don't need an explicit lexical grammar. In practice, though, this can simply be solved with a lexical grammar rule.