chevrotain
chevrotain copied to clipboard
Fuzzy Parsing Example
See:
- https://github.com/antlr/antlr4/blob/master/doc/wildcard.md#nongreedy-parser-subrules
This could probably be done using Token Categories to match against "all" kinds of tokens combined with an alternation which includes the one option we care about.
e.g:
$.RULE("fuzzyConstantFinder", () => {
$.OR([
// We may need a GATE here, e.g using backtracking to ensure we never incorrectly enter
// the constant alternative.
{ALT: () => $.SUBRULE($.constant)}, //
{ALT: () => $.CONSUME(AnyToken)}, // AnyToken should be defined using Token Categories.
]);
});
$.RULE("constant", () => {
$.CONSUME(Public);
$.CONSUME(Static);
$.CONSUME(Final);
$.CONSUME(Int);
$.CONSUME(Ident);
});
The documentation for Token Categories is rather sparse. What's a practical application?
Basically you can specify that multiple tokens are of the category X. and than match against said X in the Parser. If you think of the CONSUME(X) method as:
- Eat next token iff next token is instanceof X.
Then token categories allow you to define multiple "inheritance" between tokens.
Example:
- Definition of BinaryOperator "Abstract" Token.
- LessEquals Token defined to match the BinaryOperator category
- BinaryOperator Token Category used to concisely match against multiple kinds of Tokens in the Parser
So basically this can be expended to do a "MatchALL" Token which could be used as part of a fuzzy parsing solution.
A good example of fuzzy parsing would be able to return a list of commands and whatever text is between commands for further parsing.
machine:chevrotain user$ ls
CONTRIBUTING.md greenkeeper.json readme.md
LICENSE.txt lerna.json tslint.json
NOTICE.txt package.json yarn.lock
examples packages
machine:chevrotain user$ cat NOTICE.txt
Copyright (c) 2015-2019 SAP SE or an SAP affiliate company.
machine:chevrotain user$
I am not sure this can even be represented as a context free grammar. Can you assume some delimiters between commands? For example the "$" sign never being present in the command output? or perhaps the "machine:chevrotai user$" prefix appearing before every command?
This task is accomplished using regexp scanners. IMHO a good use of Chevrotain is the construction and management of domain specific tokenizers at run time. Scanning the resulting tokens for re-tokenizing and possibly complete parsing. My goal is to learn tokens at run time, and restart the process. Fuzzy parsing is a means of implementing a learning parsing system.
IMHO a good use of Chevrotain is the construction and management of domain specific tokenizers at run time.
Interesting, you could effectively use some heuristics to identify the delimiter machine:chevrotain user$
and then dynamically build a lexer that would be able to fuzzy scan those commands.
In your case the "grammar" itself seems trivial so I am not sure there would be any need for a Chevrotain Parser part.
BTW you can dynamically create Chevrotain Parsers as well as Lexers using the custom APIs feature
Granted this has limitations:
- https://sap.github.io/chevrotain/docs/guide/custom_apis.html#limitations
And it also requires the use of EVAL (not supported when there is a context security policy, e.g many websites).
But could still be interesting...
I've started playing around with a similar scenario which required consuming any kind of tokens between a "--" and a semiColon.
Basically the CSS 3 custom property syntax:
- https://www.w3.org/TR/css-variables-1/#custom-property
You can inspect the current state of the example here:
- https://github.com/SAP/chevrotain/commit/4a7eb7923a065832c0363e354723af2a33bad0e1
Consider expanding the fuzzy parsing example to a scenario in which the fuzzy matching acts as a default fall back.
e.g: 3 Alternatives, and the 3rd one being the fuzzy one which could conflict with the first 2.