chevrotain icon indicating copy to clipboard operation
chevrotain copied to clipboard

Fuzzy Parsing Example

Open bd82 opened this issue 4 years ago • 9 comments

See:

  • https://github.com/antlr/antlr4/blob/master/doc/wildcard.md#nongreedy-parser-subrules

This could probably be done using Token Categories to match against "all" kinds of tokens combined with an alternation which includes the one option we care about.

e.g:

      $.RULE("fuzzyConstantFinder", () => {
        $.OR([
          // We may need a GATE here, e.g using backtracking to ensure we never incorrectly enter
         // the constant alternative. 
          {ALT: () => $.SUBRULE($.constant)}, //
          {ALT: () => $.CONSUME(AnyToken)}, // AnyToken should be defined using Token Categories.
        ]);
      });

      $.RULE("constant", () => {
           $.CONSUME(Public);
           $.CONSUME(Static);
           $.CONSUME(Final);
           $.CONSUME(Int);
           $.CONSUME(Ident);
      });

bd82 avatar Aug 07 '19 21:08 bd82

The documentation for Token Categories is rather sparse. What's a practical application?

matthew-dean avatar Aug 08 '19 01:08 matthew-dean

Basically you can specify that multiple tokens are of the category X. and than match against said X in the Parser. If you think of the CONSUME(X) method as:

  • Eat next token iff next token is instanceof X.

Then token categories allow you to define multiple "inheritance" between tokens.

Example:

So basically this can be expended to do a "MatchALL" Token which could be used as part of a fuzzy parsing solution.

bd82 avatar Aug 08 '19 08:08 bd82

A good example of fuzzy parsing would be able to return a list of commands and whatever text is between commands for further parsing.

machine:chevrotain user$ ls
CONTRIBUTING.md		greenkeeper.json	readme.md
LICENSE.txt		lerna.json		tslint.json
NOTICE.txt		package.json		yarn.lock
examples		packages
machine:chevrotain user$ cat NOTICE.txt 
Copyright (c) 2015-2019 SAP SE or an SAP affiliate company.
machine:chevrotain user$ 

Sciumo avatar Aug 08 '19 17:08 Sciumo

I am not sure this can even be represented as a context free grammar. Can you assume some delimiters between commands? For example the "$" sign never being present in the command output? or perhaps the "machine:chevrotai user$" prefix appearing before every command?

bd82 avatar Aug 08 '19 17:08 bd82

This task is accomplished using regexp scanners. IMHO a good use of Chevrotain is the construction and management of domain specific tokenizers at run time. Scanning the resulting tokens for re-tokenizing and possibly complete parsing. My goal is to learn tokens at run time, and restart the process. Fuzzy parsing is a means of implementing a learning parsing system.

Sciumo avatar Aug 08 '19 17:08 Sciumo

IMHO a good use of Chevrotain is the construction and management of domain specific tokenizers at run time.

Interesting, you could effectively use some heuristics to identify the delimiter machine:chevrotain user$ and then dynamically build a lexer that would be able to fuzzy scan those commands.

In your case the "grammar" itself seems trivial so I am not sure there would be any need for a Chevrotain Parser part.

bd82 avatar Aug 08 '19 19:08 bd82

BTW you can dynamically create Chevrotain Parsers as well as Lexers using the custom APIs feature

Granted this has limitations:

  • https://sap.github.io/chevrotain/docs/guide/custom_apis.html#limitations

And it also requires the use of EVAL (not supported when there is a context security policy, e.g many websites).

But could still be interesting...

bd82 avatar Aug 08 '19 20:08 bd82

I've started playing around with a similar scenario which required consuming any kind of tokens between a "--" and a semiColon.

Basically the CSS 3 custom property syntax:

  • https://www.w3.org/TR/css-variables-1/#custom-property

You can inspect the current state of the example here:

  • https://github.com/SAP/chevrotain/commit/4a7eb7923a065832c0363e354723af2a33bad0e1

bd82 avatar Aug 10 '19 09:08 bd82

Consider expanding the fuzzy parsing example to a scenario in which the fuzzy matching acts as a default fall back.

e.g: 3 Alternatives, and the 3rd one being the fuzzy one which could conflict with the first 2.

bd82 avatar Sep 05 '19 19:09 bd82