
feature request: ability to save Regex capture groups for tokens

Open christianvoigt opened this issue 7 years ago • 23 comments

Update 14/06/2019

@bd82 wrote:

Capturing groups support is now possible via the APIs for custom payloads on custom token patterns.

  • https://sap.github.io/chevrotain/docs/guide/custom_token_patterns.html#custom-payloads

An ease of use upgrade could provide utilities such as:

  • createTokenWithCaptureGroups
  • createTokenWithNamedCaptureGroups
  • e.g. https://github.com/SAP/chevrotain/issues/888#issuecomment-457698103

This is up for grabs, PR welcome 😄
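For readers landing here from search: a minimal sketch of how the custom-payload mechanism from the guide linked above can expose capture groups today. The helper name `captureGroupPattern` is hypothetical (not part of chevrotain's API); the `payload` property on the returned match array is the mechanism described in the custom-payloads guide.

```javascript
// Sketch only: "captureGroupPattern" is a hypothetical helper, not a
// chevrotain API. It builds a custom-pattern exec function; per the
// custom-payloads guide, a `payload` property on the returned match
// array is copied onto the emitted token as token.payload.
function captureGroupPattern(regex) {
  // Sticky flag so the pattern only matches at the requested offset.
  const sticky = new RegExp(regex.source, "y");
  return function exec(text, offset) {
    sticky.lastIndex = offset;
    const result = sticky.exec(text);
    if (result !== null) {
      // Stash the capture groups (everything after the full match).
      result.payload = result.slice(1);
    }
    return result;
  };
}

// Standalone usage, e.g. for a markdown-style link token:
const linkExec = captureGroupPattern(/\[([^\]]*)\]\(([^)]*)\)/);
const match = linkExec("see [docs](https://example.com) here", 4);
// match.payload → ["docs", "https://example.com"]
```

Wired into chevrotain, the function would be passed as `pattern: { exec: ... }` on a token type, with the groups then available on `token.payload` after lexing.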


Original Issue

@christianvoigt wrote: (original message) In some cases it might be useful to save the whole match result with all capture groups. For example in my case this would make it possible to lex tokens for whole links or tags without having to define parser rules for them. I could simply capture the url/tag text and later retrieve it directly from the token, without having to execute another regular expression.

Maybe this could be achieved by manually emitting a token and adding the match result array as a token property. Or this could simply be a token setting, like "saveCaptureGroups:true".

christianvoigt avatar Mar 21 '17 11:03 christianvoigt

I'm a little hesitant to add more capabilities to the Lexer. I'm actually trying to simplify it now by reducing the number of supported Token types, as its maintenance has become too complex and error-prone.

As you suggested, it seems like this capability should be implemented by providing an "official" API for emitting tokens from custom matchers.

This could simply be a token setting, like "saveCaptureGroups:true".

This approach would mean adding another "if" statement to the "hottest" path of the lexer's code, which would affect the 99.9% of users who do not use this feature. I could duplicate the code instead, but that kind of duplication (which already exists in the lexer for performance reasons) is what caused a sneaky 20% performance degradation to hit 0.25.0 (fixed in 0.25.1).

bd82 avatar Mar 21 '17 11:03 bd82

I understand. So, it would be great if the official API for emitting tokens would allow adding extra data to the tokens. Do you think this approach could also work with SimpleLazyTokens?

christianvoigt avatar Mar 21 '17 12:03 christianvoigt

Do you think this approach could also work with SimpleLazyTokens?

These tokens are just plain JavaScript objects, so you can add any property to them whenever and wherever you like.

  • The problem is how to "explain" to the lexer that a custom Matcher emitted a Token?

    • Currently the Lexer expects a custom matcher to return null or an array with one string item.
    • So should we allow the custom matcher to also return a token Object?
      • That would incur a performance penalty, but this path will (hopefully) only be executed for custom tokens, not for all users and all Tokens, so that's fine.
  • The other question with token emitting is how to support emitting multiple "virtual" tokens as those needed in the indentation example?

    • Do we return an array of Tokens?
      • But the expected return is already an array so that is confusing.

So the real problem is designing a consistent API for token emitting: it must support emitting both single and multiple tokens, and both real tokens (which appear in the text) and virtual ones (like Outdent).

bd82 avatar Mar 21 '17 12:03 bd82

Why return an array that always contains one string item? Could this not be changed, so that the custom matcher directly returns the string?

In that case the custom matcher could also be allowed to return an object for a single emitted token or an array for several emitted tokens.

christianvoigt avatar Mar 21 '17 13:03 christianvoigt

Why return an array that always contains one string item?

To stay compatible with the RegExp.exec API, because the result of a custom match and a regular RegExp match are treated the same. But if we want better emit support, this will probably have to change.

See line 800 here: https://github.com/SAP/chevrotain/blob/52f3f1a4a7214b2b75c2ca95b58a1ec28ccbcfbd/src/scan/lexer_public.ts#L800

bd82 avatar Mar 21 '17 14:03 bd82

In that case the custom matcher could also be allowed to return an object for a single emitted token or an array for several emitted tokens.

That's a good point. Maybe a custom matcher should always take full responsibility for creating the token, and not only return the "string" part of it. That way we would have a consistent API. To enable this, the full location information must be passed to the custom matcher.

The problem with this is that we currently have different ways to create different token types. I will hopefully succeed in reducing this complexity, but perhaps part of the API of a custom pattern should be a method for creating a token:

function myCustomMatcher(text, startOffset, matchedTokens, matchedGroups, createToken) {
   // ... custom matching logic producing one or more tokens ...
   var newToken0 = createToken(Whitespace)
   var newToken1 = createToken(Outdent)

   return [newToken0, newToken1]
}

bd82 avatar Mar 21 '17 14:03 bd82

@christianvoigt I think your original problem can be solved in user space. Note that you can extend RegExp and define Symbol.match on its prototype to override matching behavior.

class MatchSaver extends RegExp {
	lastGroups;
	[Symbol.match](str) {
		const result = RegExp.prototype[Symbol.match].call(this, str);
		if (result) {
			const [, ...groups] = result;
			this.lastGroups = groups;
		}
		return result;
	}
}

Here's an example to try out in ES:

const dates = /(\d{2})\.(\d{2})\.(\d{4})/;
const saver = new MatchSaver(dates);
"01.01.1970".match(saver);

After the match call, the saver.lastGroups property will be ["01", "01", "1970"].

If you pass your regular expression wrapped in a MatchSaver to chevrotain, it should in theory all work out. 😄

ghost avatar Mar 21 '17 15:03 ghost

Note that you can extend RegExp and define Symbol.match on its prototype to override matching behavior.

I did not know this. I wonder which EcmaScript versions support it? Although it won't help in this case, because the Chevrotain Lexer will not do anything with this property: only the "0" property of the exec result is extracted by Chevrotain to create the token instance.

bd82 avatar Mar 21 '17 15:03 bd82

I wonder in what EcmaScript versions this is supported?

It was introduced back in ES2015, so by now all major engines should support it. I've used the public field initializer lastGroups in the above example, which pushes the requirement to a transpiler, but you could easily provide a constructor and declare lastGroups there. Anyway, if you need it to work in IE6, just use a transpiler and don't worry about it. 😼

There are more overridable things (e.g. Symbol.search can override RegExp.prototype.search's behavior). I don't think that exec has its own well-known symbol, but it might already be enough to override it as a normal method on the prototype of your derived class.

Aside from that: is there a major performance difference between RegExp.prototype.search, RegExp.prototype.exec and String.prototype.match? Tackling this by calling the regex differently might be the easiest fix, but I'm not sure what that means for lexing performance without measuring.

ghost avatar Mar 21 '17 16:03 ghost

https://jsperf.com/exec-vs-match-vs-test-vs-search/10
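A rough way to reproduce this comparison locally (a throwaway micro-benchmark sketch, not a substitute for the jsperf suite above; absolute numbers vary by engine, input, and warm-up):

```javascript
// Throwaway micro-benchmark: compares exec / test / match on one pattern.
// Treat the numbers as a rough signal only.
const re = /(\d{2})\.(\d{2})\.(\d{4})/;
const input = "some text with a date 01.01.1970 inside";

function time(label, fn, iterations = 1e6) {
  const start = process.hrtime.bigint();
  for (let i = 0; i < iterations; i++) fn();
  const ms = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(`${label}: ${ms.toFixed(1)} ms`);
}

time("exec ", () => re.exec(input));
time("test ", () => re.test(input));
time("match", () => input.match(re));
```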

bd82 avatar Mar 21 '17 16:03 bd82

Might not turn out as dramatic with #412, as you might have to use match there anyway.

ghost avatar Mar 21 '17 16:03 ghost

exec, which the Chevrotain lexer already uses, seems faster than match on modern Chrome. So I'm not sure there is any room for improvement here.

But I'm always happy to find more ways to increase performance as its a major feature of this library. 😄

bd82 avatar Mar 21 '17 17:03 bd82

@kdex thanks for the info, that's interesting, even though (as @bd82 pointed out) it probably won't help me.

christianvoigt avatar Mar 22 '17 16:03 christianvoigt

I'm keeping this open for now, as there is a major refactoring in progress, and I'd like to reexamine these "custom token factories" after the refactoring has finished.

bd82 avatar Mar 23 '17 11:03 bd82

The lexer refactoring has been completed, with large benefits for both performance and long-term code maintainability.

I'm afraid I don't see a good way to introduce "custom token factories" or better custom pattern support. But I would be willing to merge PRs that implement this in a good way without a performance penalty.

This is where one should start looking:

  • Custom Patterns are matched here

  • This is where Tokens "instances" are created.

bd82 avatar Mar 29 '17 21:03 bd82

Just wondering if it's possible/performant to put the match result on IToken, which would also provide full access to the capture groups.

export interface IToken {
    /** The textual representation of the Token as it appeared in the text. */
    image: string
    /** Offset of the first character of the Token. */
    startOffset: number
    /** Line of the first character of the Token. */
    startLine?: number
    /** Column of the first character of the Token. */
    startColumn?: number
    /** Offset of the last character of the Token. */
    endOffset?: number
    /** Line of the last character of the Token. */
    endLine?: number
    /** Column of the last character of the Token. */
    endColumn?: number
    /** Marks a Token that does not really exist and was inserted "artificially" during parsing for rule error recovery. */
    isInsertedInRecovery?: boolean
    /** A numeric index representing the type of the Token. Use <getTokenConstructor> to get the Token Type from a token "instance". */
    tokenTypeIdx?: number
    /**
     * The actual Token Type of this Token "instance"
     * This is the same Object returned by the "createToken" API.
     * This property is very useful for debugging the Lexing and Parsing phases.
     */
    tokenType?: TokenType

    /*********************
     * The token's RegExp exec array result.
     *********************/
    match: RegExpExecArray
}

const token = $.CONSUME(SomeToken)
const fullMatch = token.match[0]
const group1 = token.match[1]
const group2 = token.match[2]

I'm asking because it would be quite convenient for my use case, as I have to re-match token images after I consume the token and I'm unsure how to otherwise accomplish this with the API.

jabney avatar Jan 24 '19 21:01 jabney

Just wondering if it's possible/performant to put the match result on IToken, which would also provide full access to the capture groups.

It should be possible.

The IToken is created here:

  • https://github.com/SAP/chevrotain/blob/4fe717f47bf8eba8111a10a9f7856c9b12e69191/src/scan/lexer_public.ts#L566-L574

So if the exec match array is available at that time, it could be added, but naturally at the cost of memory, since holding a reference to it would keep it from being GCed.

The problems are as follows:

  • I'd like to evaluate rewriting the Lexer as a streaming lexer, but I'm too busy with other projects at this time. Maybe this feature should be evaluated as part of the new lexer, but I have absolutely no idea when I will get around to it.
  • There are optimizations in the lexer that use RegExp.test instead of RegExp.exec, which naturally won't work when the matching groups are required.
    • Perhaps the speed benefit from avoiding the duplicate matches would be lost by giving up this optimization.
  • This feature may require a breaking change in Custom Token Patterns support.
    • I am not opposed to breaking changes, just something to keep in mind.

If you are interested, you can explore implementing this. The first step would be to modify the lexer matching to always use .exec (in this.match) and return the whole exec result.

  • https://github.com/SAP/chevrotain/blob/4fe717f47bf8eba8111a10a9f7856c9b12e69191/src/scan/lexer_public.ts#L513
  • https://github.com/SAP/chevrotain/blob/4fe717f47bf8eba8111a10a9f7856c9b12e69191/src/scan/lexer_public.ts#L830-L845

Then this result should be passed to the IToken creation:

  • https://github.com/SAP/chevrotain/blob/4fe717f47bf8eba8111a10a9f7856c9b12e69191/src/scan/lexer_public.ts#L566-L574

And then the performance should be evaluated using the benchmark.

  • https://github.com/SAP/chevrotain/tree/master/benchmark_web

The base implementation does not seem very complicated; the complexity comes from these questions:

  • The performance impact?
    • Could it be mitigated via a flag to avoid collecting exec results?
    • Could it be mitigated by only collecting exec results for specific Token Types?
      • For example, by parsing the RegExp source and detecting whether it defines any capturing groups?
        • Chevrotain already analyzes the RegExp source for other optimizations...
    • Could those mitigations employ a NOOP (an empty function gets inlined and disappears in V8) for better performance?
  • Are modifications needed for custom patterns support?
    • If so, what needs to be updated in the docs and examples?
    • The breaking changes document would need to be updated.
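The "parse the RegExp" mitigation mentioned in the questions above can be sketched with a small probe (an assumption about how it could work, not existing chevrotain code): appending `|` to a pattern's source makes it match the empty string, and the resulting exec array then has one slot per capturing group.

```javascript
// Sketch of the per-TokenType mitigation discussed above (not chevrotain
// code): detect whether a pattern defines capturing groups, so the lexer
// could keep the faster .test() path for token types that have none.
// Appending "|" makes the regex match "" while preserving group slots.
// (Caveat: assumes regex.source stays valid without its original flags.)
function countCaptureGroups(regex) {
  const probe = new RegExp(regex.source + "|");
  return probe.exec("").length - 1;
}

countCaptureGroups(/\d+/);                       // → 0: keep the .test() path
countCaptureGroups(/(\d{2})\.(\d{2})\.(\d{4})/); // → 3: must use .exec()
```

A token type with zero groups could then silently keep the test-based optimization, while only group-bearing patterns pay the .exec cost.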

If you are interested and want to contribute a PR 😄 with at least the base functionality working, we could start exploring some of the performance questions.

Cheers.

bd82 avatar Jan 24 '19 22:01 bd82

If you are interested and want to contribute a PR 😄 with at least the base functionality working, we could start exploring some of the performance questions.

Yes I'm interested.

But too busy with other projects at this time

Understood. I'm working under some difficult deadlines myself at the moment and won't be able to start right away.

Perhaps the speed benefit from avoiding the duplicate matches would be lost due to losing this optimization.

That's a good point. I'll consider this when deciding if there's enough of a benefit in my case. This may be enough of an uncommon use case that it doesn't warrant affecting the performance of the lexer.

Thanks for the pointers to get started.

jabney avatar Jan 24 '19 22:01 jabney

This may be enough of an uncommon use case

I think it is common enough (dates/string literals) to make this use case interesting. The challenge is how to enable this feature only when needed (even per specific Token Type).

Note that a TokenType could be required to explicitly indicate (in createToken) it wants to save capturing groups.

Thanks for the pointers to get started.

Sure, feel free to ask more questions. If/when there is a working prototype, I will experiment on improving its performance.

bd82 avatar Jan 24 '19 22:01 bd82

Note that a TokenType could be required to explicitly indicate (in createToken) it wants to save capturing groups.

That seems better than a more general lexer flag, as it could preserve the matchWithTest optimization for tokens that don't need to capture groups.

Sure, feel free to ask more questions. If/when there is a working prototype, I will experiment on improving its performance.

Great! Will do. 😎

jabney avatar Jan 24 '19 23:01 jabney

closing in favor of #888

bd82 avatar Jan 25 '19 19:01 bd82

Re-opening, as I am treating this as a separate feature that would be implemented on top of #888, by providing wrappers around createToken. See: https://github.com/SAP/chevrotain/issues/888#issuecomment-457698103

bd82 avatar Jun 12 '19 21:06 bd82