feat: optimize unoptimizable token types
Currently, when the lexer cannot apply the "first character" optimization to even a single token type, it falls back to an un-optimized algorithm for all token types. The same behavior can be observed when using the `safeMode: true` property on lexers. While this generally makes sense, it leads to major performance regressions (possibly by orders of magnitude) when a token type cannot be equipped with the `start_chars_hint` property (for example, token types based on a unicode regexp).
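To make the difference concrete, here is a minimal sketch (the token names are made up for illustration):

```ts
import { createToken } from "chevrotain";

// Optimizable: the possible first characters ("0"–"9") can be derived
// from the pattern, so this token type can be indexed by first character.
const Integer = createToken({ name: "Integer", pattern: /\d+/ });

// Not optimizable: the first characters of a unicode-property regexp cannot
// be analyzed, so without a hint this token type disables the optimization.
const UnicodeWord = createToken({ name: "UnicodeWord", pattern: /\p{L}+/u });

// The existing workaround: explicitly list the possible first characters.
const HintedUnicodeWord = createToken({
  name: "HintedUnicodeWord",
  pattern: /\p{L}+/u,
  start_chars_hint: ["a", "b", "c" /* ...all expected first characters... */],
});
```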
This PR applies a major optimization for the case where the "first character" optimization fails for individual token types: instead of disabling the optimization entirely, the lexer now attempts each non-optimizable token type on every tokenization attempt, in addition to the optimized token types. More precisely, the lexer attempts all non-optimizable token types together with the optimized token types in the order in which they were defined (this ordering is important, as we would likely get unexpected matches otherwise!).
If there is no optimized token type for the character at the current position, the lexer now attempts to match all non-optimizable token types and only yields a lexer error if none of them match.
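For illustration, here is a simplified sketch of the resulting candidate selection. This is not the actual `lexer.ts` code; `definitionOrder` is a made-up stand-in for the lexer's internal bookkeeping of token definition order.

```ts
import type { TokenType } from "chevrotain";

function candidateTokens(
  charCode: number,
  optimizedByCharCode: Map<number, TokenType[]>,
  nonOptimizable: TokenType[],
  definitionOrder: Map<TokenType, number>
): TokenType[] {
  const optimized = optimizedByCharCode.get(charCode);
  if (optimized === undefined || optimized.length === 0) {
    // No optimized candidate for this character: attempt only the
    // non-optimizable token types; a lexer error is emitted if none match.
    return nonOptimizable;
  }
  // Attempt both groups merged in their original definition order, since
  // trying them in any other order could change which token type matches.
  return [...optimized, ...nonOptimizable].sort(
    (a, b) => definitionOrder.get(a)! - definitionOrder.get(b)!
  );
}
```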
The new behavior mirrors the current behavior exactly: none of the existing tests can detect that anything has changed in the lexer's output. This change is a pure performance improvement for non-optimizable token types. Of course, I added new tests to assert that the new code behaves as expected. Furthermore, using `ensureOptimizations: true` when creating a lexer still throws an error when attempting to use non-optimizable tokens, since this is not a perfect optimization (the best optimization is still to supply `start_chars_hint` for those tokens).
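For reference, this is the kind of construction that is still expected to throw (reusing the unoptimizable token type from the sketch above):

```ts
import { createToken, Lexer } from "chevrotain";

const UnicodeWord = createToken({ name: "UnicodeWord", pattern: /\p{L}+/u });

// ensureOptimizations still insists on the "first character" optimization,
// so constructing this lexer with a non-optimizable token type is expected
// to throw during lexer initialization.
const strictLexer = new Lexer([UnicodeWord], { ensureOptimizations: true });
```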
I haven't written a benchmark for Chevrotain yet to tell whether this yields the expected performance improvements for non-optimizable tokens, but I can already tell that it does not decrease performance for the normal use case (I'm not sure why the performance for JSON increases; I suspect the first benchmark I run is always a bit slower, even when running everything multiple times):
Note that `lexer.ts` currently does not have 100% test coverage. My new tests cover the newly added/changed lines, but they do not raise the overall coverage to 100%.
Thanks @msujew, I will look at this soon (hopefully over the weekend...)
@bd82 Do you think you will have time to look at this soon? We're currently building a language server for PL/I, which features configurable logical `or` and `not` tokens. Since these require function-based matching with unknown start characters (the operator text is user-supplied via configuration), the lexer is currently unbearably slow for us. This change would drastically improve lexing performance :)
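To illustrate why (a hypothetical sketch, not our actual PL/I token definitions): the operator text only becomes known at runtime, so the token has to use a custom pattern function and its start characters cannot be hinted statically.

```ts
import { createToken } from "chevrotain";

// Hypothetical helper: the operator text comes from user configuration,
// so the lexer cannot know its possible first characters ahead of time.
function configurableOperator(name: string, configuredText: string) {
  return createToken({
    name,
    // Custom pattern function: match the configured text at the current offset.
    pattern: (text, offset) => {
      const candidate = text.substring(offset, offset + configuredText.length);
      return candidate === configuredText ? ([candidate] as [string]) : null;
    },
    line_breaks: false,
  });
}

// e.g. a user might configure "|" (or "OR") as the logical-or operator.
const LogicalOr = configurableOperator("LogicalOr", "|");
```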
FYI, this is no longer as urgent, since we are building our own lexer now. Nevertheless, I think this is a good improvement.
Hello @msujew
I have not had time to maintain Chevrotain recently. I have a little more time now, so hopefully I'll get around to this PR as well...