Run regexes for TM grammars in native JS for perf
A couple of years ago in #165506, @fabiospampinato raised the idea of running TextMate grammars in JS using an Oniguruma-to-JS regex transpiler, both for performance and to potentially remove the large Oniguruma dependency. At the time the benefit was hypothetical, since no regex transpiler written in JS could actually do this. A Ruby library was used as a proof point, but it wouldn't have worked: it's written in Ruby, transpiles Onigmo rather than Oniguruma, wasn't designed to support the way regexes are used in TextMate grammars, and wasn't robust enough to cover the long tail of grammars that often include complex regexes relying on Oniguruma edge cases.
A library now exists (Oniguruma-To-ES) that solves these problems. It's lightweight and has been used for a while by the Shiki library, with support for the vast majority of TM grammars. Here's Shiki's compatibility list, which checks that its JS and WASM engines output identical highlighting results for Shiki's language samples. The issues with the handful of remaining unsupported grammars are well understood -- they are the result of bugs in the grammars (i.e., inclusion of an invalid Oniguruma regex), bugs in Oniguruma, or use of a few extremely rare features that can be supported in future versions or worked around.
Of course, VS Code wants to be a good OSS citizen and not break any grammars. Oniguruma-To-ES (as of v2.0) is up to that challenge at a deep level. Perhaps, as a starting point, a few grammars that perform better with it could be marked to use JavaScript rather than Oniguruma, and then, if everything goes smoothly, its use could be expanded to additional grammars.
In a basic benchmark of Shiki's JS vs WASM engine (using precompiled versions of the grammars that had been pre-run through Oniguruma-To-ES using these options), the JS engine performed faster in some cases including the following examples (all with identical highlighting results compared to the WASM engine):
- Python: ~8.5x faster.
- MDC: ~13.5x faster.
- Markdown: ~3.3x faster.
- CSS: ~2.5x faster.
- SCSS: ~3.5x faster.
- Bash: ~2.6x faster.
- Kotlin: ~1.2x faster.
- Perl: ~1.4x faster.
- PHP: ~1.3x faster.
- Go: ~1.4x faster.
- Objective-C: ~1.3x faster.
These times are based on processing the language samples that Shiki provides; e.g. here's the Kotlin sample.
The JS engine with precompiled regexes is not faster than Oniguruma (via WASM) with all grammars, but there are optimization opportunities (this issue includes an example) that might increase the number of cases where it's faster.
Also note that Oniguruma-To-ES is faster than Oniguruma via WASM with some grammars even when transpiling regexes at runtime (without pre-running a grammar's regexes through it). In fact, Shiki doesn't pre-compile when using its standard JS engine. So it's not necessary to have separate grammar files (with an extra build step) to get some of the benefit.
I've updated the comment above based on updates to the library and more representative perf testing.
Thanks, this sounds very promising!
This feature request is now a candidate for our backlog. The community has 60 days to upvote the issue. If it receives 20 upvotes we will move it to our backlog. If not, we will close it. To learn more about how we handle feature requests, please see our documentation.
Happy Coding!
:slightly_smiling_face: This feature request received a sufficient number of community upvotes and we moved it to our backlog. To learn more about how we handle feature requests, please see our documentation.
Happy Coding!
With the latest library and TM grammar updates, Shiki's JS regex engine (built on Oniguruma-To-ES) supports 100% (all 222) of Shiki's built-in languages. See: https://shiki.style/references/engine-js-compat
Is anyone working on this?
I'm currently trying to monkey-patch ./vscode/node_modules/vscode-oniguruma/release/main.js for testing.
Update: My monkey patch works well now! See https://gist.github.com/kkocdko/b54dcee692deb67a13ec811fba5282c0
Update: Is this really faster? To rule out a mistake on my end, I tried Shiki's demo at https://textmate-grammars-themes.netlify.app (repo); switching to the JavaScript engine in the top bar seems even slower than Oniguruma.
The test input is this. I pressed Enter twice to insert newlines, then Backspace twice, and used F12 to record a performance profile. The results:
Oniguruma (367ms):
JavaScript (2.60s):
The result depends heavily on the highlighted language, of course: more regex patterns mean more match attempts. I see that the demo's findNextMatchSync also uses a simple for loop. Oniguruma can do the matching in a single call and dig into the engine to find which sub-expression matched, but in JS I can't find a way to do this.
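For reference, the loop described above (run each pattern from the current position and keep the earliest match) can be sketched in plain JS. The function name and pattern format here are illustrative, not vscode-oniguruma's actual API:

```javascript
// Sketch of a multi-pattern scanner in plain JS: run every pattern
// from startPos and keep the earliest match (ties go to the pattern
// listed first, as TextMate scanners expect). Oniguruma's native
// scanner does all of this in a single call; in JS we loop.
function findNextMatch(sources, text, startPos) {
  let best = null;
  for (let i = 0; i < sources.length; i++) {
    const re = new RegExp(sources[i], 'gm'); // 'g' enables lastIndex-based search
    re.lastIndex = startPos;
    const m = re.exec(text);
    if (m !== null && (best === null || m.index < best.match.index)) {
      best = { patternIndex: i, match: m };
      if (m.index === startPos) break; // nothing can start earlier
    }
  }
  return best;
}

const result = findNextMatch(['#.*$', '\\d+', '[a-z]+'], 'x = 42 # note', 2);
console.log(result.patternIndex, result.match.index, result.match[0]); // → 1 4 '42'
```

Each pattern is rescanned per call, which is the overhead the comment above points at: one native Oniguruma call replaces N JS regex executions.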
Update: It seems that named capturing groups can do this? I'm trying...
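The named-group idea can indeed report which alternative matched when patterns are joined with |; a minimal sketch (group names invented for illustration — and note that naively concatenating arbitrary Oniguruma patterns this way is not generally safe, since anchors, flags, and numbered backreferences interact badly):

```javascript
// Sketch: wrap each alternative in a named capture group, then check
// which group participated in the match. Group names are invented.
const combined = /(?<num>\d+)|(?<word>[a-z]+)|(?<comment>#.*$)/m;

function whichMatched(text) {
  const m = combined.exec(text);
  if (m === null) return null;
  // Exactly one named group participates; the rest are undefined.
  const name = Object.keys(m.groups).find((k) => m.groups[k] !== undefined);
  return { name, value: m[0], index: m.index };
}

console.log(whichMatched('hello 42')); // { name: 'word', value: 'hello', index: 0 }
```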
Update: IMO we should prefer tree-sitter instead of fighting with this ugly TextMate format.
Probably worth looking at:
- https://github.com/shikijs/shiki
- https://github.com/microsoft/vscode-textmate
- https://github.com/microsoft/vscode-oniguruma
From the original post:
In a basic benchmark of Shiki's JS vs WASM engine (using precompiled versions of the grammars that had been pre-run through Oniguruma-To-ES using these options), the JS engine performed comparably for many grammars, faster for some, and slower for others. [...] These times are based on processing the language samples that Shiki provides
@kkocdko
Update: Is this really faster? [...] The result depends heavily on the highlighted language, of course: more regex patterns mean more match attempts. I see that the demo's findNextMatchSync also uses a simple for loop. Oniguruma can do the matching in a single call and dig into the engine to find which sub-expression matched, but in JS I can't find a way to do this.
You cannot simply concatenate Oniguruma regex patterns together with | (for a variety of reasons), so yes, native Oniguruma being able to run multiple searches in a single call is an advantage that overtakes native JS performance in some cases.
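One of those reasons can be shown with plain JS regexes: alternation renumbers capture groups, which silently breaks numbered backreferences.

```javascript
// Naive concatenation with | renumbers capture groups, so numbered
// backreferences point at the wrong group.
const doubled = /(b)\1/;            // matches a doubled "b"
console.log(doubled.exec('bb')[0]); // 'bb'

// After concatenation, (b) becomes group 2, but its \1 still refers to
// group 1. In JS a backreference to a non-participating group matches
// the empty string, so the pattern silently changes meaning.
const combined = /(a)\1|(b)\1/;
console.log(combined.exec('bb')[0]); // 'b', not 'bb'
```

(Oniguruma handles non-participating groups differently again, which is exactly the kind of edge case a transpiler has to account for.)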
Note that I was using Shiki's precompiled versions of grammars (with Shiki's createJavaScriptRawEngine), and it sounds like you're not doing that here. Precompiled grammars avoid the need to transpile the regexes at runtime. There is an open Shiki issue https://github.com/shikijs/shiki/issues/918 for precompiled grammars (not known when I first posted here) that prevents them from working correctly with some languages (more than a third) unfortunately. It can be fixed but hasn't been prioritized since the standard createJavaScriptRegexEngine (which transpiles regexes at runtime) is good enough for common cases and avoids the need to download the large WASM bundle, and avoiding that is often the main reason to use the JS engine. Interest from VS Code in using precompiled grammars could probably accelerate a fix.
But yeah, whether or not you're using precompiled grammars, the performance comparison with WASM depends on the specific grammar being used and on the text being highlighted, as I mentioned in the quote above. Anecdotally, on the Shiki grammar playground page you linked to, I'm seeing faster numbers for JS highlighting (which does not use precompilation) than for Oniguruma, via the timer the playground reports at the bottom.