shiki Some precompiled grammars aren't showing correct highlighting

Some but not all JS raw (precompiled) grammars lead to incorrect highlighting. Examples where this is happening include python, html, perl, and yaml.

Compare the following two screenshots for highlighting Shiki's Python sample:

Using the WASM or JS engine (correct)

Using the JS Raw engine with the precompiled grammar (incorrect)

This gives the same broken result for Python with all the versions I tested (Shiki 2.3.1, 2.3.0, 2.2.0, and 2.0.3).

Feb 04 '25 23:02 slevithan

That's indeed weird. Can you reproduce that in the tests as well?

Feb 05 '25 16:02 antfu

Can you reproduce that in the tests as well?

Yes.

I just updated engine-javascript/test/raw.test.ts to additionally import @shikijs/langs-precompiled/python and then added the following test:

  expect(
    shiki.codeToHtml('"""test"""', { lang: 'python', theme: 'vitesse-light' }),
  )
    .toMatchInlineSnapshot(`"<pre class="shiki vitesse-light" style="background-color:#ffffff;color:#393a34" tabindex="0"><code><span class="line"><span style="color:#B5695977">"""</span><span style="color:#B56959">test</span><span style="color:#B5695977">"""</span></span></code></pre>"`)

That inline snapshot is from the non-raw JS engine.

The test failed with this error:

Expected: ""<pre class="shiki vitesse-light" style="background-color:#ffffff;color:#393a34" tabindex="0"><code><span class="line"><span style="color:#B5695977">"""</span><span style="color:#B56959">test</span><span style="color:#B5695977">"""</span></span></code></pre>""

Received: ""<pre class="shiki vitesse-light" style="background-color:#ffffff;color:#393a34" tabindex="0"><code><span class="line"><span style="color:#B5695977">"""</span><span style="color:#B56959">test"""</span></span></code></pre>""

Just an idea: Is it possible that some regexes (perhaps due to some character in their source) are not being serialized accurately into the precompiled output?

Feb 05 '25 18:02 slevithan

By looking at the Python grammar's begin/end regexes for docstrings (which is where the problem starts in the screenshot and the test above), I've realized the cause of the problem.

vscode-textmate has a hacky system for replacing things that look like numbered backreferences in end/while patterns with the text actually matched by that capture number in the start pattern. In this way, an end/while pattern can refer back to something matched in its start pattern.

Since, for precompiled languages, the patterns don't go through this match-time replacement step, any languages that rely on this feature don't work correctly.

This means some precompiled end/while regexes need a fundamental change. Most likely, the specific affected regexes shouldn't get precompiled. More specifically, it would fix the problem if we stopped precompiling end/while patterns that are matched by JS regex /\\\d/ (which is a flawed check for numbered backrefs for several reasons, but it's what vscode-textmate looks for), and started transpiling that subset of Oniguruma patterns into native JS regexes at runtime.

An alternative, non-recommended path could be to instead dynamically update the precompiled regexes at runtime (i.e., doing a runtime replacement on the generated JS RegExp pattern rather than the original Oniguruma pattern). But in that case there would be two additional problems. These are explained below in detail, but the short version is that I don't think this alternative path is viable.

(Edit: Collapsed the details, because it might be confusing and likely won't work well anyway.)

Issue 1. Behavior for backreferences to nonparticipating capturing groups (NPCGs)

There's a difference between Oniguruma and JavaScript behavior for backreferences to NPCGs.

In JavaScript, due to a regex engine design mistake that's been there since the beginning (ES3), any backreference to a capture that hasn't (yet) participated in the match matches the empty string.
In every other modern regex flavor (including Oniguruma), such backreferences fail to match, causing the regex engine to backtrack or fail.

More context:

TextMate grammars, through the feature of sharing backreferences across patterns (IMO this was a really bad idea), invites TM grammar authors to write end/while patterns that are invalid Oniguruma regexes (with backreferences that refer back to nothing and would be an error, or that reference a capture in their own pattern that will not be respected).
Oniguruma-To-ES enables this TM system to work via its rules.allowOrphanBackrefs option, which not only suppresses the validation for numbered backreferences to captures that don't exist, but also adds the necessary number of additional captures needed at the end of the regex to make it so that the backreference isn't an error in JS (when using flag u/v).

So far, there's no problem. But because of the Oniguruma/JS difference in NPCG handling, backreferences to captures that can be determined at compile time to not be able to participate at that point in the regex are converted to (?!), which never matches. This is so that it follows the Oniguruma behavior. Actually there are multiple potential outcomes depending on how a backref appears:

It's a validation error when the referenced capturing group in the original pattern is to the right of the backreference (but this error is suppressed for numbered backrefs by allowOrphanBackrefs).
Duplicate group names (and subroutines) to the right of the backreference are not included in the backreference's multiplexing. (Backref multiplexing is an Oniguruma-specific feature where backrefs to duplicate group names rematch any of the values matched by groups of that name.)
The backreference fails to match (or leaves out of multiplexing) ancestor groups and groups in preceding alternation paths.

It's not important to grok all of this. At the end of the day, it just means that Oniguruma-To-ES is amazingly able to match Oniguruma's behavior for NPCGs in native JS regexes. But it's relevant because it means that the numbered backref that vscode-textmate will look for (in its hacky match-time pattern replacement) might not be present in the generated pattern.

A real example is the (\1) that the Python grammar uses as its end pattern for docstrings. To match Oniguruma behavior in JS, this gets converted to ((?!)). Now there's no '\\1' to be replaced by vscode-textmate. This is because there's an existing capturing group 1 in this pattern which hasn't yet finished participating when we see the \1 backreference to it.

It wouldn't help to add a property like _rawSource to all generated regexes and then have vscode-textmate (or some inlined code in the JS Raw engine) do a replacement on that, because the raw source is an Oniguruma pattern so we wouldn't be able to use it anyway in a precompiled grammar that doesn't do runtime transpilation.

I could update the allowOrphanBackrefs option to also not convert NPCG backrefs to (?!), and then hope for the best: that no real-world grammars rely on accurate reproduction of this Oniguruma behavior (which is probably true, since only deep regex experts would intentionally rely on it -- most people aren't even aware of the difference). I think this change would be fine to make, and we'd know right away via the JS engine report whether this causes a problem for any of the Shiki language samples. With this change to the allowOrphanBackrefs option, all of the numbered backreferences would remain in place, so could be targeted by the search and replace that TextMate engines do. And then the updated pattern with merged backrefs could just be passed to RegExp, without needing runtime transpilation.

That much I'd be comfortable with, but then there's another issue...

Issue 2. Searching in the generated JS pattern rather than the original Oniguruma pattern

Oniguruma-To-ES uses numbered backreferences for generated source even when the original pattern used named backreferences. This could lead to TextMate's backref-merging system replacing backreferences that it shouldn't (since vscode-textmate only looks for numbered backrefs, not named). And this isn't fully avoidable due to a variety of complexities (e.g. duplicate group names, backref multiplexing, recursion). Things might still end up okay for all of the current samples/grammars (I'm not sure), but we'd be introducing a potential bug vector, due to doing backref replacements on generated JS regex source rather than the original Oniguruma pattern.

I have ideas on how this bug vector could be avoided, but it would introduce several unfortunate layers of complexity in Oniguruma-To-ES and the JS Raw engine.

But leaving the non-recommended path aside...

As mentioned up front, the right path is probably runtime transpilation of end/while patterns that contain numbered backrefs. That would lose some of the benefit of precompilation for languages that are affected by this issue, because in those cases it would require importing all of oniguruma-to-es rather than just its EmulatedRegExp class. But the perf benefit of not transpiling the vast majority of regexes in the grammars at runtime would remain.

But yeah, right now a lot of the precompiled languages (more than a third) are broken because of this deep flaw (sometimes severely broken). To see the scale of the problem--i.e., the number of grammars relying on this unfortunate backref-merging feature--you can change allowOrphanBackrefs to false in engine-compile.js and then run the JS engine compat report. The number of supported grammars drops from 217 currently to 137. That is possibly underestimating, since in cases where there are enough capturing groups in the end/while pattern it won't lead to an error. But in those ~80 affected precompiled languages, it is likely only affecting a small number of regexes each.

I'd be happy to make changes to Oniguruma-To-ES if helpful, but I'd rely on you for updates to the precompiled language system (doing the runtime replacement on generated regex source and passing that to RegExp).

Alternatively, precompiled languages could simply be dropped / not provided for the ~80 that are currently broken. That wouldn't be my preference, but that might be needed anyway in the short term if will take longer to implement a solution.

What do you think?

Feb 06 '25 19:02 slevithan

both vscode-textmate and TextMate 2.0 physically replace the backreferences in end/while with the text captured in begin they do both escape most of regex's meta characters but there are some differences vscode-textmate TextMate 2.0 bug report

vscode-textmate: replaces any backreference in the form \\[0-9]+ including \\00009999999 this does mean that \\0 (and \\0000000) matches the entire begin text and not the octal escape null any backreferences that do not capture text or do not exist (in the begin regex only!!) are simply replaced with nothing (not (?!))

TextMate 2.0: only replaces backreferences \\[0-9], backreferences 0 to 9 (10 total) \0 backreferences the begin while \00 is a backreference \0 and literal 0 https://github.com/textmate/textmate/blob/master/Frameworks/parse/src/parse.cc#L136

begin = 'abc',
end = '\12'

the backference \12 matches the text 2 if you change abc to ab(c), then it matches c2 while vscode matches nothing. cause capture group 12 doesn't exist

you can then do things like this

"begin": "[0-9]+",
"end": "a{\\0}"

which will match any number then proceed to match that many a's sometime after it

I've used this in my YAML grammar for numeric indented block-scalars

"begin": "(?>(\\|)|(>))(?<chomp>[+-])?+([1-9])(?(<chomp>)|\\g<chomp>)?+",
"while": "\\G(?> {\\4}| *+($|[^#]))"

or switching between atomic and non-capturing

"begin": "([:>])",
"end": "(?\\1abc)"

Feb 06 '25 21:02 RedCMD

Thanks—it's definitely helpful to know the precise details. However, just so we're clear for other people reading along, none of that is directly related to the what I posted. E.g., my mention of replacing some NPCG backrefs with (?!) during transpilation was not related to anything vscode-textmate is doing.

both vscode-textmate and TextMate 2.0 physically replace the backreferences in end/while with the text captured in begin they do both escape most of regex's meta characters but there are some differences vscode-textmate TextMate 2.0 bug report

Funny enough, I recognize the list of chars vscode-textmate is escaping. It comes (directly or indirectly) from my XRegExp library, which has used exactly that list going back more than 15 years. It's the relevant list for XRegExp's regex flavor, but not precisely appropriate for native JS RegExp or Oniguruma.

Aside: Your vscode-textmate bug report spells out the \s whitespace chars as space, tab, and line feed. But since it's a JS \s in vscode-textmate's source, it matches much more than that. JS's \s matches everything in \p{White_Space}, plus U+FEFF, minus U+0085 (which is different from Oniguruma's \s in those latter two details).

"begin": "[0-9]+",
"end": "a{\\0}"

"begin": "(?>(\\|)|(>))(?<chomp>[+-])?+([1-9])(?(<chomp>)|\\g<chomp>)?+",
"while": "\\G(?> {\\4}| *+($|[^#]))"

"begin": "([:>])",
"end": "(?\\1abc)"

Oh my god, that's creative but cursed. 😆

IMO it's a bad idea to rely on using this in ways that wouldn't work if the search for backrefs got smarter. These examples are cool, but are (intentionally) subverting the intention of undocumented (and poorly designed) behavior that could change. You're also preventing these patterns from ever being evaluated/validated as regexes, without first going through TM-specific runtime mangling.

Feb 06 '25 23:02 slevithan

Now I see the problem. I need to dig deep to see if textmate could deviate multiple regex from one pattern, in that case it would be a bit trickier to engineer. If it's 1:1 mapping than we could the precompile applying textmate's changes (which we are not doing right now).

Feb 18 '25 09:02 antfu

If it's 1:1 mapping than we could the precompile applying textmate's changes (which we are not doing right now).

Not sure I'm understanding correctly, but I don't think we can do that. The replacement in the end/while regex uses part of the actual text matched by the corresponding begin pattern, so the replacement value varies based on the text you are highlighting.

Feb 18 '25 09:02 slevithan

In that case, we would bail out those patterns from precompile, and list them out, which would require the regular engine to run (so partially pre-compiled)

Feb 18 '25 10:02 antfu

Yes, I think that's exactly what's needed. The affected regexes that need the bail out are any that are specifically an end or while pattern and return true for /\\\d/.test(pattern).

Dynamic replacements are definitely needed for those few regexes at runtime. The question I was trying to address in my long comment above (which maybe had too much detail and wasn't very clear) was whether you could pre-transpile the affected regexes and then, at runtime, do a replacement on the pre-transpiled (JS RegExp) source to create the new (dynamic) regex. I argued that that's problematic for a couple reasons, and instead the replacement probably needs to happen on the original Oniguruma pattern. Only after the runtime replacement should such regexes then be transpiled to JS.

Feb 18 '25 10:02 slevithan

depending on the complexity of the regex you could precompile every possible outcome for simple cases like ("|') it'll work well

Feb 18 '25 11:02 RedCMD

Although I agree that would be cool for very simple cases with only a handful of possible exact matches (" or ' in your example), it's not the right place to start since (1) it wouldn't be reasonable or possible for all cases anyway, and (2) it would add a lot of complexity to the precompiled grammar system.

At a high level, doing that would require the following:

While precompiling the grammar, identify affected end/while regexes.
- Determine how many substitutions are needed. If it's a low number (one, maybe two), then...
- Do AST-based analysis of the corresponding begin pattern. If the corresponding capturing group numbers can only match a small number of possible strings, make all possible transformations to an AST for the end/while pattern and generate the resulting pattern for each change.
Add a new system that, before running an affected end/while pattern at runtime, checks what was captured in subpatterns of the corresponding begin pattern, and then selects the corresponding precompiled end/while regex to run.

So yeah, not trivial, and you'd still need the alternative system on top of that which simply transpiles at run time, for cases that can't be precompiled in that way because they're too dynamic.

Feb 21 '25 18:02 slevithan