expressive-code Feature: option to add transformers to shiki

I'd like an option for the shiki plugin to pass in transformers.

Background: I recently pubilshed a shiki plugin for VSCode-style bracket pair coloring. I'd love to use it in combination with expressive code, but there's no option to pass in shiki transformers.

I believe this change is also required for #95 (I think that probably requires additional work, but I'm not very familiar with twoslash)

I'm willing to contribute this change.

Aug 06 '24 01:08 oatmealproblem

I did a bit of investigation, and this is a little trickier than I first thought. Shiki does not call transformers in codeToTokensBase (which is what expressive code uses). I see two options:

Call highlighter.codeToHast instead of highlighter.codeToTokensBase then process shiki's hast instead of shiki's ThemedTokens (this would be a pretty big change)
Manually call transformer.preprocess and transformer.tokens functions (this is pretty straightforward to implement, but it wouldn't work with transformers that use hooks other than preprocess and tokens)

Aug 06 '24 02:08 oatmealproblem

Hello Michael and welcome!

Yes, being able to use Shiki transformers would be really great. Unfortunately they are basically a slightly different implementation of my plugin concept and were added much later after EC's plugin API was already released, so I wasn't able to align my implementation to Shiki's to be compatible.

I have thought about the options to make them compatible, too, but didn't find any sensible solution so far.

Working with Shiki's hast output (which might have been heavily processed by transformers added by the user!) would make if really difficult to foresee what output I'd be dealing with and introduce lots of possibilities for incompatibilities and bugs.
Re-implementing Shiki's plugin API ourselves and acting like a host to transformers would be challenging as well due to many subtle differences in the hooks and incompatibilities between the hast nodes created by EC and the ones expected by transformers. Creating and having to maintain an adapter for a secondary plugin API will probably become a huge chore as well.
Another more extreme path might be to migrate all of EC's existing plugins to Shiki transformers, make Shiki the new required base processing layer of EC rather than an optional pluggable syntax highlighting engine, and then hope for the best that people will not attempt to combine Shiki transformers that are incompatible with EC's own Shiki transformers.

One of EC's design goals was to offer a batteries included solution that "just works" out of the box, no matter what combination of plugins and features is chosen by the user. I feel like the alternatives I can think of so far would all violate that goal, on top of requiring a ton of work and maintenance effort, which I'm not able to provide on my limited time.

Oh, I almost forgot to add that I really appreciate your thoughtful investigation of the options and offer to help! The challenge is unfortunately not easy to solve, so I dove right into the problems. :) Sorry about that!

Aug 06 '24 08:08 hippotastic

Thanks for the response, and no worries about diving into the difficulties :)

Working with Shiki's hast output (which might have been heavily processed by transformers added by the user!) would make if really difficult to foresee what output

I agree that processing Shiki's hast probably isn't a viable solution. At that point, it could be any sort of html

Re-implementing Shiki's plugin API ourselves and acting like a host to transformers would be challenging as well due to many subtle differences in the hooks and incompatibilities between the hast nodes

What would you think about supporting the pre-hast shiki hooks (preprocess and tokens)? Those changes seem pretty minimal (I have that mostly implemented locally), and they can't do anything crazy with the html that would break EC. The main downside I see is that it wouldn't be clear to the users of transformers whether or not they are EC-compatibility -- they would need to examine the source code and see which hooks are being used. That could be pretty confusing :/

How about instead of exposing transformers to the direct users of EC, you expose them to plugin authors? That would make it easy for shiki transformer authors to wrap their transformer up as an EC plugin, and it's much more reasonable to expect plugin authors to read the documentation about what subset of shiki transformer hooks are supported.

Some more background on my use-case: my bracket colorizing transformer piggy-backs off of shiki's tokenization. Currently, if I wanted to make an EC plugin for it, I would need to call shiki codeToTokens in my plugin. That's (potentially expensive) work that EC has already done, so it'd be nice to not have to do that twice (or even more times for more plugins).

I understand if my use-case is too niche and not worth the implementation and ongoing maintenance. If so, I'll look into to writing it as a EC plugin that calls codeToTokens again, as I mention above, or potentially fork @expressive-code/plugin-shiki to support my usecase without duplicate codeToTokens calls

Another more extreme path might be to migrate all of EC's existing plugins to Shiki transformers, make Shiki the new required base processing layer of EC rather than an optional pluggable syntax highlighting engine

While extreme, that honestly sounds great for unifying the ecosystems. A quick search on npm, I didn't see any EC plugins for using other syntax highlighters. It would be great if that's seriously considered long-term :)

One of EC's design goals was to offer a batteries included solution that "just works" out of the box, no matter what combination of plugins and features is chosen by the user.

I don't think anyone would reasonably expect a "just works" guarantee once they start using non-official plugins, whether that be Shiki transformers or the current EC plugin system. Compatibility is the responsibility of the community plugin authors

Aug 06 '24 15:08 oatmealproblem

I just had a look at your code to see if it would be easy to provide the required tokens to plugins like yours. I noticed that you're using includeExplanation: true in your tokenization call. Is the explanation really necessary for your plugin to function?

The reason why I'm asking is that the explanations seem to be an expensive operation (at least they were when I last tested them many months ago, making the call take almost twice as long - but maybe that has improved in the meantime?).

They are therefore disabled in EC's shiki plugin as EC is used in large-scale documentation projects where builds are already taking 5+ minutes, and I'd like to avoid making these builds even longer.

Aug 06 '24 18:08 hippotastic

Yes, it uses the textmate scopes in explanations for a few things:

avoid coloring brackets in strings and comments
use correct language-specific config for embedded languages (eg TS embedded in a .svelte file)
limit where brackets are matched (eg match < when used for a generic type but not when used as "less than")

I haven't done much benchmarking yet (on my to-do).

I'm open to suggestions, but haven't thought of a better solution myself.

Aug 06 '24 18:08 oatmealproblem

I haven't done much benchmarking yet (on my to-do).

Well, I just did some quick benchmarking. Looks like includeExplanation makes it ~4.7x slower. Definitely not something to enable by default. If you did add a way for a plugin to access the tokens, you could also add some way for a plugin to indicate if it needs token explanations.

My plugin doesn't actually use the full explanations. It just uses the scope names, but not the theme matches. Out of curiosity, I disabled that part of shiki's code. After that, it was only 2x slower. Still not something to enable lightly, but a big improvement vs the 4.7x slowdown! I'll see if the shiki maintainers are open to some sort of scopes-only setting (edit: submitted https://github.com/shikijs/shiki/issues/736)

Aug 07 '24 00:08 oatmealproblem

In shiki 1.13, you can now set includeExplanation: 'scopeName' to only include the scope names but not the theme matches, which only results in a ~2x slowdown instead of ~5x.

@hippotastic are you still considering adding a way for a plugin to manipulate the shiki tokens (and indicating if/what level of token explanation is needed)?

If not, I'll make a plugin that runs its own shiki tokenization; for my use case (code snippets on a blog) I don't expect the slowdown to make much difference either way

Aug 15 '24 14:08 oatmealproblem

Very nice work with the shiki contribution! Yes, I'd be happy to provide a way for plugins to access token information given that the slowdown isn't as drastic as it was before.

I can imagine a plugin signaling that token information is required through a property, which the shiki plugin will then use and attach the extra information to the annotations it creates.

This way, you could loop through the annotations in any of the hooks running after the syntax analysis, and get information about the scope names.

I can't promise how quickly I will get this done, so if you need it working rather soon, you could indeed first do your own shiki tokenization and switch to the built-in version as soon as it's ready. This approach would also help me validate that my API would work for your plugin by attempting to incorporate it into your then existing plugin code.

Does that sound good to you?

Aug 15 '24 14:08 hippotastic

Great :)

attach the extra information to the annotations it creates

loop through the annotations in any of the hooks running after the syntax analysis

It would be easier to adapt my transformer if there was a new plugin hook (like postProcessShikiTokens) where I could manipulate shiki's ThemedToken[][], before even the EC annotations have been created. I imagine that's easier for other shiki transformers too (though I'm not sure how many pre-hast-only shiki transformers there are; probably not too many). That hook would only work if you're is using the core shiki plugin, and I understand if you wouldn't want to add a plugin hook that only functions if a specific (albeit core) plugin is also used.

With your proposal, I would need to reconstruct the tokens based on the annotations, then find the matching brackets, then turn those back into annotations. (And maybe need to manipulate the existing annotations? Not sure how EC handles overlapping annotations.) In that case, I'd probably just stick with an EC plugin that runs its own shiki; slower but simpler.

Aug 15 '24 15:08 oatmealproblem

EC automatically handles overlapping InlineStyleAnnotations for you, so you'd not need to do anything special and not manipulate the existing ones.

Aug 15 '24 15:08 hippotastic

That makes it simpler, but still I'd need to partially reconstruct the tokens from the annotations, process those, then create new annotations. If there was a postProcessShikiTokens hook, I could use my transformer largely as-is to manipulate the tokens, then EC would pick up the highlighted brakcets when it creates annotations from the tokens as normal

Aug 15 '24 15:08 oatmealproblem

Could you please elaborate on why you need to "reconstruct" any tokens? What does that mean and what do you need it for?

I'm not saying that creating a similar hook is an issue, I'd just like to understand the use case better. So far I tried to keep the plugins all separate from each other and not introduce preventable dependencies between them.

Aug 15 '24 16:08 hippotastic

Currently my transformer works directly with shiki's themed tokens, finding matching bracket pairs and assigning them colors. It uses the content, offset, and explanation of each token to find the brackets. (I cannot use just the explanation, due to some edge-cases in how shiki handles whitespace.)

If I understand correctly, you're proposing adding some sort of explanation property to each annotation. I would need to take the annotations and turn them back into Shiki tokens to work with my code. Or change my code to work with some intermediate representation that I create from either shiki tokens (for the shiki transformer) or EC annotations (for the EC plugin). That's what I mean by "reconstructing"; getting the annotations into a form my code can work with.

I tried to keep the plugins all separate

Perhaps rather than a plugin hook, this could be a configuration option for the core shiki plugin. Where you can pass in additional langs, you could also pass in tokensTransformers and includeExplanation.

Aug 15 '24 16:08 oatmealproblem

Thank you for explaining the use case!

I agree that it might be simpler to make this options of the shiki plugin. This would also allow all transformers that just work on the Shiki tokens to be usable as-is with EC.

Aug 15 '24 17:08 hippotastic

I can work on a PR for that if you'd like

Aug 15 '24 17:08 oatmealproblem

Thank you, I'd welcome the contribution!

I hope it's going to be straightforward and won't cause unforeseen type issues etc. when updating Shiki to the latest version. This has sometimes been a bit tricky in the past. :)

Aug 15 '24 17:08 hippotastic

Screenshot from 2024-08-17 15-42-31 So far, everything is going smooth, no type issues. Got colored brackets on my 11ty blog now :) Just need to add some tests now, then I'll submit a PR

Aug 17 '24 20:08 oatmealproblem

@hippotastic I ran into an issue I'd appreciate some input on. The transformers do not work if they change the actual content of the code, since the shiki plugin only adds annotations but doesn't edit text. It works fine for my transformer, since that just changes color not content, but that's not necessarily the case for all transformers. So, a couple options:

simply don't support transformers that edit content; add that caveat to the documentation
- additional option: detect if a transformer edits content and throw an error
if transformers are used, edit the code block to match the transformed tokens
- if you want to do this, I'd appreciate a little guidance. I tested a solution simply uses deleteLines and insertLines to fully rewrite the block. That seems to work, but I'm not sure if there's any risks with that I'm unaware of. For reference, here's the implementation:

// if transformers are present, rewrite the code block to match the potentially transformed content
// only do this for the first style variant, since the content should be the same for all style variants
if (styleVariantIndex === 0 && options.tokensTransformers) {
	const allLineIndices = codeBlock.getLines().map((line, i) => i)
	codeBlock.deleteLines(allLineIndices)
	codeLines = codeBlock.insertLines(
		0,
		tokenLines.map((line) => line.map((token) => token.content).join(''))
	)
}

Aug 18 '24 02:08 oatmealproblem

Deleting and replacing all lines would also delete the annotations created by all other plugins in any hooks that run before shiki, so this is not a feasible solution.

I think the only options we have is to either (A) not support transformers that edit content, detect this and throw an error if they do, or (B) apply a diff algorithm to perform the smallest required edits to get from the original code to the one left after running the transformer.

In order not to complicate your (very appreciated) task of creating the PR, I'd be fine with going with (A) for now and then I'd maybe implement (B) in a future release.

Aug 18 '24 07:08 hippotastic

Sounds good 👍 I can't think of a use-case for tokens transformers that change the underlying text. Seems like anything that could be done with that would be better off as a preprocessor or postprocessor... So maybe no one will need support for content-editing tokens transformers anyways

Aug 18 '24 15:08 oatmealproblem

I agree. However, people will certainly try to pass incompatible transformers to this option ("look, there's transformer support!"), so it's good to have some sort of detection in place for now. :)

Hopefully I'll find a way in the future to more fully support transformers through some bigger architectural changes as discussed.

Aug 18 '24 15:08 hippotastic

This feature has just been released to NPM in v0.36.0. Thank you for suggesting it, giving valuable input in our discussions and leading the development effort to bring it to life. I'm happy to have you in the circle of Expressive Code contributors!

Aug 21 '24 11:08 hippotastic