Make it easier to use standard iterators as a token source
Is there an existing issue for this?
- [x] I have searched the existing issues
What problem does this feature solve?
Currently Formatter.Format accepts an Iterator argument, but it's hard (or at any rate less performant) to adapt a standard library iterator (e.g. iter.Seq[chroma.Token]) into this form.
Specifically, we can do it with something like this code, but iter.Pull is considerably slower than a straight function invocation, so this is less than ideal. Alternatively, we could use slices.Collect to collect the tokens into a slice and then define a simple chroma.Iterator over those items, but that ends up allocating the entire token slice twice, because the first thing that Format does is collect all the tokens into a slice!
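For concreteness, the two workarounds described above might look roughly like the following. This is only a sketch, not Chroma's code: Token and Iterator are simplified local stand-ins for chroma.Token and chroma.Iterator, the zero Token stands in for the end-of-stream token, and pullAdapter, collectAdapter, sampleTokens and drain are hypothetical names.

```go
package main

import (
	"fmt"
	"iter"
	"slices"
)

// Token and Iterator are simplified stand-ins for chroma.Token and
// chroma.Iterator, declared locally so the sketch compiles on its own.
type Token struct {
	Type  string
	Value string
}

// Iterator returns the next token on each call. In this sketch the zero
// Token marks the end of the stream (an assumption made for illustration).
type Iterator func() Token

// pullAdapter converts a push-style sequence with iter.Pull. Every call to
// the returned Iterator goes through iter.Pull's machinery, which is the
// overhead the issue is concerned about.
func pullAdapter(seq iter.Seq[Token]) (Iterator, func()) {
	next, stop := iter.Pull(seq)
	return func() Token {
		tok, ok := next()
		if !ok {
			return Token{} // end of stream
		}
		return tok
	}, stop
}

// collectAdapter buffers the whole sequence with slices.Collect and then
// walks the slice. A consumer that collects again ends up with two copies.
func collectAdapter(seq iter.Seq[Token]) Iterator {
	toks := slices.Collect(seq)
	i := 0
	return func() Token {
		if i >= len(toks) {
			return Token{}
		}
		tok := toks[i]
		i++
		return tok
	}
}

// sampleTokens is a hypothetical producer used only to exercise the adapters.
func sampleTokens() iter.Seq[Token] {
	return func(yield func(Token) bool) {
		for _, t := range []Token{{"Keyword", "package"}, {"Text", " "}, {"Name", "main"}} {
			if !yield(t) {
				return
			}
		}
	}
}

// drain concatenates token values until the end-of-stream marker.
func drain(it Iterator) string {
	var s string
	for tok := it(); tok != (Token{}); tok = it() {
		s += tok.Value
	}
	return s
}

func main() {
	it, stop := pullAdapter(sampleTokens())
	defer stop() // releases the iterator if the consumer stops early
	fmt.Println(drain(it))
	fmt.Println(drain(collectAdapter(sampleTokens())))
}
```

Note that the iter.Pull variant also obliges the caller to invoke stop if it abandons the iterator early, which a plain func() Token has no way to express.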
What feature do you propose?
The easiest thing here might be to provide an entry point that takes a slice rather than an iterator: essentially just export Formatter.writeHTML as is.
Alternatively, perhaps define
type IteratorV2 iter.Seq[Token]
and define new entry points in terms of that (it's straightforward to adapt the old entry points to use that interface).
Alternatively (most invasive) define an entirely new API in terms of the new stdlib iterator standard.
Coincidentally, I just created a branch last week that switches Chroma to stdlib iterators. It would be a breaking change though, so I'd need to cut a v2.
Though what's the actual use case that you're thinking of?
The actual use case we're thinking of is syntax highlighting in the CUE Central Registry source view (not landed yet, so I can't link to an actual example). The tokenization logic is more complex than is easily expressed with regular expressions, so it seems to make sense to use the actual scanner package to determine tokens. However, that scanner does not return tokens for white space, so there's not a one-to-one correspondence between calls to the token iterator function and calls to Scanner.Scan.
With a push-based iterator, it's pretty trivial to write the token producer; something like this (untested): https://go.dev/play/p/Rl-KN4lurS1
With a pull-based iterator such as chroma.Iterator it's harder to do and quite easy to get wrong: we need to store a pending token while we return the white space that precedes it, and then return the stored token on a subsequent call.
Hope that makes sense!
I'd very much support making a v2 version if you're up for it: it makes a lot of sense to use the standard iterator pattern now that there is one.
> Coincidentally, I just created a branch (https://github.com/alecthomas/chroma/pull/1144) last week that switches Chroma to stdlib iterators. It would be a breaking change though, so I'd need to cut a v2.
In case you might find it useful, I added a few drive-by comments on that PR.