chroma
Support for Sublime Text syntax definitions
I'd ❤️ more complex syntax highlighting similar to https://github.com/trishume/syntect
I want to highlight ObjC and ARM64 assembly and it looks great with the Rust highlighters, but not so much with the pygments based lexers.
I know that's a big ask, but was hoping it might already be on your radar? 🙏
I thought Sublime Text definitions are simpler and less complex, because they are just yaml files; you can't parse complex syntaxes with them. Is there something more to them?
You can do whatever you want with Chroma's Emitters and Mutators. Is there something you can't do with them? Besides, if YAML files can highlight those languages, there shouldn't be any need for custom Emitters and Mutators.
Ah perhaps the pygments lexers for ARM64 asm and ObjC are just not very thorough then? I'll try to get you a side-by-side comparison tomorrow. That'd be great if the solution was just to create a better ARM64 asm/ObjC lexer 👍
Here is the bat ST syntax file for ARM assembly - https://github.com/sharkdp/bat/blob/master/assets/syntaxes/02_Extra/Assembly%20(ARM).sublime-syntax
So you think that this could also be described in the pygments lexer syntax?
Oh! Would it perhaps be possible to then consume Sublime Text/TextMate style syntax defs and convert them to pygments style lexers? I could have sworn I've tried that in the past and it didn't work out, but that was a few years ago I think?
> Ah perhaps the pygments lexers for ARM64 asm and ObjC are just not very thorough then?

I haven't looked at them, but unfortunately that is the case for many lexers. Another thing to keep in mind is that a particular theme might not highlight some tokens (which is, again, unfortunately the case for many themes). Try the doom-one themes to be sure, or just look at the tokens.
Sublime might also have more token types. If that is the case, then more token types may need to be added to Chroma, but that would mean the themes have to be modified to support those tokens.
> Oh! would it perhaps be possible to then consume sublime text/tmate style syntax defs and convert them to pygments style lexers?

There is already a converter for pygments lexers, so it is probably possible (unless there is something about Sublime syntax definitions I'm unaware of). I don't know how difficult it would be, though.
This is indeed on my radar! I have a local branch from a while back for this that builds a Chroma syntax on the fly from a Sublime syntax file.
The process is relatively straightforward in theory: parse the .sublime-syntax file and build a Chroma lexer dynamically. Unfortunately there are some complications:
- The regex engine used in Sublime syntax files is Oniguruma. Its syntax is very complex and there is no equivalent in Go. This is probably a deal breaker.
- The format of Sublime syntax files is also quite complex - though it is well documented, implementing all of the edge cases would be a significant amount of work. You can see this reflected in syntect's parser.
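To make the "parse the file and build a lexer dynamically" idea a bit more concrete, here is a rough sketch of the translation step. Everything in it is an assumption for illustration: `SublimeRule`, `LexerRule`, `Translate` and the scope table are hypothetical names, not Chroma's API and not the code from the local branch mentioned above. It also uses Go's standard `regexp` package as a stand-in engine, so it fails on exactly the kind of Oniguruma-only syntax described in the first bullet.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// SublimeRule is a hypothetical, heavily simplified stand-in for one
// `match` entry inside a .sublime-syntax context.
type SublimeRule struct {
	Match string // regex source as written in the syntax file
	Scope string // e.g. "comment.line" or "keyword.control"
	Push  string // name of the context to push, "" for none
	Pop   bool   // whether the rule pops the current context
}

// LexerRule is a hypothetical compiled rule (not Chroma's real type).
type LexerRule struct {
	Pattern *regexp.Regexp
	Token   string
	Push    string
	Pop     bool
}

// scopeTable is an illustrative scope-prefix to token-name mapping.
var scopeTable = []struct{ prefix, token string }{
	{"comment", "Comment"},
	{"keyword", "Keyword"},
	{"string", "String"},
	{"constant.numeric", "Number"},
}

// scopeToToken returns the token for the first matching scope prefix.
func scopeToToken(scope string) string {
	for _, e := range scopeTable {
		if strings.HasPrefix(scope, e.prefix) {
			return e.token
		}
	}
	return "Text"
}

// Translate compiles every rule of every context. It returns an error
// as soon as the stand-in regex engine rejects a pattern.
func Translate(contexts map[string][]SublimeRule) (map[string][]LexerRule, error) {
	out := make(map[string][]LexerRule, len(contexts))
	for name, rules := range contexts {
		for _, r := range rules {
			// Anchor the pattern so it only matches at the current position.
			re, err := regexp.Compile(`\A(?:` + r.Match + `)`)
			if err != nil {
				return nil, fmt.Errorf("context %q: unsupported pattern %q: %w", name, r.Match, err)
			}
			out[name] = append(out[name], LexerRule{
				Pattern: re,
				Token:   scopeToToken(r.Scope),
				Push:    r.Push,
				Pop:     r.Pop,
			})
		}
	}
	return out, nil
}

func main() {
	contexts := map[string][]SublimeRule{
		"main": {
			{Match: `;.*`, Scope: "comment.line"},
			{Match: `"`, Scope: "string.quoted", Push: "string"},
		},
		"string": {
			{Match: `"`, Scope: "string.quoted", Pop: true},
		},
	}
	rules, err := Translate(contexts)
	if err != nil {
		fmt.Println("translate failed:", err)
		return
	}
	fmt.Println("contexts translated:", len(rules))
}
```

The real format has many more features (includes, `embed`, variables, prototypes and so on), which is where most of the work mentioned in the second bullet would go.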
Another alternative is TextMate syntax files but alas, they too rely on Oniguruma.
There are Oniguruma packages for Go but they are C bindings, which would be onerous for Chroma to rely on.
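For context on why the choice of engine matters at all: Go's standard `regexp` package follows RE2 and rejects lookarounds, backreferences and atomic groups at compile time, so patterns that use them need a different engine. A small check of that, using only the standard library (the `Unsupported` helper is just a name made up for this sketch):

```go
package main

import (
	"fmt"
	"regexp"
)

// Unsupported reports whether Go's standard regexp package refuses
// to compile the given pattern.
func Unsupported(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err != nil
}

func main() {
	patterns := []string{
		`foo(?=bar)`,  // lookahead
		`(?<=foo)bar`, // lookbehind
		`(\w+)\s+\1`,  // backreference
		`(?>a+)b`,     // atomic group
		`[a-z]+\d*`,   // plain pattern, accepted
	}
	for _, p := range patterns {
		fmt.Printf("%-14s rejected by stdlib regexp: %v\n", p, Unsupported(p))
	}
}
```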
It seems for me recently that ALL roads lead to cgo.... I HATE cgo!! It ruins all that is great about Go. I'm very glad that you are already thinking about this.
Have you looked at the C? I've re-written a few C libs to Go, it is always painful, but maybe Oniguruma isn't that big?
```
❯ loc
--------------------------------------------------------------------------------
 Language             Files        Lines        Blank      Comment         Code
--------------------------------------------------------------------------------
 C                       86        93278        10107         2764        80407
 C/C++ Header             7         3250          414          282         2554
 Python                   7         1858          356          157         1345
 Markdown                 3         1404          498            0          906
 Makefile                 6          569          121           12          436
 HTML                     2          387           33            0          354
 Plain Text               2          268           47            0          221
 Autoconf                 7          306           41           85          180
 Bourne Shell             6          107           35            9           63
 C++                      1           45            9           15           21
 Batch                    3           15            0            0           15
--------------------------------------------------------------------------------
 Total                  130       101487        11661         3324        86502
--------------------------------------------------------------------------------
```
😩 🔫
I don't think this is still necessary, but I mentioned earlier adding a side-by-side:
chroma w/ armasm lexer and nord style
bat w/ arm lexer and nord style
@alecthomas It looks like chroma is using regexp2 now, which appears to have support for lookarounds and the like. Is this still a blocker?
Yes. Chroma has always used regexp2, but it does not support all of the syntax that Oniguruma does.
Is all of that needed? syntect has the option to use the fancy-regex crate, which seems to boast roughly the same feature set as regexp2.
If there is anything missing from regexp2 then I can work on porting fancy-regex to Go if that helps. It looks to be ~5k lines of Rust, so it would likely only take a couple of weeks.
It's needed insomuch as any Sublime syntax definition can use any of Oniguruma's syntax it wants. As I mentioned before, I wrote a partial Sublime syntax parser, but regexp2 was unable to drive it due to missing syntax.
There are two aspects to the work:
- A sufficiently capable regexp parser.
- A parser/translator for the Sublime syntax definition files.
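One way to get a feel for the size of the first item is to scan existing syntax definitions for constructs that are specific to Oniguruma. The sketch below looks for a handful of them: `\h`/`\H` (hex-digit class), `\g<name>` (subexpression call), `\K` (keep) and `(?~...)` (absence operator), all taken from Oniguruma's syntax documentation. Whether regexp2 rejects each of these is an assumption that would need to be verified, the list is far from complete, and `FindOnigOnly` is a made-up name for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// FindOnigOnly returns a description of each Oniguruma-specific construct
// found in the pattern. It is a plain scanner, not a regex parser.
func FindOnigOnly(pattern string) []string {
	var found []string
	for i := 0; i < len(pattern); i++ {
		switch {
		case pattern[i] == '\\' && i+1 < len(pattern):
			next := pattern[i+1]
			switch {
			case next == 'h' || next == 'H':
				found = append(found, `\`+string(next)+` (hex-digit class)`)
			case next == 'g' && i+2 < len(pattern) && pattern[i+2] == '<':
				found = append(found, `\g<...> (subexpression call)`)
			case next == 'K':
				found = append(found, `\K (keep)`)
			}
			i++ // skip the escaped character so `\\h` is not misread as `\h`
		case strings.HasPrefix(pattern[i:], "(?~"):
			found = append(found, `(?~...) (absence operator)`)
		}
	}
	return found
}

func main() {
	for _, p := range []string{`0x\h+`, `(?~\*/)`, `\g<expr>`, `[a-z]+`} {
		fmt.Printf("%-10s -> %v\n", p, FindOnigOnly(p))
	}
}
```

Running something like this over the syntaxes bundled with bat or Sublime would show how often each construct is actually used, which in turn says how much of Oniguruma a Go regexp parser would really need to cover.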