linguist
linguist copied to clipboard
Syntax highlighting is broken for some languages
Opening this to have a referable issue.
As @Alhadis reported there is an issue with separate syntax highlighting engine that is causing syntax highlighting issues…
Something's changed with GitHub's handling of TextMate-powered highlighting. There's now a [runaway-match issue][1] with my [
.gitconfig
grammar][2], and I distinctly recall seeing no issues with it on GitHub until relatively recently. Compare GitHub's rendering of the file to that of Atom's:
GitHub | Atom |
---|
This doesn't appear to be an issue with regex engine flavours, or how recent they are. I can see @jtbandes's [example][3] is affected by the same highlighting failing to terminate at the
end
pattern specified by the TextMate grammar.
This is not an issue with linguist and not likely to be an issue with the grammars we bundle.
An issue has been opened with the team that manages the highlighting engine and I'll keep this issue updated as I hear more.
After looking into the reported examples a bit, I believe one possible root cause of these issues is that (?!\G)
is not working in end
patterns.
This would explain the following issues:
Swift
class A: B {
let x: Int
}
protocol Foo {
associatedtype T: Equatable
var x: String { get }
}
protocol Foo {
associatedtype T = Int
var x: String { get }
}
protocol Foo {
associatedtype T: Equatable = Int
var x: String { get }
}
- Reported at: https://github.com/jtbandes/swift-tmlanguage/issues/8, https://github.com/jtbandes/swift-tmlanguage/issues/10, https://github.com/jtbandes/swift-tmlanguage/issues/12, https://github.com/jtbandes/swift-tmlanguage/issues/13
-
(?!\G)
used here: https://github.com/jtbandes/swift-tmlanguage/blob/f839cb6fde7c5afa1e08c38625c95b80bb05c626/Swift.tmLanguage.yaml#L880
gitconfig
[good]
b = c
[bad]
- Reported at: https://github.com/github-linguist/linguist/pull/6603#issuecomment-1867175606
-
(?!\G)
used here: https://github.com/Alhadis/language-etc/blob/f3e265b2c155eded4a1808d4b75136ffbd6406b5/grammars/gitconfig.cson#L317
TOML
[good]
b = c # comment
[bad]
- Reported at: https://github.com/github-linguist/linguist/pull/6603#issuecomment-1868001787
-
(?!\G)
used here: https://github.com/textmate/toml.tmbundle/blob/e82b64c1e86396220786846201e9aa3f0a2d9ca2/Syntaxes/TOML.tmLanguage#L46-L47
Here are a couple others which I'm not able to trace to a specific instance of (?!\G)
, so there might be other types of end
patterns that also don't work... or maybe the problem is that \G
isn't working in general?
class Foo<A, B> where A: Identifiable, B: Identifiable {
var x: A
}
- Reported at: https://github.com/jtbandes/swift-tmlanguage/issues/11
- Not entirely sure if this maps to the same issue. The patterns used here are a bit complex and
\G
is used in bothbegin
andend
. - Relevant patterns: https://github.com/jtbandes/swift-tmlanguage/blob/f839cb6fde7c5afa1e08c38625c95b80bb05c626/Swift.tmLanguage.yaml#L735
macro IntroduceArbitraryPeers<T>(_: T) = #externalMacro(...)
macro IntroduceArbitraryPeers<T>(_: T) = #externalMacro(...)
(...)
macro IntroduceArbitraryPeers<T>(_: T) = #externalMacro(...)
macro IntroduceArbitraryPeers<T>(_: T) = #externalMacro(...)
- Reported at: https://github.com/jtbandes/swift-tmlanguage/issues/9
- Not sure if this relates to the usage of
\G
or some other end pattern not working. Could be related to this end pattern: https://github.com/jtbandes/swift-tmlanguage/blob/f839cb6fde7c5afa1e08c38625c95b80bb05c626/Swift.tmLanguage.yaml#L1614
let SE0168 = """ illegal
foo (this part should not be illegal)
illegal"""
- Reported at: https://github.com/jtbandes/swift-tmlanguage/issues/7
-
\G
is present, but the pattern usesmatch
, notbegin
/end
. Relevant patterns: https://github.com/jtbandes/swift-tmlanguage/blob/f839cb6fde7c5afa1e08c38625c95b80bb05c626/Swift.tmLanguage.yaml#L2282-L2283
@lildude Just an FYI, \G
and (?!\G)
are two of the most important idioms in TextMate grammars, as they're essentially all authors have by way of maintaining state and structure, especially in multiline patterns.
It might be that the location of the last match normally dereferenced by \G
is getting reset between pattern lookups.
Not sure if this is the right place but tons of problems with Swift files recently...
@Kaspik yes, it was the PR that changed the Swift grammar which first brought this to light.
I can also report a recent issue in C++ highlighting: https://github.com/textmate/c.tmbundle/issues/79
One-line comments now cause the next token to be ignored, which can cause catastrophic failures:
// Here is a one-line comment
/*
* This is a multi-line comment that's
* broken because the opening token was missed.
* Now because of the apostrophe, everything
* looks like a character string.
*/
int main();
versus
/*
* This is a multi-line comment that's
* not broken.
*/
int main();
The former currently renders as:
EDIT: it does look like textmate C.plist is also using (?!\G)
for its comments: https://github.com/textmate/c.tmbundle/blob/80c8e886b67227096a56aca6b92fe1587f76d603/Syntaxes/C.plist#L684
Also broken for CMake:
# This is a comment
if(${higlighting} STREQUAL "it works")
endif()
versus
# This is a comment
if(${higlighting} STREQUAL "it works")
endif()
Thanks, everyone, for your reports in this thread, and for your patience with these regressions. The issue is indeed the handling of ?!\G
tokens; these were provided by a custom fork of PCRE v1, made by someone who’s since departed the company, that had been bitrotting for nine years (and unsurprisingly did not work on Apple Silicon). We’re currently investigating how we can support these tokens without sacrificing speed or maintainability; in the meantime, we’re moving as many languages as we can to use tree-sitter based highlighting, which does not suffer from these problems and provides a noticeable speedup in the blob view (a win all around).
in the meantime, we’re moving as many languages as we can to use tree-sitter based highlighting
Uh… what languages, exactly? How many? I'm interested in how many of my TextMate grammars might be substituted for a lower-quality or buggier tree-sitter version…
FWIW, VS Code seems to support these patterns. Perhaps if a new regex/scoping implementation is needed then that team could help with integrating theirs, which seems to be actively maintained: https://github.com/microsoft/vscode-textmate
Uh… what languages, exactly? How many? I'm interested in how many of my TextMate grammars might be substituted for a lower-quality or buggier tree-sitter version…
And unfortunately, it’s probably many of the highest-quality TM grammars that will need to be replaced, if these TM patterns are no longer supported 🙁
Also, I forgot to ask: why is there an outdated fork of PCRE being used instead of, y'know, something contemporary? This isn't a fault with the TextMate grammar format itself, but the regex engine that powers its pattern-matching. Portable grammars (or at least those that I've written) can leverage a subset of regex features that's supported by both modern PCRE, Perl, and Oniguruma. At no point should a fork be needed…
@Alhadis We have, for some time, been using tree-sitter for C, C#, CSS, Elixir, Gleam, Go, HTML, Java, JavaScript, JSDoc, Nix, PHP, Python, CodeQL, regular expressions, Ruby, Rust, TLA+, and TypeScript. This week I’m hoping to get to C++, .gitconfig and .gitmodules, XML, and R.
We were using an old PCRE fork because, as far as I can tell, neither PCRE1 nor PCRE2 provides equivalents of ONIG_OPTION_NOTBOS
, ONIG_OPTION_NOTEOS
, and ONIG_OPTION_NOTGPOS
. As you can tell from the regressions, certain grammars depended on these flags. (VS Code forks and vendors its own version of Onigurama for this reason.)
As to the technical reasons behind this change, tree-sitter grammars are simply a better fit for the GitHub highlighting service. TextMate grammars made sense for their original use-case, but for the purpose of our highlighting service (a non-interactive, stateless RPC server, very different from any text editor), regular expressions are not competitive, speed-wise, with tree-sitter’s generated C, and speed is one of our most important desiderata, given the scale at which this service has to operate. The speed increase when switching from a TM to TS grammar is noticeable from an end-user perspective. However, as I mentioned above, we’re looking into how we can add back support for \G
and ?!\G
qualifiers as a stopgap measure.
@jtbandes If you see regressions, we’d be grateful if you’d report them so we can improve the whole tree-sitter ecosystem! VS Code is actually a bit of an exception w/r/t its continued use of TextMate grammars; neovim, Pulsar, Zed, and now Emacs are all using tree-sitter for highlighting. tree-sitter is very much the lingua franca at GitHub for anything parsing-related, and that’s unlikely to change. For what it’s worth, we’ve been very happy with the capabilities of tree-sitter for syntactically rich languages like Ruby (and these syntax highlighting queries are much more concise than the equivalent .tmlanguage JSON files; Ruby’s is only a few hundred SLOC). Tree-sitter grammars also compose well, which is great for languages like TSX where you have multiple grammars/highlighters that need to work in concert.
Thanks for the update on your continued work on this @patrickt. You mentioned a few languages on your tree-sitter migration roadmap, but I noticed that Swift (which the original issue was reported about) is not on the list. Where can I follow your progress for the upcoming work?
@tevelee Oops, I forgot to mention: tree-sitter-swift integration landed today. There are a few regressions, but also several bug fixes (e.g. the ones related to multiline strings). I’m working with someone on the Swift team to get to parity (and perhaps beyond!) what we had with the TextMate grammar.
@patrickt Is the Swift stuff using https://github.com/alex-pinkus/tree-sitter-swift or a different grammar? Let me know if any help with that migration is needed — while I have not worked on the tree-sitter side (yet!), I have intimate knowledge of the TM grammar and recently updated it with support for highlighting regex literals, as well as some other newer language features which I believe the tree-sitter grammar currently does not support. If I can be of assistance, drop me an email or DM.
@patrickt Hypothetically, if one were to write a proof-of-concept that demonstrated how TextMate grammars could be used to provide a structured parser grammar based on a combination of their regex patterns and properties associated with scopes that capture groups are tagged with… would that be something of interest to GitHub?
@jtbandes Thank you for your help! We’re using the @alex-pinkus
version, yes, and I believe (though don’t quote me on this) he and we have been talking about making it the official Swift grammar in the tree-sitter org. Highlighting of components in regex literals are exactly the kind of thing I don’t want to regress—though it’s EOD for me, so that’s a job for tomorrow morning.
@Alhadis I think that sounds extremely cool, and I’d like to see your results, but if by “of interest to GitHub” you mean “sufficient for us to prefer TextMate grammars over tree-sitter grammars,” then I’m afraid that the answer is no. If there are missing syntax-highlighting features or developer amenities in tree-sitter as compared to what TextMate provides, then that’s a bug in tree-sitter that we'd very much like to fix.
but if by “of interest to GitHub” you mean “sufficient for us to prefer TextMate grammars over tree-sitter grammars,” then I’m afraid that the answer is no.
No, I mean “sufficient enough to disincentivise mass-migration of existing TextMate grammars to potentially inferior tree-sitter versions”.
If there are missing syntax-highlighting features or developer amenities in tree-sitter as compared to what TextMate provides, then that’s a bug in tree-sitter that we'd very much like to fix.
It's not a bug in tree-sitter. It's the very design of the system itself that feels over-engineered and extremely hostile to grammar developers. I don't doubt that tree-sitter is performant and well-optimised to GitHub's needs, but TextMate's simplicity and flexibility are its key strengths to a grammar author. Moreover, it's not bound by the restrictions of a formal parser grammar, and it lends itself well to informal, unstructured formats like logs and program output.
Be cool man, let's leave this thread to report issues with the grammar.
Thanks @patrickt for the work you put in to syntax highlighting and for the detailed update! It would've been very easy to leave us in the dark so your posts are much appreciated.
No, I mean “sufficient enough to disincentivise mass-migration of existing TextMate grammars to potentially inferior tree-sitter versions”.
I’m afraid not, no. I appreciate your concerns and sympathize with your frustration, but this decision is set in stone at this point. The set of trade-offs with which we are working render us willing to accept temporary regressions (all of which, in my experience, have been trivially fixable) for the several axes on which tree-sitter provides concrete improvements.
MATLAB source is also having issues in that GitHub classifies anything in the document after the first %
comment to be a comment.
Review of matlab code is quite challenging on GitHub:
Hey, everyone 👋 Thanks to some absolutely heroic work on the part of @dcreager, we’ve addressed the regressions people have been observing. Please let us know if you spot any further infelicities, and thank you all very much for your patience while we fixed these issues.
@patrickt I was literally about to point out how the regression appears to be fixed. Thanks for all your hard work!
Thanks for your hard work. I’m sad to see Swift.tmLanguage go since it’s my baby, but I guess this will put more pressure on tree-sitter-swift to support all the language features. (Just look at all these things that are no longer highlighted: https://github.com/jtbandes/swift-tmlanguage/blob/main/grammar-test.swift) Maybe I will get around to contributing to it some day.
@patrickt As of now, still seeing some issues with Swift files (like 'lazy', or 'dataSource' / 'delegate'), but overall seems much better! (or current PR just has no specific code that was problematic 😄 )
@Kaspik Those issues would be due to the switch from Swift.tmLanguage to tree-sitter-swift.
Interestingly, I'm still seeing the Swift tmLanguage grammar being used in some places. It looks like it's used for diffs in Markdown files but not Swift files: https://gist.github.com/jtbandes/34a7a9fcc972f1d4d3f3ca6a39b86ccd/revisions
@jtbandes Keen eye regarding Markdown—yes, the fact that embedded Markdown goes through the legacy TextMate path is an implementation detail we hope to fix someday (right now Markdown itself does not go through tree-sitter, which would be a first step, but after that it also requires some thought re. implementation). Thank you also for the Swift grammar test file you’ve written—I hadn’t seen this before, and this will greatly assist getting tree-sitter-swift
to parity with what we had in Markdown. I’ve already landed a few fixes (here, here), and am going to knock more out tomorrow and also update the grammar GitHub-side. Hoped to get to that today but, as always, so much to do.
I think having Markdown done through old implementation is actually a good thing because you have direct comparison what's needed and left to be improved in the implementation. The Markdown code above is much easier to read than the Swift file. I wouldn't migrate Markdown till the Swift one fixed.