linguist icon indicating copy to clipboard operation
linguist copied to clipboard

Syntax highlighting is broken for some languages

Open lildude opened this issue 1 year ago • 7 comments

Opening this to have a referable issue.

As @Alhadis reported there is an issue with separate syntax highlighting engine that is causing syntax highlighting issues…

Something's changed with GitHub's handling of TextMate-powered highlighting. There's now a [runaway-match issue][1] with my [.gitconfig grammar][2], and I distinctly recall seeing no issues with it on GitHub until relatively recently. Compare GitHub's rendering of the file to that of Atom's:

GitHub screenshot Atom screenshot
GitHubAtom

This doesn't appear to be an issue with regex engine flavours, or how recent they are. I can see @jtbandes's [example][3] is affected by the same highlighting failing to terminate at the end pattern specified by the TextMate grammar.

This is not an issue with linguist and not likely to be an issue with the grammars we bundle.

An issue has been opened with the team that manages the highlighting engine and I'll keep this issue updated as I hear more.

lildude avatar Jan 02 '24 21:01 lildude

After looking into the reported examples a bit, I believe one possible root cause of these issues is that (?!\G) is not working in end patterns.

This would explain the following issues:

Swift

class A: B {
  let x: Int
}
protocol Foo {
  associatedtype T: Equatable
  var x: String { get }
}
protocol Foo {
  associatedtype T = Int
  var x: String { get }
}
protocol Foo {
  associatedtype T: Equatable = Int
  var x: String { get }
}
  • Reported at: https://github.com/jtbandes/swift-tmlanguage/issues/8, https://github.com/jtbandes/swift-tmlanguage/issues/10, https://github.com/jtbandes/swift-tmlanguage/issues/12, https://github.com/jtbandes/swift-tmlanguage/issues/13
  • (?!\G) used here: https://github.com/jtbandes/swift-tmlanguage/blob/f839cb6fde7c5afa1e08c38625c95b80bb05c626/Swift.tmLanguage.yaml#L880

gitconfig

[good]
b = c
[bad]
  • Reported at: https://github.com/github-linguist/linguist/pull/6603#issuecomment-1867175606
  • (?!\G) used here: https://github.com/Alhadis/language-etc/blob/f3e265b2c155eded4a1808d4b75136ffbd6406b5/grammars/gitconfig.cson#L317

TOML

[good]
b = c  # comment
[bad]
  • Reported at: https://github.com/github-linguist/linguist/pull/6603#issuecomment-1868001787
  • (?!\G) used here: https://github.com/textmate/toml.tmbundle/blob/e82b64c1e86396220786846201e9aa3f0a2d9ca2/Syntaxes/TOML.tmLanguage#L46-L47

jtbandes avatar Jan 02 '24 21:01 jtbandes

Here are a couple others which I'm not able to trace to a specific instance of (?!\G), so there might be other types of end patterns that also don't work... or maybe the problem is that \G isn't working in general?

class Foo<A, B> where A: Identifiable, B: Identifiable {
  var x: A
}
  • Reported at: https://github.com/jtbandes/swift-tmlanguage/issues/11
  • Not entirely sure if this maps to the same issue. The patterns used here are a bit complex and \G is used in both begin and end.
  • Relevant patterns: https://github.com/jtbandes/swift-tmlanguage/blob/f839cb6fde7c5afa1e08c38625c95b80bb05c626/Swift.tmLanguage.yaml#L735
macro IntroduceArbitraryPeers<T>(_: T) = #externalMacro(...)
macro IntroduceArbitraryPeers<T>(_: T) = #externalMacro(...)
(...)
macro IntroduceArbitraryPeers<T>(_: T) = #externalMacro(...)
macro IntroduceArbitraryPeers<T>(_: T) = #externalMacro(...)
  • Reported at: https://github.com/jtbandes/swift-tmlanguage/issues/9
  • Not sure if this relates to the usage of \G or some other end pattern not working. Could be related to this end pattern: https://github.com/jtbandes/swift-tmlanguage/blob/f839cb6fde7c5afa1e08c38625c95b80bb05c626/Swift.tmLanguage.yaml#L1614
let SE0168 = """   illegal
        foo (this part should not be illegal)
    illegal"""
  • Reported at: https://github.com/jtbandes/swift-tmlanguage/issues/7
  • \G is present, but the pattern uses match, not begin/end. Relevant patterns: https://github.com/jtbandes/swift-tmlanguage/blob/f839cb6fde7c5afa1e08c38625c95b80bb05c626/Swift.tmLanguage.yaml#L2282-L2283

jtbandes avatar Jan 02 '24 22:01 jtbandes

@lildude Just an FYI, \G and (?!\G) are two of the most important idioms in TextMate grammars, as they're essentially all authors have by way of maintaining state and structure, especially in multiline patterns.

It might be that the location of the last match normally dereferenced by \G is getting reset between pattern lookups.

Alhadis avatar Jan 02 '24 22:01 Alhadis

Not sure if this is the right place but tons of problems with Swift files recently...

Screenshot 2024-01-04 at 18 24 24 Screenshot 2024-01-04 at 18 24 33

Kaspik avatar Jan 04 '24 17:01 Kaspik

@Kaspik yes, it was the PR that changed the Swift grammar which first brought this to light.

lildude avatar Jan 04 '24 18:01 lildude

I can also report a recent issue in C++ highlighting: https://github.com/textmate/c.tmbundle/issues/79

One-line comments now cause the next token to be ignored, which can cause catastrophic failures:

// Here is a one-line comment
/*
 * This is a multi-line comment that's
 * broken because the opening token was missed.
 * Now because of the apostrophe, everything
 * looks like a character string.
 */
int main();

versus

/*
 * This is a multi-line comment that's
 * not broken.
 */
int main();

The former currently renders as: Screenshot 2024-02-20 at 15 47 04

EDIT: it does look like textmate C.plist is also using (?!\G) for its comments: https://github.com/textmate/c.tmbundle/blob/80c8e886b67227096a56aca6b92fe1587f76d603/Syntaxes/C.plist#L684

sethrj avatar Feb 20 '24 20:02 sethrj

Also broken for CMake:

# This is a comment
if(${higlighting} STREQUAL "it works")
endif()

versus

# This is a comment

if(${higlighting} STREQUAL "it works")
endif()

sethrj avatar Feb 22 '24 20:02 sethrj

Thanks, everyone, for your reports in this thread, and for your patience with these regressions. The issue is indeed the handling of ?!\G tokens; these were provided by a custom fork of PCRE v1, made by someone who’s since departed the company, that had been bitrotting for nine years (and unsurprisingly did not work on Apple Silicon). We’re currently investigating how we can support these tokens without sacrificing speed or maintainability; in the meantime, we’re moving as many languages as we can to use tree-sitter based highlighting, which does not suffer from these problems and provides a noticeable speedup in the blob view (a win all around).

patrickt avatar Feb 27 '24 20:02 patrickt

in the meantime, we’re moving as many languages as we can to use tree-sitter based highlighting

Uh… what languages, exactly? How many? I'm interested in how many of my TextMate grammars might be substituted for a lower-quality or buggier tree-sitter version…

Alhadis avatar Feb 27 '24 21:02 Alhadis

FWIW, VS Code seems to support these patterns. Perhaps if a new regex/scoping implementation is needed then that team could help with integrating theirs, which seems to be actively maintained: https://github.com/microsoft/vscode-textmate

Uh… what languages, exactly? How many? I'm interested in how many of my TextMate grammars might be substituted for a lower-quality or buggier tree-sitter version…

And unfortunately, it’s probably many of the highest-quality TM grammars that will need to be replaced, if these TM patterns are no longer supported 🙁

jtbandes avatar Feb 27 '24 21:02 jtbandes

Also, I forgot to ask: why is there an outdated fork of PCRE being used instead of, y'know, something contemporary? This isn't a fault with the TextMate grammar format itself, but the regex engine that powers its pattern-matching. Portable grammars (or at least those that I've written) can leverage a subset of regex features that's supported by both modern PCRE, Perl, and Oniguruma. At no point should a fork be needed…

Alhadis avatar Feb 27 '24 22:02 Alhadis

@Alhadis We have, for some time, been using tree-sitter for C, C#, CSS, Elixir, Gleam, Go, HTML, Java, JavaScript, JSDoc, Nix, PHP, Python, CodeQL, regular expressions, Ruby, Rust, TLA+, and TypeScript. This week I’m hoping to get to C++, .gitconfig and .gitmodules, XML, and R.

We were using an old PCRE fork because, as far as I can tell, neither PCRE1 nor PCRE2 provides equivalents of ONIG_OPTION_NOTBOS, ONIG_OPTION_NOTEOS, and ONIG_OPTION_NOTGPOS. As you can tell from the regressions, certain grammars depended on these flags. (VS Code forks and vendors its own version of Onigurama for this reason.)

As to the technical reasons behind this change, tree-sitter grammars are simply a better fit for the GitHub highlighting service. TextMate grammars made sense for their original use-case, but for the purpose of our highlighting service (a non-interactive, stateless RPC server, very different from any text editor), regular expressions are not competitive, speed-wise, with tree-sitter’s generated C, and speed is one of our most important desiderata, given the scale at which this service has to operate. The speed increase when switching from a TM to TS grammar is noticeable from an end-user perspective. However, as I mentioned above, we’re looking into how we can add back support for \G and ?!\G qualifiers as a stopgap measure.

@jtbandes If you see regressions, we’d be grateful if you’d report them so we can improve the whole tree-sitter ecosystem! VS Code is actually a bit of an exception w/r/t its continued use of TextMate grammars; neovim, Pulsar, Zed, and now Emacs are all using tree-sitter for highlighting. tree-sitter is very much the lingua franca at GitHub for anything parsing-related, and that’s unlikely to change. For what it’s worth, we’ve been very happy with the capabilities of tree-sitter for syntactically rich languages like Ruby (and these syntax highlighting queries are much more concise than the equivalent .tmlanguage JSON files; Ruby’s is only a few hundred SLOC). Tree-sitter grammars also compose well, which is great for languages like TSX where you have multiple grammars/highlighters that need to work in concert.

patrickt avatar Feb 27 '24 22:02 patrickt

Thanks for the update on your continued work on this @patrickt. You mentioned a few languages on your tree-sitter migration roadmap, but I noticed that Swift (which the original issue was reported about) is not on the list. Where can I follow your progress for the upcoming work?

tevelee avatar Feb 27 '24 23:02 tevelee

@tevelee Oops, I forgot to mention: tree-sitter-swift integration landed today. There are a few regressions, but also several bug fixes (e.g. the ones related to multiline strings). I’m working with someone on the Swift team to get to parity (and perhaps beyond!) what we had with the TextMate grammar.

patrickt avatar Feb 27 '24 23:02 patrickt

@patrickt Is the Swift stuff using https://github.com/alex-pinkus/tree-sitter-swift or a different grammar? Let me know if any help with that migration is needed — while I have not worked on the tree-sitter side (yet!), I have intimate knowledge of the TM grammar and recently updated it with support for highlighting regex literals, as well as some other newer language features which I believe the tree-sitter grammar currently does not support. If I can be of assistance, drop me an email or DM.

jtbandes avatar Feb 27 '24 23:02 jtbandes

@patrickt Hypothetically, if one were to write a proof-of-concept that demonstrated how TextMate grammars could be used to provide a structured parser grammar based on a combination of their regex patterns and properties associated with scopes that capture groups are tagged with… would that be something of interest to GitHub?

Alhadis avatar Feb 27 '24 23:02 Alhadis

@jtbandes Thank you for your help! We’re using the @alex-pinkus version, yes, and I believe (though don’t quote me on this) he and we have been talking about making it the official Swift grammar in the tree-sitter org. Highlighting of components in regex literals are exactly the kind of thing I don’t want to regress—though it’s EOD for me, so that’s a job for tomorrow morning.

@Alhadis I think that sounds extremely cool, and I’d like to see your results, but if by “of interest to GitHub” you mean “sufficient for us to prefer TextMate grammars over tree-sitter grammars,” then I’m afraid that the answer is no. If there are missing syntax-highlighting features or developer amenities in tree-sitter as compared to what TextMate provides, then that’s a bug in tree-sitter that we'd very much like to fix.

patrickt avatar Feb 27 '24 23:02 patrickt

but if by “of interest to GitHub” you mean “sufficient for us to prefer TextMate grammars over tree-sitter grammars,” then I’m afraid that the answer is no.

No, I mean “sufficient enough to disincentivise mass-migration of existing TextMate grammars to potentially inferior tree-sitter versions”.

If there are missing syntax-highlighting features or developer amenities in tree-sitter as compared to what TextMate provides, then that’s a bug in tree-sitter that we'd very much like to fix.

It's not a bug in tree-sitter. It's the very design of the system itself that feels over-engineered and extremely hostile to grammar developers. I don't doubt that tree-sitter is performant and well-optimised to GitHub's needs, but TextMate's simplicity and flexibility are its key strengths to a grammar author. Moreover, it's not bound by the restrictions of a formal parser grammar, and it lends itself well to informal, unstructured formats like logs and program output.

Alhadis avatar Feb 27 '24 23:02 Alhadis

Be cool man, let's leave this thread to report issues with the grammar.

Thanks @patrickt for the work you put in to syntax highlighting and for the detailed update! It would've been very easy to leave us in the dark so your posts are much appreciated.

sethrj avatar Feb 27 '24 23:02 sethrj

No, I mean “sufficient enough to disincentivise mass-migration of existing TextMate grammars to potentially inferior tree-sitter versions”.

I’m afraid not, no. I appreciate your concerns and sympathize with your frustration, but this decision is set in stone at this point. The set of trade-offs with which we are working render us willing to accept temporary regressions (all of which, in my experience, have been trivially fixable) for the several axes on which tree-sitter provides concrete improvements.

patrickt avatar Feb 27 '24 23:02 patrickt

github_matlab_issue

MATLAB source is also having issues in that GitHub classifies anything in the document after the first % comment to be a comment.

psmskelton avatar Mar 07 '24 00:03 psmskelton

Review of matlab code is quite challenging on GitHub: image

rybalkoss avatar Mar 07 '24 09:03 rybalkoss

Hey, everyone 👋 Thanks to some absolutely heroic work on the part of @dcreager, we’ve addressed the regressions people have been observing. Please let us know if you spot any further infelicities, and thank you all very much for your patience while we fixed these issues.

patrickt avatar Mar 12 '24 17:03 patrickt

@patrickt I was literally about to point out how the regression appears to be fixed. Thanks for all your hard work!

Alhadis avatar Mar 12 '24 17:03 Alhadis

Thanks for your hard work. I’m sad to see Swift.tmLanguage go since it’s my baby, but I guess this will put more pressure on tree-sitter-swift to support all the language features. (Just look at all these things that are no longer highlighted: https://github.com/jtbandes/swift-tmlanguage/blob/main/grammar-test.swift) Maybe I will get around to contributing to it some day.

jtbandes avatar Mar 12 '24 17:03 jtbandes

@patrickt As of now, still seeing some issues with Swift files (like 'lazy', or 'dataSource' / 'delegate'), but overall seems much better! (or current PR just has no specific code that was problematic 😄 )

Screenshot 2024-03-12 at 18 59 51

Kaspik avatar Mar 12 '24 18:03 Kaspik

@Kaspik Those issues would be due to the switch from Swift.tmLanguage to tree-sitter-swift.

jtbandes avatar Mar 12 '24 18:03 jtbandes

Interestingly, I'm still seeing the Swift tmLanguage grammar being used in some places. It looks like it's used for diffs in Markdown files but not Swift files: https://gist.github.com/jtbandes/34a7a9fcc972f1d4d3f3ca6a39b86ccd/revisions

image

jtbandes avatar Mar 12 '24 18:03 jtbandes

@jtbandes Keen eye regarding Markdown—yes, the fact that embedded Markdown goes through the legacy TextMate path is an implementation detail we hope to fix someday (right now Markdown itself does not go through tree-sitter, which would be a first step, but after that it also requires some thought re. implementation). Thank you also for the Swift grammar test file you’ve written—I hadn’t seen this before, and this will greatly assist getting tree-sitter-swift to parity with what we had in Markdown. I’ve already landed a few fixes (here, here), and am going to knock more out tomorrow and also update the grammar GitHub-side. Hoped to get to that today but, as always, so much to do.

patrickt avatar Mar 12 '24 22:03 patrickt

I think having Markdown done through old implementation is actually a good thing because you have direct comparison what's needed and left to be improved in the implementation. The Markdown code above is much easier to read than the Swift file. I wouldn't migrate Markdown till the Swift one fixed.

Kaspik avatar Mar 13 '24 08:03 Kaspik