tree-sitter icon indicating copy to clipboard operation
tree-sitter copied to clipboard

Improve `tree-sitter highlight --check` for downstream use

Open clason opened this issue 2 years ago • 9 comments

#2340 added a --check flag that ensures that highlight queries only use captures from an official list. This is also useful for downstream projects (Neovim, Helix, Emacs) that have a custom(ized) set of captures, to ensure they at least align on a "common core". To be able to use this test in their CI, however, requires that the test can be customized by

  • [x] allowing to provide a list of custom captures in a config file that extends (or overrides?) the hard-coded "common core" list;
  • [x] providing a flag (also for tree-sitter highlight itself) that disables early termination on matching query (since Neovim -- and Zed? -- always process and apply all queries in order);
  • [ ] optionally also checking predicates and directives against a "common core" (to be agreed on, and similarly extensible with a custom list).

@patrickt

clason avatar Jul 16 '23 09:07 clason

that disables early termination on matching query

If I understand you right and you means a behavior that for highlighting logic the last match wins instead of the first match like in Tree-sitter's current highlighting implementation. Then I think Tree-sitter's implementation just needs to migrate to the same logic like in Neovim because the last match win strategy:

  1. Allows to combine queries in order that intuitively is in sync with overriding concept familiar for many.
  2. Allows easily combine queries that capture a context before other context consuming queries.
  3. Verified on practice in Neovim and Zed too.

ahlinc avatar Jul 16 '23 11:07 ahlinc

Yes, exactly! I wouldn't advocate unconditionally changing this since it would be quite a breaking change for current users of the crate and CLI, but I understand that you'd rather not support two different strategies (and am happy that you seem to prefer "our" strategy, too 😄 ). Nvim-treesitter has a large number of queries written for this strategy, so hopefully those would serve as a useful example to make adapting relatively straightforward.

I'm assuming (hopefully correctly) that my suggestion is relatively straightforward to implement: just wrap the lines that break out early in a guard: https://github.com/tree-sitter/tree-sitter/blob/2a277879ab38ca6d61c5558ea870116b7fab9653/highlight/src/lib.rs#L929-L936

If you want to switch, I'd recommend a three-step deprecation strategy:

  1. add the "last match wins" behavior as a flag, default false
  2. make the default true
  3. remove the flag (and the code).

clason avatar Jul 16 '23 11:07 clason

If you want to switch, I'd recommend a three-step deprecation strategy:

It's hard to follow for such plan because potentially all highlight.scm files may need changed across all repos at least in the org and we wouldn't be able to maintain two copies.

If to look on https://crates.io/crates/tree-sitter-highlight crate and its downloading statistic I suspect there are no more consumers of it except tree-sitter-cli.

ahlinc avatar Jul 16 '23 11:07 ahlinc

If to look on https://crates.io/crates/tree-sitter-highlight crate and its downloading statistic I suspect there are no more consumers of it except tree-sitter-cli.

Those are only the public consumers, though. I suspect there's a number of closed-source users... In particular, Github uses the crate internally, IIUC.

It's hard to follow for such plan because potentially all highlight.scm files may need changes and we wouldn't be able to maintain two copies.

Not if the behavior was opt-in, for people who already have updated queries. To be clear, I'm not advocating that the tree-sitter org officially supports both orders at the same time (with their queries). My original ask was just to allow users who already have queries assuming different ordering to use them with the CLI.

But if you'd rather rip off the bandaid in one go, that's fine with us, too, of course. We'd be happy to help with updating the queries based on our downstream ones (using upstream captures, of course).

I'd still recommend a two-step strategy then (omitting step 2, and accelerating the time frame), so we can use the CLI to verify our queries before upstreaming them.

clason avatar Jul 16 '23 11:07 clason

I would love to update the upstream queries to use the same convention as Neovim. I actually think Neovim's makes more sense. The current state of affairs is a bit confusing for people.

@dcreager @patrickt (and any other GH folks) Do you see any issue with changing the convention used by tree-sitter-highlight if Tree-sitter maintainers change the highlights.scm queries to use the new convention?

It basically amounts to reversing the order of certain patterns in a query.

maxbrunsfeld avatar Jul 19 '23 00:07 maxbrunsfeld

@clason I'm fine with making this change, but maybe we could do something even simpler, and simply switch to the last-match-wins behavior all the time.

Regardless though, making the captures configurable would be good.

maxbrunsfeld avatar Jul 19 '23 00:07 maxbrunsfeld

I didn't even consider that that would be acceptable, so big 👍 for just switching unconditionally. That would remove one of the biggest hurdles for sharing (adapted) queries among upstream and editors!

clason avatar Jul 19 '23 05:07 clason

@maxbrunsfeld is this change still on the docket?

clason avatar Dec 28 '23 16:12 clason

optionally also checking predicates and directives against a "common core"

This may not be possible to implement (at least not easily or in an extensible manner).

Other than that, we're only missing an exit code if non-standard captures are found.

ObserverOfTime avatar Apr 12 '24 18:04 ObserverOfTime