Tree-sitter 1.0 Checklist
In the not-too-distant future, I'd like to bump Tree-sitter's version to 1.0, indicating a greater degree of stability and completeness. After that I'd like to regenerate all of the parsers in the tree-sitter github org, and bump them to 1.0 as well. Before doing this, there are several important problems with the framework that I think should be fixed.
Tasks
- [x] Unicode character properties - Support ECMAScript unicode property escapes in regexes.
  - [x] Implement basic support for this construct (https://github.com/tree-sitter/tree-sitter/pull/906)
  - [x] Regenerate all parsers to use unicode property escapes, fix any bugs that surface
- [x] Partial Precedence Orderings - The integer precedence system makes some grammars shockingly difficult to maintain.
  - [x] Enhance the precedence system to allow precedences to be expressed in a pairwise partial ordering instead of requiring a total ordering based on integers. (https://github.com/tree-sitter/tree-sitter/pull/939)
  - [x] Update `tree-sitter-javascript` and `tree-sitter-typescript` to use this more flexible precedence scheme. Right now, the integer precedence system is making it very difficult to continue development of `tree-sitter-typescript` in particular, because of the mix of different conflicts between types and expressions.
  - Dynamic precedence should probably stay integer-only, for simplicity
- [x] Grammars with many fields, aliases - By historical accident, generated parsers use too small an integer type (`uint8_t`) for storing nodes' field and alias information. Parsers with large numbers of fields can cause integer overflows (https://github.com/tree-sitter/tree-sitter/issues/511)
  - [x] Start representing nodes' `production_id` as a `uint16_t` (https://github.com/tree-sitter/tree-sitter/pull/943)
  - [x] Strategy - Decide whether we're going to bother to maintain backward compatibility with old generated parsers. If so, the library code will need to become a bit more complicated in order to consume both binary formats.
  - [x] Grammars - Regenerate all the parsers with the new representation.
- [x] Fix issues with the `get_column` external scanner API (https://github.com/tree-sitter/tree-sitter/pull/978)
- [x] CLI Ergonomics
  - [x] Generate Rust bindings for parsers, and structure the Node.js bindings more consistently with the Rust ones (https://github.com/tree-sitter/tree-sitter/pull/948)
  - [x] In the `parse` command, auto-detect UTF-16 files and decode them accordingly. This will help Windows users who currently trip over the suggested `echo` command in the docs. (https://github.com/tree-sitter/tree-sitter/pull/2368)
  - [x] Support grammars defined as ECMAScript modules instead of CommonJS modules.
  - [x] Reduce Coupling to Node - Introduce some Tree-sitter specific `GRAMMAR_PATH` setting where the CLI will search for grammar modules, instead of relying on `node_modules` and `npm`.
- [ ] Mergeable Git Repos - Make it easier to collaborate on grammars by removing generated files from version control.
  - [x] CLI commands - Add new `pack` and `publish` subcommands to the Tree-sitter CLI, for uploading tarballs and compiled `.wasm` files to the GitHub releases API. https://github.com/tree-sitter/tree-sitter/issues/730#issuecomment-736018228
  - [ ] Cleanup - Remove generated files from all the grammar repos in the tree-sitter org
- [ ] Documentation
  - [x] Document the ability to match against supertypes in queries with the `expression/identifier` syntax.
  - [ ] Add more thorough explanations of LR conflicts, precedence, and dynamic conflict-resolution with GLR.
  - [ ] Make it clear how to use Tree-sitter for basic syntax highlighting without the `tree-sitter-highlight` Rust crate (just using tree queries directly).
  - [x] Document the `tags.scm` queries used for code navigation on GitHub. #660
  - [x] Create a CHANGELOG file and start maintaining it. #527
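
To make the "Partial Precedence Orderings" task concrete, here is a minimal sketch in C of how pairwise precedences could be resolved. It is not the implementation from the linked PR; the names (`PrecTable`, `prec_declare`, `prec_close`, `prec_compare`) and the fixed-size matrix are assumptions made for illustration. Each declaration says "a binds tighter than b", a transitive closure fills in the implied pairs, and two names with no path between them simply stay unordered instead of being forced onto one integer scale.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_NAMES 16

// Hypothetical table: higher[a][b] is true when `a` takes precedence over `b`.
typedef struct {
  bool higher[MAX_NAMES][MAX_NAMES];
  int count;  // number of precedence names in use (at most MAX_NAMES)
} PrecTable;

void prec_init(PrecTable *t, int count) {
  memset(t, 0, sizeof *t);
  t->count = count;
}

// Record one pairwise ordering: `a` binds tighter than `b`.
void prec_declare(PrecTable *t, int a, int b) {
  t->higher[a][b] = true;
}

// Warshall's algorithm: if a > k and k > b, then a > b.
void prec_close(PrecTable *t) {
  for (int k = 0; k < t->count; k++)
    for (int a = 0; a < t->count; a++)
      for (int b = 0; b < t->count; b++)
        if (t->higher[a][k] && t->higher[k][b]) t->higher[a][b] = true;
}

// Returns 1 if a > b, -1 if b > a, and 0 if the pair is unordered.
int prec_compare(const PrecTable *t, int a, int b) {
  if (t->higher[a][b]) return 1;
  if (t->higher[b][a]) return -1;
  return 0;
}
```

With four hypothetical names, declaring 0 > 1 and 1 > 2 implies 0 > 2 after `prec_close`, while name 3 stays unordered relative to the others, so a conflict involving it would still need an explicit declaration.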
Stretch Goals
I'm recording these here even though they are a bit less urgent.
- [ ] Incremental Parsing Perf - Enhance the external scanner API to allow for looser state comparisons, avoiding the catastrophic node-reuse failures seen in the HTML parser (https://github.com/tree-sitter/tree-sitter-html/issues/23)
  - [ ] Figure out if the new scanner function can be made optional (with the parser generator inspecting `scanner.c` to decide whether to link against a `_compare` function).
  - [ ] Update `tree-sitter-html` to use this API, improving its incremental performance
- [x] Native Library, WASM parsers - Add a compile-time option to link the C library against a standard WASM engine (V8, wasmtime, or wasmer). When this feature is enabled, allow the native library to load WASM parsers, marshaling the parse table into native memory, and using WASM execution only for the lexing phase. This will make it more useful to distribute parsers as pre-compiled `.wasm` files, instead of as C code. The performance cost should be small, because all of the expensive parsing operations will still be native. #1864
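
The overflow behind the "Grammars with many fields, aliases" task above is ordinary unsigned wraparound. A minimal sketch, using hypothetical helper names and an arbitrary example id of 300 (neither comes from the Tree-sitter source):

```c
#include <stdint.h>

// Hypothetical helpers: store an id at the old (8-bit) and new (16-bit) widths.
// Conversion to an unsigned type is defined as reduction modulo 2^N.
uint8_t store_id_u8(unsigned id) { return (uint8_t)id; }     // keeps the low 8 bits
uint16_t store_id_u16(unsigned id) { return (uint16_t)id; }  // keeps the low 16 bits
```

`store_id_u8(300)` yields 44 (300 mod 256), silently aliasing a different id, whereas `store_id_u16(300)` preserves the value; the wider type holds ids up to 65535.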
For anyone who is interested, please let me know if I've left important things off of this list ☝️ .
Reads like tag queries are not going to be a 1.0 feature?
An alternative to removing the generated files would be to let a CI bot push them automatically on master. Users could then create mergeable PRs, because they would never need to change any generated files. In this repo https://github.com/neovim/nvim-lspconfig/blob/master/.github/workflows/docgen.yml users commit changes to a configuration and a bot updates the documentation after each push to master.
I think https://github.com/tree-sitter/tree-sitter/issues/516 should also be addressed, even if the function is marked experimental? At least document the behavior.
I would suggest to reduce implicitness:
- Provide `dsl.js` as a regular file shipped with the `tree-sitter-cli` npm package and make it possible to require it as a regular JS library. This would make it easy to extend and would reduce confusion for IDEs and their auto-completion. It would also be good to keep the current behavior, where `dsl.js` is embedded in the tree-sitter binary, for grammar files that don't require `dsl.js` explicitly; that way the `tree-sitter` CLI can still be used as a fairly standalone tool. This would also make it possible to separate out `grammar.json` generation when the DSL is extended, and to debug that generation as a regular Node.js script.
- On the subject of tree-sitter's independence: it would be good for tree-sitter to have an embedded JS runtime (#465), with a fallback to a system `node.js` if this is requested explicitly by some CLI parameter. IMO a deno library looks promising.
I also noticed that *.so files always have zeros in the version suffix, like libtree-sitter.so.0.0. It would be good if the minimal ABI-compatible version were reflected in the *.so.X.X suffix somehow.
Note that the version number in those file names isn't the same as the 1.0 semver release that @maxbrunsfeld is proposing. If there are any backwards-incompatible changes as part of putting together this release, we'd bump the SOVERSION to 1; if not, we'd keep it at 0. More details can be found here.
@Razzeee Tag queries are already done, but you're right that we still need to document them. I envision those mostly being documented in a GitHub-specific context, since there isn't much generally useful functionality specific to Tags; it's mostly just a convention for tree queries that GitHub is using for code navigation. All of the broadly useful stuff has been generalized into the query system. I added that to the TODOs around documentation though.
I think #516 should also be addressed, even if the function is marked experimental?
~Yeah, you're right about that API being broken. I'm inclined to just address that for 1.0 by marking the function as half-baked. For our use cases, the API was only ever needed for the Haskell parser, and then we discontinued development of that parser because it was hard to find a good subset of the language that was amenable to parsing with a context-free grammar. It could definitely be made to work some day, but I think it's low-priority for us. There is still a bit of work to do to get it to play properly with incremental parsing.~
Never mind, this got fixed.
@Razzeee Tag queries are already done, but you're right that we still need to document them. I envision those mostly being documented in a GitHub-specific context, since there isn't much generally useful functionality specific to Tags; it's mostly just a convention for tree queries that GitHub is using for code navigation. All of the broadly useful stuff has been generalized into the query system. I added that to the TODOs around documentation though.
So you don't think tags make sense for others? I hoped that it would help move the queries towards the parser, so that multiple projects could consume and improve them.
I think #516 should also be addressed, even if the function is marked experimental?
Yeah, you're right about that API being broken. I'm inclined to just address that for 1.0 by marking the function as half-baked. For our use cases, the API was only ever needed for the Haskell parser, and then we discontinued development of that parser because it was hard to find a good subset of the language that was amenable to parsing with a context-free grammar. It could definitely be made to work some day, but I think it's low-priority for us. There is still a bit of work to do to get it to play properly with incremental parsing.
Understandable. Do I need to be worried about the incremental parsing bit? We moved our parser to use this on a regular basis and it seemed good, after figuring out why it always got stuck...
Nice stretch goals would be:
- https://github.com/tree-sitter/tree-sitter/pull/729
- https://github.com/tree-sitter/tree-sitter/issues/255
CLI commands - Add new `pack` and `publish` subcommands to the Tree-sitter CLI, for uploading tarballs and compiled `.wasm` files to the GitHub releases API.
This is awesome. Currently for Emacs, I have a custom package that compiles the grammar binaries for the 3 major platforms, and distributes them through GitHub Releases, in a single bundle. Having a standard tool for individual language package to do this on their own would be great.
Will the official language repositories start distributing these binaries through GitHub Releases as well? I think some GitHub actions on top of these subcommands would be very helpful for that.
@ubolonton I might not take on the automation of compilation and storage of binary files (except for wasm) right now. I was mostly planning to use GH releases to store tarballs of generated files like parser.c, to avoid having so many merge conflicts in development.
Add new pack and publish subcommands to the Tree-sitter CLI, for uploading tarballs and compiled .wasm files to the GitHub releases API.
~I find this item problematic; what about tree-sitter implementations that are not hosted on GitHub? What's the plan on how those should be redistributed?~
Never mind, I see now that this applies only to tree-sitters in this org.
@WhyNotHugo Yes, to confirm, the plan is not to mandate any particular hosting platform. Those commands will be able to produce the generated artifacts without uploading them as a GitHub release.
@Razzeee I think you're right that the get_column problem is important. It's especially relevant now that tree-sitter-haskell has been revived from the dead (thanks @tek). I believe I've addressed all of the problems with that API.
while I agree, feel it's disappointing that it needed that to happen. as there have been other grammars suffering from it. still, thank you ❤️
It would be awesome to automate the release process for all official tree-sitter tools, especially tree-sitter-cli, and for all official bindings (Wasm, Rust, Node.js, Python, Haskell, Ruby) and the Playground with its separately maintained parsers, and to keep them all in sync with the core tree-sitter library releases. This would help to reduce misunderstandings and situations where some things work in one place and not in another.
Versions
Bindings
Notes
- For now, installing tree-sitter-cli from the crate seems like a bad idea; the crate is stuck on a two-year-old version.
- #1122 (tree-sitter-highlight 0.19.2 does not compile with tree-sitter 0.19.5) demonstrates that changes in tree-sitter's Rust binding require bumping the version in all dependencies that use the changed parts. Otherwise there needs to be a CI check that tests that the latest version of each dependent can be built against all equal or higher versions of the dependency.
- I can't speak for all bindings, but the Node and Python bindings link statically to the tree-sitter core library, which means they lag behind the core library and don't receive core fixes and logic improvements at the same time. IMO that's an important reason why such updates need to be automated. This doesn't solve the problem of covering core library features, but at least bug fixes would be delivered in time.
I am not sure if this is actually possible - it would be also awesome if generated parser/runtime never segfaults. Showing errors, warnings, exiting - yes, but never segfaulting.
I am not sure if this is actually possible - it would be also awesome if generated parser/runtime never segfaults.
Obviously the library should never segfault. AFAIK, that's already the case. I think you're referencing https://github.com/tree-sitter/tree-sitter-c/issues/64, which I can't reproduce after stripping out third-party libraries.
If anyone is seeing Tree-sitter cause a segfault, and you can reproduce the problem, please report it.
For anyone who is interested, please let me know if I've left important things off of this list .
Add generation of bindings for the Zig programming language. It's a successor to the C language.
It provides a lot of safety features, like Rust, and perhaps more because of its runtime checks. It is very low-level, like C, but with the syntax, safety, and tooling of a modern language. Very fast (faster than C)
tree-sitter should provide means to replace memory allocation functions at runtime. This allows us to link to tree-sitter as a library instead of embedding it.
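
One common shape for this request is a pair of function pointers that default to the C allocator and can be swapped at runtime. The sketch below is illustrative only: `lib_set_allocator`, `lib_malloc`, `lib_free`, and the counting allocator are hypothetical names, not Tree-sitter's API.

```c
#include <stddef.h>
#include <stdlib.h>

// Current allocation hooks; they default to the C standard library.
static void *(*current_malloc)(size_t) = malloc;
static void (*current_free)(void *) = free;

// Hypothetical setter: passing NULL for a hook restores the default.
void lib_set_allocator(void *(*new_malloc)(size_t), void (*new_free)(void *)) {
  current_malloc = new_malloc ? new_malloc : malloc;
  current_free = new_free ? new_free : free;
}

// Library code would route every allocation through the hooks.
void *lib_malloc(size_t size) { return current_malloc(size); }
void lib_free(void *ptr) { current_free(ptr); }

// Example host-side allocator that counts live allocations.
static int live_allocations = 0;
void *counting_malloc(size_t size) { live_allocations++; return malloc(size); }
void counting_free(void *ptr) { if (ptr) live_allocations--; free(ptr); }
int live_allocation_count(void) { return live_allocations; }
```

A host application would call `lib_set_allocator(counting_malloc, counting_free)` once at startup; because the hooks live inside the library, this works the same whether the library is embedded or linked as a shared object.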
+1 for better error messages. related comment
Native Library, WASM parsers: I would love to use wasm in other runtimes. Currently I am only able to use wasm in JS, but I want to use it in wasmer, and I don't want to use the C version because the same parser is run in different runtimes.
For the wasm target, how about wasm-bindgen, which can generate Rust and TypeScript bindings at the same time? TypeScript typings are really useful when working with the VSCode LSP (Language Server Protocol).
When do you plan to document parsing and query performance, or some of the internal data structures, so that those can be estimated in complexity classes, i.e. O(1|log n|n|..)?
The implementation docs also look relatively incomplete, so they probably belong on the TODO list as well?
Background is that I wanted to check the datastructure + complexity of a query on editing. I was checking query.c and found https://github.com/tree-sitter/tree-sitter/blob/1fe0420f0fe3df633c4f8b80d0c991a7aa214eeb/lib/src/query.c#L672
// The entries are sorted by the patterns' root symbols, and lookups use a
// binary search. This ensures that the cost of this initial lookup step
// scales logarithmically with the number of patterns in the query.
which sounds good to mention in the docs. However, this only mentions the "initial lookup step" and not the overall worst-case performance per pattern match, or it is unfortunately worded.
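
For readers who have not opened `query.c`, the lookup that the comment describes can be sketched as follows. This is not the code from that file; `PatternEntry`, its fields, and `first_entry_for_symbol` are assumed names. The point is only that finding the first pattern for a given root symbol takes O(log n) comparisons in the number of entries; it says nothing about the cost of the matching that follows.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical entry: one per pattern, kept sorted by root_symbol.
typedef struct {
  uint16_t root_symbol;
  uint16_t pattern_index;
} PatternEntry;

// Lower-bound binary search: returns the index of the first entry whose
// root_symbol equals `symbol`, or -1 if no pattern starts with that symbol.
int first_entry_for_symbol(const PatternEntry *entries, size_t count, uint16_t symbol) {
  size_t lo = 0, hi = count;
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (entries[mid].root_symbol < symbol) lo = mid + 1;
    else hi = mid;
  }
  if (lo < count && entries[lo].root_symbol == symbol) return (int)lo;
  return -1;
}
```

Patterns that share a root symbol sit next to each other in the sorted array, so a caller can walk forward from the returned index until the symbol changes.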
Reason was that I wanted to check out of curiosity, how performance compares to this approach for bracket pair colorizing.
Forgive me if this is addressed elsewhere, but as a newcomer to tree-sitter, I'd love to be able to implement external scanners in a language like Rust, Nim, Zig, or something along those lines, rather than C/C++.
(Wonderful project by the way, thanks for everything!)
It should be possible if you export the same C ABI. I'd advise against it if possible though, because it imposes additional build dependencies when compiling the grammar.
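
Concretely, "the same C ABI" means exporting the five external scanner entry points with C linkage. The sketch below shows them in C for a grammar hypothetically named `mylang`; a Rust or Zig scanner would export functions with the same names and signatures. `TSLexer` is left opaque here so the block stands alone (a real scanner includes `tree_sitter/parser.h`), and the bodies, a single `depth` counter and a `scan` that never matches, are placeholders.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef struct TSLexer TSLexer;  // opaque stand-in; normally from tree_sitter/parser.h

typedef struct { unsigned depth; } Scanner;  // hypothetical scanner state

void *tree_sitter_mylang_external_scanner_create(void) {
  return calloc(1, sizeof(Scanner));
}

void tree_sitter_mylang_external_scanner_destroy(void *payload) {
  free(payload);
}

// Copy the state into the buffer and return the number of bytes written.
unsigned tree_sitter_mylang_external_scanner_serialize(void *payload, char *buffer) {
  memcpy(buffer, payload, sizeof(Scanner));
  return sizeof(Scanner);
}

// Restore the state; any unexpected length resets to the initial state.
void tree_sitter_mylang_external_scanner_deserialize(void *payload, const char *buffer,
                                                     unsigned length) {
  Scanner *scanner = payload;
  scanner->depth = 0;
  if (length == sizeof(Scanner)) memcpy(scanner, buffer, sizeof(Scanner));
}

// Placeholder: recognizes no external tokens.
bool tree_sitter_mylang_external_scanner_scan(void *payload, TSLexer *lexer,
                                              const bool *valid_symbols) {
  (void)payload; (void)lexer; (void)valid_symbols;
  return false;
}
```

The build-dependency concern still applies: whatever language implements these symbols, its toolchain has to be available wherever the grammar is compiled.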
Native Library, WASM parsers - Add a compile-time option to link the C library against a standard WASM engine (V8, wasmtime, or wasmer). When this feature is enabled, allow the native library to load WASM parsers, marshaling the parse table into native memory, and using WASM execution only for the lexing phase. This will make it more useful to distribute parsers as pre-compiled .wasm files, instead of as C code. The performance cost should be small, because all of the expensive parsing operations will still be native.
I'm writing a Rust application that is compiled to wasm32-unknown-unknown and uses Tree-sitter for parsing.
Right now I use Rust bindings to web-tree-sitter (which is: Javascript bindings to Tree-sitter compiled to wasm32-unknown-emscripten). It would be really nice if I could compile Tree-sitter library and parsers to wasm32-unknown-unknown.
Then I could link them with other Rust-to-WASM compiled stuff, and this way get rid of JavaScript-as-a-proxy between my WASM and Tree-sitter WASM.
- [ ] Native Library, WASM parsers - Add a compile-time option to link the C library against a standard WASM engine (V8, wasmtime, or wasmer). When this feature is enabled, allow the native library to load WASM parsers, marshaling the parse table into native memory, and using WASM execution only for the lexing phase. This will make it more useful to distribute parsers as pre-compiled `.wasm` files, instead of as C code. The performance cost should be small, because all of the expensive parsing operations will still be native.
To clarify, do you want this to be WASM engine implementation agnostic, as per your link to wasm-c-api, or is it fine to just embed a specific WASM engine? I can think of no reason to stay agnostic off the top of my head, unless you have a compelling use-case.
This might go against the inner design or the philosophy, but from the standpoint of a parser author for a whitespace-sensitive language, it would be great to have some first-class support for whitespace matching (indentation, dedentation, newlines) when writing a parser.
Edit: it would be great to have functions like these in the grammar DSL:
- `same_line()` makes sure the rules passed must be on the same line
- `same_indent()` makes sure the rules passed must be at the same indentation (these both imply `seq()`)
Also, some functionality to perform lookahead and lookbehind in the grammar DSL and/or regexes would be incredible.