
Hard-coded grammars are difficult to maintain

Open ubolonton opened this issue 2 years ago • 2 comments

Since compiled grammar binaries come from other sources, they can quite easily become out-of-sync with the hard-coded grammars.

It's better for the compiled form and the source form (whether json or sexp) of a grammar to be distributed together. IMO there are 2 options for this:

  1. tree-sitter would modify the generated code to embed the source forms (json) in the compiled binaries, and expose additional APIs to extract them. elisp-tree-sitter would wrap these APIs to provide the Lisp forms.
  2. tree-sitter-langs would modify its build tools to distribute the source forms together with the compiled forms (either embedded, or in separate files). elisp-tree-sitter would provide additional APIs to extract them, in Lisp forms.

I think 1. is the cleaner long-term solution, and we should discuss it with upstream. In parallel, we can work on 2.
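
To make option 2 concrete, here is a rough sketch of the "separate files" variant, written in Python only to keep it language-neutral. It assumes each compiled grammar (say `python.so`) ships next to a sidecar file named `python.grammar.json` containing the `grammar.json` that tree-sitter generates. The sidecar naming scheme and both function names are hypothetical, not existing tree-sitter-langs or elisp-tree-sitter APIs.

```python
import json
from pathlib import Path


def sidecar_path(binary: Path) -> Path:
    """Map a compiled grammar to its sidecar, e.g. python.so -> python.grammar.json.

    The ".grammar.json" suffix is an assumed convention for this sketch.
    """
    return binary.with_name(binary.stem + ".grammar.json")


def load_grammar_source(binary: Path) -> dict:
    """Load the grammar source distributed beside a compiled grammar binary.

    Raises FileNotFoundError if the bundle has no sidecar, and ValueError if
    the sidecar describes a different language than the binary's file name.
    """
    with sidecar_path(binary).open(encoding="utf-8") as f:
        grammar = json.load(f)
    # tree-sitter's generated grammar.json records the language in "name".
    if grammar.get("name") != binary.stem:
        raise ValueError(
            f"{sidecar_path(binary).name} describes {grammar.get('name')!r}, "
            f"expected {binary.stem!r}"
        )
    return grammar
```

Because the binary and the source travel in the same bundle, a consumer never has to guess which grammar revision a binary was built from. The embedded variant would expose the same lookup, just backed by a symbol in the shared library instead of a file.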

What do you think?

This project looks very cool, by the way.

ubolonton, Jan 24 '22 15:01

Side note: I noticed this:

> tree-sitter-langs does not always have the most up-to-date grammars and is missing some languages. If this continues to be an issue a fork may be needed.

I think keeping the grammars up-to-date should be a community effort. I probably haven't done a good job organizing that. Grammar update PRs are welcome though. Relatedly, do you think the current way tree-sitter-langs uses git submodules to track grammars is too much of a hassle?

ubolonton, Jan 24 '22 15:01

Hey ubolonton!

This is something I admittedly haven't thought about, but I completely agree that it will become an issue if left unaddressed.

Point 1 would be ideal. I've opened an issue upstream about some of the problems I've run into for this use case (which sadly has not received a response), but maybe we can add this onto that issue. Not sure how I can get more visibility on it, though...

If we could add an API for this to tree-sitter-langs that would be awesome :).

I've been seriously considering using/requiring forked grammars for languages, as it's becoming painfully clear that tree-sitter grammars are not being designed for this use case.

E.g. in Python, types introduce unnecessary nesting, C hides the true types of fields, and so on. I've been trying to hack around these issues in the hope of not needing to diverge from the community's grammars, but I think I'm reaching the limits of what can be hacked around.

I think this use case is far more powerful and game-changing than enabling highlighting etc., so my hope is to fork and clean up the grammars in a way that supports both structural editing and the highlighting/querying cases, and eventually merge back into the community (possibly after some new TS APIs are exposed).

Not really directly related, but I'm curious what your thoughts on forking are, and what changes to tree-sitter could be made to better support structural editing. This is also why I'm looking forward to installation of arbitrary parsers via GitHub.

> This project looks very cool, by the way.

Thank you! Couldn't have done it without your packages ;)

> I think keeping the grammars up-to-date should be a community effort. I probably haven't done a good job organizing that. Grammar update PRs are welcome though. Relatedly, do you think the current way tree-sitter-langs uses git submodules to track grammars is too much of a hassle?

My personal opinion is that it may be better to try to move away from a central language package (which puts a lot of burden on you to maintain!) and just let people install grammars as they wish via the method in the issue above. Maybe some sort of highlights API would be needed for packages to provide that as well.

Perhaps once that issue is resolved you could replace the submodules with whatever the new method is: (tsc-require "github/tree-sitter-python.."), although that would require users to build all their grammars (they build quite fast, though).

Then a grammar update could be as simple as bumping a hash, and that would start the transition away from a central language authority.
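
As a rough illustration of "bumping a hash" (in Python just to keep it language-neutral): grammars could be tracked in a small pin table instead of submodules, so that an update touches exactly one revision string. `GrammarPin`, `bump`, the repository name and the revision strings are all made up for this sketch and are not part of any existing package.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class GrammarPin:
    repo: str      # where the grammar source lives (illustrative value below)
    revision: str  # commit hash the grammar is built from


def bump(pins: dict[str, GrammarPin], lang: str, new_revision: str) -> dict[str, GrammarPin]:
    """Return a copy of `pins` with `lang` moved to `new_revision`.

    The input table is left untouched; unknown languages raise KeyError.
    """
    if lang not in pins:
        raise KeyError(f"no grammar pinned for {lang!r}")
    updated = dict(pins)
    updated[lang] = replace(pins[lang], revision=new_revision)
    return updated


# Placeholder revisions, not real commits.
pins = {"python": GrammarPin("tree-sitter/tree-sitter-python", "0000000")}
pins = bump(pins, "python", "1111111")
```

With a table like this, a grammar update PR is a one-line diff, and anyone can keep their own table without going through a central package.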

ethan-leba, Jan 24 '22 17:01

Since the expected way to use tree-sitter grammars now is to compile them locally, we're just layering a grammar post-processing step on top of that.

ethan-leba, Apr 07 '23 21:04