nixpkgs icon indicating copy to clipboard operation
nixpkgs copied to clipboard

python3Packages.tree-sitter-grammars: init at 0.22.5

Open sarcasticadmin opened this issue 1 year ago • 7 comments

Description of changes

As request in https://github.com/ngi-nix/ngipkgs/issues/139

This generates a python binding package for each tree-sitter-grammar that exists in tree-sitter.builtGrammars.

Things done

  • Built on platform(s)
    • [x] x86_64-linux
    • [ ] aarch64-linux
    • [ ] x86_64-darwin
    • [ ] aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • [ ] sandbox = relaxed
    • [ ] sandbox = true
  • [ ] Tested, as applicable:
  • [ ] Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • [x] Tested basic functionality of all binary files (usually in ./result/bin/)
  • 24.11 Release Notes (or backporting 23.11 and 24.05 Release notes)
    • [ ] (Package updates) Added a release notes entry if the change is major or breaking
    • [ ] (Module updates) Added a release notes entry if the change is significant
    • [ ] (Module addition) Added a release notes entry if adding a new NixOS module
  • [x] Fits CONTRIBUTING.md.

Add a :+1: reaction to pull requests you find important.

sarcasticadmin avatar Jun 18 '24 13:06 sarcasticadmin

For the record, this is already ready for review, even though it's marked as draft.

mightyiam avatar Jun 20 '24 10:06 mightyiam

The final commit message and PR title should be updated.

mightyiam avatar Jun 24 '24 10:06 mightyiam

I updated the final commit message. The PR title should be amended. @sarcasticadmin can do that.

mightyiam avatar Jun 24 '24 10:06 mightyiam

Did you consider taking a look at https://github.com/tree-sitter/py-tree-sitter before doing all the work here?

Yep, that's the first thing we've done and we are using it in the checkPhase

Taking a look at ngi-nix/ngipkgs#139 I'm honestly a bit confused by the shell.nix part of the request. Because you can just do something like: ... And get a shell with tree-sitter and the grammars you want, but I might be missing something.

This is right for exposing tree-sitter and tree-sitter-<lang> (grammars) to dev shells. What ngi-nix/ngipkgs#139 wants, is exposing Python bindings for multiple grammars to dev shells (original request)

Regarding armijnhemel/proximity_matcher_webservice and the extra context in https://discourse.nixos.org/t/need-help-enabling-grammars-in-treesitter-python/39500 from just googeling a bit, it seems like there is official tree-sitter python bindings https://github.com/tree-sitter/py-tree-sitter/ and we already have them packaged in nixpkgs.

py-tree-sitter is indeed in nixpkgs, but you'll need to load a grammar to it before you can parse anything (usage) What this PR adds, is to give py-tree-sitter in nixpkgs the ability to load any/all grammars exposed through tree-sitter.builtGrammars as python bindings Please note that Python bindings exposed through official tree-sitter org (it's mostly distributed through one of the maintainers' personal Pypi account) on Pypi is incomplete, 24 as of me writing this. With our way, we can expose 100+ bindings readily available in nixpkgs

It appears like our package needs a little additional work so you can configure grammars, but that should be an easy <15-min job since you just need to add the grammar python wheel to the pythonPath. (at least that's how I read the language section in the python binding readme

That's what we did in this PR: giving users the ability to get tree-sitter grammar python bindings, which will further allow py-tree-sitter to load the bindings and be able to parse code

stepbrobd avatar Jun 26 '24 03:06 stepbrobd

cc @doronbehar since you did a lot of work on tree-sitter python bindings in #316901 recently.

Janik-Haag avatar Jun 26 '24 18:06 Janik-Haag

I'm sorry but this PR is very confusing to me, and I tend to agree with most of @Janik-Haag's comments, and for sure using writeTextDir like that is way too messy.

I'm mostly confused because I didn't find any details in neither the discourse thread, nor at https://github.com/ngi-nix/ngipkgs/issues/139 and neither here about the target development setup. For example I wonder:

  • What text editor do you use?
  • Many text editors support installing such Grammers via the editor's package manager, why is that not an option? In the case of Neovim for example, we have an interface for that for sure (example usage). It is purely documented but still it is there.

I'm not familiar with the ecosystem of all of these python packages, including many of those I added in https://github.com/NixOS/nixpkgs/pull/316901 , so I won't be able to help a lot.

doronbehar avatar Jun 26 '24 22:06 doronbehar

Thanks. This is pretty much what I imagined in https://discourse.nixos.org/t/need-help-enabling-grammars-in-treesitter-python/39500/7?u=jtojnar.


@Janik-Haag

No, it does not, I think? You aren't touching the py-tree-sitter package, nor are you using py-tree-sitter in your code.

The grammars have nothing to do with py-tree-sitter package – the package just contains bindings to tree-sitter library itself, I am pretty sure.

And as you show in your example – you just pull in tree_sitter_javascript from PYTHONPATH. So I do not think the tree_sitter bindings module does need to be aware of the grammars at build time, passing the result of language() function to Language should be enough.

Yes, that's the point I made, and you aren't adding support for it here.

Again, no support is necessary in py-tree-sitter, we just need the grammar bindings, which is what this PR provides as mentioned above. It is creating them from scratch because most grammars are not actually available on PyPI. Also those that are available there are using vendored grammars and vendoring is generally undesirable.

All you would have needed to do is a small patch like:

I do not think this will do anything that could not be achieved by just adding the extraGrammars on Python path in the consumer. The grammars mentioned in the py-tree-sitter expression are only used for tests.

How can you be certain that your bindings work, you barely do any testing, there is a bunch of work going into this upstream to do this properly. And you are basically duplicating out of tree efforts like https://github.com/grantjenks/py-tree-sitter-languages, but with worse support and worse testing.

Yes, testing is an issue

You can also just use pkgs.python3Packages.tree-sitter-languages instead of pkgs.python3Packages.tree-sitter in the example above (and replace the language and parser definition) and get access to all the languages.

but tree-sitter-languages is a dead end as I mentioned in https://discourse.nixos.org/t/need-help-enabling-grammars-in-treesitter-python/39500/7?u=jtojnar

The only other alternative to have all those grammars available from Python with the current API (to my knowledge, as of 2024-04-12) would be going around and opening PRs to add a trivial binding like this https://github.com/tree-sitter/tree-sitter-python/tree/71778c2a472ed00a64abf4219544edbf8e4b86d7/bindings/python/tree_sitter_python for every upstream grammar, having them upload it to PyPI, and then packaging that. That does not scale.

We also don't need to depend on the wheels distributed through pypi, for example the tree-sitter-javascript package listed above, was built from source.

That would be fine but it still does not solve the issue for grammars that do not have Python bindings.

I would say the requirement to have m×n bindings for each combination of language and grammar itself is something that should be addressed but I do not have a solution for that 🤷‍♀️

jtojnar avatar Jul 02 '24 17:07 jtojnar

The only other alternative to have all those grammars available from Python with the current API (to my knowledge, as of 2024-04-12) would be going around and opening PRs to add a trivial binding like this https://github.com/tree-sitter/tree-sitter-python/tree/71778c2a472ed00a64abf4219544edbf8e4b86d7/bindings/python/tree_sitter_python for every upstream grammar, having them upload it to PyPI, and then packaging that. That does not scale.

Hi, I just want to mention that every recent enough grammar should have the binding, since the official template contains it: https://github.com/tree-sitter/tree-sitter-embedded-template.

So maybe, all (maintained) upstream grammars should at some point contain the binding.

adfaure avatar Jul 02 '24 19:07 adfaure

@adfaure That is actually a grammar for ERB and EJS template languages. But looks like a template does exist so you might be right that it will be everywhere eventually.

So maybe we could indeed use the upstream sources. For example, here is how I would convert python3.pkgs.tree-sitter-javascript (after applying #324161):

--- a/pkgs/development/python-modules/tree-sitter-javascript/default.nix
+++ b/pkgs/development/python-modules/tree-sitter-javascript/default.nix
@@ -1,6 +1,6 @@
 { lib
 , buildPythonPackage
-, fetchFromGitHub
+, tree-sitter-grammars
 , setuptools
 , wheel
 , tree-sitter
@@ -8,16 +8,9 @@
 
 buildPythonPackage rec {
   pname = "tree-sitter-javascript";
-  version = "0.21.3";
+  inherit (tree-sitter-grammars.tree-sitter-javascript) version src;
   pyproject = true;
 
-  src = fetchFromGitHub {
-    owner = "tree-sitter";
-    repo = "tree-sitter-javascript";
-    rev = "v${version}";
-    hash = "sha256-jsdY9Pd9WqZuBYtk088mx1bRQadC6D2/tGGVY+ZZ0J4=";
-  };
-
   build-system = [
     setuptools
     wheel

jtojnar avatar Jul 02 '24 21:07 jtojnar

@adfaure That is actually a grammar for ERB and EJS template languages. But looks like a template does exist so you might be right that it will be everywhere eventually.

Oh, I see. Thank you for pointing that out and finding the actual template.

adfaure avatar Jul 02 '24 22:07 adfaure

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2024-summer-of-nix-program-updates/46053/3

nixos-discourse avatar Jul 08 '24 17:07 nixos-discourse

I could pick this up again and run it through to completion.

mightyiam avatar Dec 03 '24 15:12 mightyiam

Hello everyone, let's wrap this up.

TLDR:

  • You can build all with: NIXPKGS_ALLOW_BROKEN=1 NIXPKGS_ALLOW_BROKEN=1 nix-build -A python3Packages.tree-sitter-grammars -k.
  • I've addressed most of the straightforward comments (e.g., adding comments and minor fixes).
  • The generated code may look messy, but it's the best way to make all grammars in nixpkgs work.
  • I agree testing is an issue; the provided tests give some information (e.g., it failed for tree_sitter_nvim_org) but it is not much.
  • I've tried different approaches, and the best solution seems to be enabling tests directly in the tree-sitter-grammars.

Generated Code:
With recent versions of tree-sitter, bindings are generated and incorporated into the project. For newer grammars, we could build bindings directly from the source without generating extra files. However, the tree-sitter ecosystem is sparse, and many grammars aren't up-to-date, so they lack bindings. A hybrid solution—generate bindings for old grammars and use direct source for newer ones—might work, but the time-to-benefit ratio might not justify it.

Testing:
Testing is challenging because each grammar defines its own node identifiers in the parsed source tree, making it difficult to have a generic test. The default test checks if the binding can load a language, and tree-sitter has its own testing framework, but it's in Rust and compares code snippets to s-expressions.

I explored three approaches to incorporate tests into the Python modules:

  1. Enable tree-sitter test in the grammar.nix file (The one that I choose):
    • This approach revealed broken grammars, which I flagged as broken.
    • I updated the Python script to test the grammar and automatically add the broken flag if the test fails.
  2. Use grammar tests as queries and check them against the root tree:
    • This worked somewhat, but it's unreliable because the test language (s-expression-based) isn't the same as the query language, and some queries didn't work.
  3. Parse the test language and compare the Python tree with the s-expression:
    • This approach could work but is too complex and not reliable enough. The test language has many features to implement, and sometimes the generated Python s-expression didn't match the expected s-expression, and I couldn't determine why.

Usage

To use the grammars in python, you can install them with (example with rust):

NIX_PATH=nixpkgs=https://github.com/ngi-nix/nixpkgs/archive/treesitter-python-generated-grammars.tar.gz nix-shell -p 'python3.withPackages(ps : with ps; [ tree-sitter tree-sitter-grammars.tree-sitter-rust ])' --pure

and execute the following python file python main.py:

# main.py
import tree_sitter, tree_sitter_rust

rust = tree_sitter.Language(tree_sitter_rust.language())
parser = tree_sitter.Parser(rust)

tree = parser.parse(
    bytes(
        """
mod english;

mod english {}

mod english {
    mod greetings {}
    mod farewells {}
}

pub mod english;
        """,
        "utf8"
    )
)

print(tree.root_node)

That should output something like:

(source_file (mod_item name: (identifier)) (mod_item name: (identifier) body: (declaration_list)) (mod_item name: (identifier) body: (declaration_list (mod_item name: (identifier) body: (declaration_list)) (mod_item name: (identifier) body: (declaration_list)))) (mod_item (visibility_modifier) name: (identifier)))

adfaure avatar Dec 03 '24 15:12 adfaure

Thank you @fricklerhandwerk for the review, I addressed most of your comments.

I still need to address the following:

Please add the usage example from #320783 (comment) (with a properly written out shell.nix) to the Nixpkgs manual.

I'll do that shortly, I just need to locate it first.

Don't we want to derive that from the Python we use for building? Or if it's fixed to 3.8 for some reason, at least use the same source for it everywhere it's referenced. See all the other hard-coded values.

Regarding Py_LIMITED_API, I don’t have the right answer yet. I need to think about it a bit more.

adfaure avatar Dec 07 '24 14:12 adfaure

I think this is good and will merge it in the next few days. If anyone has complaints after that, we're here to fix forward.

fricklerhandwerk avatar Mar 30 '25 10:03 fricklerhandwerk

Congratulations!

mightyiam avatar Apr 02 '25 07:04 mightyiam

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/ngi-nix-report-for-2025q1-2025-01-13-2025-04-15/63859/1

nixos-discourse avatar May 05 '25 00:05 nixos-discourse