tree-sitter icon indicating copy to clipboard operation
tree-sitter copied to clipboard

predicate/directive evaluation in runtime library

Open mattmassicotte opened this issue 2 years ago • 12 comments

Predicates and directives are a critical part of many types of queries. Right now, they are left up to the language bindings. Aside from the obvious duplication of effort, I've also noticed some deviation on usage, meaning and structure of directives in particular. This really hurts the ability to standardize query definitions, which I believe could really improve the developer experience.

I'd like to propose augmenting the runtime library to support evaluation of predicates and directives directly. I have some idea of an API, informed by my work adding this to the Swift bindings. But before I get that far, I'd like to just check in. Is this something you are interested in seeing?

mattmassicotte avatar Jan 17 '23 18:01 mattmassicotte

The problem with supporting predicates in the runtime library is that many predicates require access to a source test to make sub strings comparison while the query runtime just operates on a pure CST tree. But I think it's doable by adding an additional query functions that would receive a text provider function just like parser functions and do predicates evaluation in the core query runtime, this may even do more effective queries evaluation by skipping some sub trees traversal based on predicate conditions.

ahlinc avatar Jan 17 '23 20:01 ahlinc

The problem with this is that one of the commonly-used of predicates is #match?, which needs a regex engine to evaluate. Most high-level languages already have good regex libraries that are either part of the standard library (JavaScript, Go, Python, Swift) or very commonly-used libraries, so that most downstream applications will already depend on them (Rust). But in the C library, we don't have reliable, cross-platform access to a regex implementation.

maxbrunsfeld avatar Jan 17 '23 21:01 maxbrunsfeld

It is indeed a pain to need access to the text at query execution time, as @ahlinc says. But a text provider, just like the parser needs, can address that. It's not really too bad. What is bad is the need to ensure the text content and tree match. Even though the tree API supports concurrent well, this really hurts the overall concurrency story.

I've found that the majority of real-world uses of #match are actually checking case, or comparing against a static list. Both of these uses are much easier and more performant with simpler predicates like #any-of? or #capitalized?. But, I'm sure there are uses that these won't cover. I'd propose implementing a simple function argument that can handle #match. So, I recognize that #match cannot be supported directly, but I think that's ok.

The #is-not? predicate, however, is way worse. I'm not 100% sure how to handle this case, but I think it can also be done with a callback.

Now, not to derail the conversation too much, but I would absolutely love to be able to feed the query definitions themselves into the code generation system. I use tree-sitter as a query-execution tool, and I would imagine most use it this way too. Statically defining the queries, and encoding the results of predicates into tree could avoid the need to look at the content twice. Besides being neat, it would actually enable concurrency support for queries, which I think is currently challenging when supporting edits while a query is being evaluated in the background. I'd also love to save on query preparation time, which is significant. But, I recognize this is a non-trivial thing to do, so I just throwing it out here.

Now, even with these limitations, I still think there's a big win to be had here.

mattmassicotte avatar Jan 17 '23 21:01 mattmassicotte

I've noticed that #lua-match? is common in highlights written with the nvm-treesitter project in mind. When I consume these query predicates they are just dropped, leading to more things being highlighted than they should.

Runtime access to the predicates would let me check the regexes with an appropriate library. Alternatively, could it be an optional feature of the Rust library to depend on regex?

Wilfred avatar Feb 10 '23 16:02 Wilfred

There already is runtime access to the predicates from Rust, so you could process lua-match yourself using rust regexes (if the same regex syntax is supported)

maxbrunsfeld avatar Feb 10 '23 16:02 maxbrunsfeld

Just as an idea that came up to me. It would be possible to split the tree-sitter.so to a series of libraries:

  • tree-sitter.so - a pure algorithmic library without any external deps (so even allocator funcs need to be provided for it, currenly they can be replaced with ts_set_allocator)
  • tree-sitter-predicates.so - a library that compiled with a Rust regex crate and provides query predicates processing, so for end users it would look like predicates would be processed in Tree-sitter core. This also would allow to provide predicates support by default for a lot of targets and not only for Rust and Wasm.
  • tree-sitter-utils.so - provides for tree-sitter.so a default implementation of memory allocation functions and all other platform dependent things. So the ts_node_string() in the main tree-sitter.so would return nothing until a correct wrappers for *printf* functions would be provided for some way.

ahlinc avatar Feb 10 '23 18:02 ahlinc

You could absolutely do this!

However, I still think there is tremendous benefit to the tree-sitter community to just internalizing predicate handling. I have absolutely no problem with it being a standalone library. But, my preference would still be to allow regex handling to be injected via a function. This would make it slightly less-trivial to support than embedding the Rust implementation, but would offer greater flexibility to environments that do not need a regex engine.

Additionally, while I imagine that #lua-match was probably made to handle some very specific differences, I have found many uses of it that can be replaced by a ICU-style #match. I think it's just getting cargo-culted sometimes. Not to mention that many times I've seen #match used, it can be replaced by a much simpler construct.

mattmassicotte avatar Feb 10 '23 18:02 mattmassicotte

@mattmassicotte on a previous message @maxbrunsfeld reasonably pointed that it's impossible to implement all currently supported predicates like #match? without integrating with some good regex library. In the C ecosystem there is no such reliable library. Above I suggested a way how we can join with Rust's regex crate. The above structure needed to don't close possibility to compile and use Tree-sitter on environments that isn't supported by Rust compiler and don't make Rust compiler a default requirement for the base Tree-sitter C library. For Tree-sitter consumers that use language bindings or the C library in current form nothing would change in API with changes proposed above.

ahlinc avatar Feb 10 '23 19:02 ahlinc

@mattmassicotte on a previous message @maxbrunsfeld reasonably pointed that it's impossible to implement all currently supported predicates like #match? without integrating with some good regex library. In the C ecosystem there is no such reliable library. Above I suggested a way how we can join with Rust's regex crate. The above structure needed to don't close possibility to compile and use Tree-sitter on environments that isn't supported by Rust compiler and don't make Rust compiler a default requirement for the base Tree-sitter C library. For Tree-sitter consumers that use language bindings or the C library in current form nothing would change in API with changes proposed above.

Yes, sorry I should have re-iterated my proposed solution. I'd like to expose a match handler function from the predicate system. It can be used to defer the actual regular expression evaluation to the client. It's true, this isn't quite as nice as a fully packaged up solution. But, it's very simple and doesn't require bundling in any regex engines. And, all clients that do this now clearly have that anyways.

Additionally, I'm trying to make the case that many uses of #lua-match and #match can be replaced by non-regex-based operators anyways. In fact, the neovim-tree-sitter documentation already specifically mentions favoring the #any-of predicate where possible. This matches my (admittedly limited) experience that regex predicates are frequently used for trivial conditions.

mattmassicotte avatar Feb 10 '23 21:02 mattmassicotte

@mattmassicotte I've got your point, I agree, it would be possible to support processing in the core library all pure string predicates that actually don't contain any regex specifics in their definitions and only all unprocessed predicates would be raised to a caller scope.

ahlinc avatar Feb 10 '23 22:02 ahlinc

@mattmassicotte If you want to evaluate #match predicates via an arbitrary regex engine, the library already exposes the primitives needed for doing that: you can retrieve the predicates and their arguments from the query using ts_query_predicates_for_pattern. So I don't think that adding specific C APIs specifically for #match predicates makes sense.

maxbrunsfeld avatar Feb 10 '23 22:02 maxbrunsfeld

The problem with this is that one of the commonly-used of predicates is #match?, which needs a regex engine to evaluate. Most high-level languages already have good regex libraries that are either part of the standard library (JavaScript, Go, Python, Swift) or very commonly-used libraries, so that most downstream applications will already depend on them (Rust). But in the C library, we don't have reliable, cross-platform access to a regex implementation.

I was really suggesting this as a lightweight solution to the issue you brought up here. But, if you prefer bundling up the Rust regex library I'm all for it. I'm not proposing adding a C API for #match. I'm proposing C APIs to allow the runtime to correctly evaluate real queries, along with a system to inject the regex dependency because that bit is annoying, as you say.

Today, clients have to implement all predicate handling and filtering. I've done this in the Swift client, using the existing C APIs, and I found it to be non-trivial.

mattmassicotte avatar Feb 10 '23 23:02 mattmassicotte