py-tree-sitter

Tree object is not serializable

Open oxeye-daniel opened this issue 3 years ago • 6 comments

There are several common use cases where the user wants to serialize a Tree object for later use. I was wondering, why aren't Tree and Node serializable? Thanks!

oxeye-daniel avatar Mar 16 '22 13:03 oxeye-daniel

serialize how, exactly?

lunixbochs avatar Mar 31 '22 15:03 lunixbochs

Trees are large, so it’s better to reparse them from source code than to serialize them directly.

maxbrunsfeld avatar Mar 31 '22 16:03 maxbrunsfeld

serialize how, exactly?

For example, one popular option would be serializing to and from JSON.

oxeye-daniel avatar Apr 03 '22 07:04 oxeye-daniel

Trees are large, so it’s better to reparse them from source code than to serialize them directly.

I understand that trees are large. However, a very common use case is loading a tree from disk when the source code is not fully available. Other languages and frameworks support this natively: for example, you can serialize (pickle/JSON) a LibCST tree (https://github.com/Instagram/LibCST) or an AST produced by Python's ast module.
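For reference, a minimal sketch of the ast round trip mentioned here, using only the standard library. The source string is an arbitrary example; nothing in it involves tree-sitter.

```python
import ast
import pickle

# Parse a small snippet into a standard-library AST.
tree = ast.parse("x = 1 + 2")

# ast nodes are plain Python objects, so pickle can round-trip them.
clone = pickle.loads(pickle.dumps(tree))

# The copy is a distinct object with the same structure.
same_structure = ast.dump(clone) == ast.dump(tree)
```

This works because ast nodes live entirely in Python, with no native state behind them.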

oxeye-daniel avatar Apr 03 '22 07:04 oxeye-daniel

Serializing Tree or Node objects is very important in some cases! For example, if you need to process many source files, you will probably want to use multiprocessing. But multiprocessing requires the objects passed between processes to be picklable (that is, to support pickle serialization). At the moment it is not possible to use py-tree-sitter with multiprocessing as is.

MLDPEngineer avatar Apr 19 '22 08:04 MLDPEngineer

There are two major problems I see here:

  1. The tree-sitter C API uses opaque structs, such as TSTree, which the Python bindings have no way to serialize (the API just doesn't give us direct access to the contents).

  2. The TSTree object also holds a handle to the TSLanguage object, which is a totally opaque pointer provided by the language implementation.


What are the implications of this?

  1. We can't serialize a TSTree struct, because we don't have access to the internals.

  2. To use a Tree object in Python, you need the language library loaded. That library can't itself be pickled, and I don't think there's a great way to guarantee we can load it during unpickling.

I think these problems make it unrealistic for py-tree-sitter to itself provide pickle support. Also consider that even if these problems were solved, it might actually be slower to pickle and unpickle a tree than to parse it again with tree-sitter.

LibCST doesn't have these problems, because it is written entirely for Python. With tree-sitter, we're adapting an existing ecosystem that has a lot of native code we don't control in this repository (tree-sitter itself, and every language is implemented by a different organization and has its own opaque backend).


As a sort of workaround, you could register pickle hooks in your own code. The serializer could wrap node.text in a new class. The deserializer side would need to already have an instance of the appropriate tree-sitter parser, and would parse the node again from scratch.

Here's an example: https://gist.github.com/lunixbochs/b4925de38c4930045e088cc86e887be1

This works for both trees and nodes, but as you can see the nodes will change subtly as they weren't parsed from a full document. To create the new nodes in an identical manner, you'd need to pickle an entire tree's source text, parse it into a new tree, and extract just one node from it.
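A rough sketch of the "store the source, reparse on load" idea follows. It is not the code from the gist above: it uses the standard library's persistent-ID hooks (`Pickler.persistent_id` / `Unpickler.persistent_load`) so that the receiving side supplies its own parser. `StandInTree` and `stand_in_parse` are hypothetical stand-ins for a real tree and an already-loaded parser, not py-tree-sitter APIs.

```python
import io
import pickle


class StandInTree:
    """Hypothetical stand-in for an opaque parsed tree (not tree_sitter.Tree)."""

    def __init__(self, source: bytes):
        self.source = source
        self.words = source.split()  # pretend this is native parse state


def stand_in_parse(source: bytes) -> StandInTree:
    """Hypothetical stand-in for parsing with an already-loaded parser."""
    return StandInTree(source)


class SourcePickler(pickle.Pickler):
    def persistent_id(self, obj):
        # Store only the source text for trees; other objects pickle normally.
        if isinstance(obj, StandInTree):
            return ("tree-source", obj.source)
        return None


class ReparsingUnpickler(pickle.Unpickler):
    def __init__(self, file, parse):
        super().__init__(file)
        self._parse = parse  # the receiving side brings its own parser

    def persistent_load(self, pid):
        tag, source = pid
        if tag != "tree-source":
            raise pickle.UnpicklingError(f"unknown persistent id: {tag!r}")
        return self._parse(source)  # rebuild the tree by parsing again


def dumps(obj) -> bytes:
    buffer = io.BytesIO()
    SourcePickler(buffer).dump(obj)
    return buffer.getvalue()


def loads(data: bytes, parse):
    return ReparsingUnpickler(io.BytesIO(data), parse).load()


original = {"path": "example.py", "tree": StandInTree(b"def f(): pass")}
restored = loads(dumps(original), stand_in_parse)
```

The pickle stream only ever contains the source bytes, so the same caveat applies: the restored object is a fresh parse, not a copy of the original native state.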


To sum up, py-tree-sitter can't provide pickling of the underlying native objects, because it doesn't have access to some of the internal state (both tree-sitter and the parsed language are independently opaque), and because some of the (language-specific) objects are implemented by native code loaded at runtime, and you can't really pickle native code. We can't fix either of these issues in the py-tree-sitter repository.

You could make a feature request against upstream tree-sitter, but the TSLanguage pointer is opaque to tree-sitter as well, so it would probably require quite an overhaul of the whole ecosystem. I think a naive overhaul may be a bad idea as well, as the serialization format would depend on a lot of internal state in tree-sitter and would likely be unstable across different versions of tree-sitter and the language parsers.

I think a simpler upstream change would theoretically be to allow parsing text with a fake starting scope, file offset, and node ID, so the re-parsed nodes would more closely match. That would improve the behavior of my existing pickle example with Nodes.


You may be able to work around the lack of pickle support yourself using my provided example, depending on your workload and how closely your tree-sitter objects need to match each other.

Alternatively, you can architect your app to avoid sending tree-sitter objects between your multiprocessing workers: use an initializer function to create a new parser / language per worker, extract whatever fields you need from a node, and wrap those in a standalone dataclass that doesn't depend on tree-sitter.
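A minimal sketch of the standalone-dataclass part might look like this. `NodeRecord` and `to_record` are hypothetical names, and the `SimpleNamespace` objects are stubs standing in for real nodes so the snippet runs without tree-sitter installed. The function only reads `type`, `start_byte`, `end_byte` and `children`, which py-tree-sitter nodes expose.

```python
import json
from dataclasses import asdict, dataclass, field
from types import SimpleNamespace


@dataclass
class NodeRecord:
    """Plain-data snapshot of a node; holds no reference to tree-sitter."""

    type: str
    start_byte: int
    end_byte: int
    children: list["NodeRecord"] = field(default_factory=list)


def to_record(node) -> NodeRecord:
    # Copy the fields we care about, recursing into child nodes.
    return NodeRecord(
        type=node.type,
        start_byte=node.start_byte,
        end_byte=node.end_byte,
        children=[to_record(child) for child in node.children],
    )


# Stub nodes for "def f(): pass" (hypothetical; a worker would use real nodes).
leaf = SimpleNamespace(type="identifier", start_byte=4, end_byte=5, children=[])
root = SimpleNamespace(
    type="function_definition", start_byte=0, end_byte=13, children=[leaf]
)

record = to_record(root)
payload = json.dumps(asdict(record))  # plain data, so JSON works too
```

Each worker would call something like `to_record` on the nodes it parsed and return only the records, so nothing native has to cross the process boundary.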

lunixbochs avatar Apr 19 '22 18:04 lunixbochs

Serialisation is not available in any of the bindings and pickles are bad.

ObserverOfTime avatar Feb 26 '24 11:02 ObserverOfTime