py-tree-sitter
py-tree-sitter copied to clipboard
Add visitors to the Tree-sitter Python API
Feature request: visitor pattern
This PR represents a feature request for the visitor pattern to be included in the Tree-sitter Python API. The visitor pattern is well-established method for processing large hierarchical object structures including abstract syntax trees.
While this use case is essentially covered by TreeCursor
, visitors can provide an convenient method for traversing
the whole tree structure. In addition, this PR would provide a native interface for resolving issues like #32 and #33.
Here is a summary of the changes implemented by this PR:
- Add new module
visitor
which includes the implementation ofASTVisitor
-
ASTVisitor
implement the visitor pattern based on theTreeCursor
API - The visitor conveniently traverses the tree in pre-order traversal. The implementation has minimal overhead and is inline with previous implementation (e.g. the one found in this comment).
- New visitor implementation can subclass
ASTVisitor
. Nodes are handled by defining new methodsvisit_[Node type]
where Node type is the type of the node that should be handled.
This implementation has be proven to be really valuable in our recent research projects. Therefore, I like to contribute this to the official Tree-sitter Python API. In addition, I think that visitors could be valuable alternative for interfacing the parsed AST trees (especially when the complete tree has to be traversed).
Thanks for the contribution!
-
Would you be able to write a test for this?
-
I'm not sure about the
_node_equal
method. Can you speak to cases where node equality fails? See also for example this issue I filed with tree sitter, it seems like it's possible for grammars to have zero width nodes, which implies that comparison by text position is wrong: https://github.com/tree-sitter/tree-sitter/issues/1647
-
The PR is now updated to include tests for the visitor's core functionalities.
-
I found that the node equality check with the current cursor implementation is unnecessary. It seems that potential loops introduced by missing nodes are directly handled by
TreeCursor
. Therefore, I removed the related code.
Regarding the node equality (Only FYI, irrelevant now)
In an earlier implementation, the node equality check failed sometimes between an ERROR node and another node (albeit this behavior is inconsistent). Here an example from the log:
Node A, Node B, A == B
<Node kind="}", start_point=(2, 1), end_point=(2, 2)>, <Node kind="}", start_point=(2, 2), end_point=(2, 2)>, False
<Node kind=ERROR, start_point=(1, 747), end_point=(2, 2)>, <Node kind="}", start_point=(2, 2), end_point=(2, 2)>, False
<Node kind="}", start_point=(2, 1), end_point=(2, 2)>, <Node kind="}", start_point=(2, 2), end_point=(2, 2)>, False
<Node kind=ERROR, start_point=(1, 747), end_point=(2, 2)>, <Node kind="}", start_point=(2, 2), end_point=(2, 2)>, True
However, I cannot reproduce this behavior with the current implementation (or current version of Tree-sitter).
To me, this seems like an unnecessarily complicated API, for how little actual functionality is being added. I don't like that you need to use inheritance to use this API. I think that if we want to provide an abstraction around the depth-first traversal, I'd rather see that added in a simpler way, like a simple method on a TreeCursor
that returns an iterator.
cursor = tree.walk()
for node in cursor.depth_first_nodes():
print node
I think that is a bit simpler and less prescriptive. @lunixbochs @cedricrupb What do you think of that version?
I've been trying to think about the API here.
I think using return False
for preventing subtree traversal stops one of the useful cases for tree visitors, which is to construct another tree recursively by propagating the return values upward.
The Visitor class pattern does have precedent in Python's stdlib ast.NodeVisitor and popular libraries like Lark
cursor.depth_first_nodes()
seems pretty convenient but doesn't allow the "skip this subtree" approach, and still doesn't make it very convenient to incrementally construct another tree in the process of visiting.
If users upgrade to python 3.10, they do get Structural Pattern Matching, which is probably better than the implicit getattr(visit_name)
approach:
cursor = tree.walk()
for node in cursor.depth_first_nodes():
match node.type:
case "string":
...
I think there's probably room for additional convenience methods on cursor / tree.
Generally, I like the idea of having an easy way to iterate through all nodes.
However, I think that a visitor pattern could add additional flexibility which cannot be easily achieved with a simple iterator
.
For example:
- A visitor can carry state during traversal:
class WriteVariableFinder(ASTVisitor):
"""A very primitive implementation to find variables in write context"""
def __init__(self):
self._in_write_ctx = False
def visit_assignment(self, assignment_node):
self._in_write_ctx = True
self.walk(assignment_node.child_by_field_name("left"))
self._in_write_ctx = False
return False
def visit_identifier(self, id_node):
if self._in_write_ctx:
print(f"Maybe found variable in write context: {id_node}")
- A visitor can be used for a shallow search:
class FindFunctionNode(ASTVisitor):
def __init__(self):
self.functions = []
def visit_function_definition(self, node):
self.functions.append(node)
return False # Stop here and do not traverse function
These are some use cases which made visitors really helpful in the past.
I agree skipping subtrees is interesting. Let's assume there's a way to signal that, e.g. a cursor.skip_children()
method. Now your Visitor class mostly reduces to something like this:
class Visitor:
def visit_default(self, node):
pass
def walk(self, tree):
cursor = tree.walk()
for node in cursor.depth_first_nodes():
visit = getattr(self, 'visit_' + node.type, self.visit_default)
if visit(node) is False: cursor.skip_children()
and FindFunctionNode can be done like this:
cursor = tree.walk()
for node in cursor.depth_first_nodes():
if node.type == "function_definition":
functions.append(node)
cursor.skip_children()
I am not 100% sure how depth_first_nodes
would operate.
Would it modify/operate on the state of the cursor? In this case, I could imagine that skip_children
could be implemented by a combination of goto_next_sibling
and goto_parent
.
However, if the iterator is influenced by the state of the cursor, this could lead to unwanted side effects. For example:
for node in cursor.depth_first_nodes():
cursor.goto_parent()
This would lead into an infinite loop.
If the complexity of creating a visitor is the main point, I could alternatively imagine the integration of visitors directly in the tree / cursor.
For example, the tree could be visited by providing a callable
to a visit
method:
cursor = tree.walk()
cursor.visit(lambda node: print(f"Found {node.type}"))
Then we skip subtrees when the callable returns False
.
This would:
- simplify the creation of an AST visitor in the style I proposed here and
- Add additional flexibility since the callable could be anything.
~I'm not sure if this is needed, does the newly implemented LookaheadIterator API upstream provide more enhanced functionality this sought to bring?~
I was extremely tired when I wrote that the LookaheadIterator stuff is different than what this is doing - but I still don't particularly like this, I don't see the point. Is it just to auto-walk the entire tree calling callbacks to be invoked on each visit? I feel that this is too generic a solution for a problem that should be tailored more carefully on an individual basis
We're not interested in implementing this API here. You can publish it as a separate (py-tree-sitter-visitors) package instead.