py-tree-sitter icon indicating copy to clipboard operation
py-tree-sitter copied to clipboard

Add visitors to the Tree-sitter Python API

Open cedricrupb opened this issue 3 years ago • 8 comments

Feature request: visitor pattern

This PR represents a feature request for the visitor pattern to be included in the Tree-sitter Python API. The visitor pattern is well-established method for processing large hierarchical object structures including abstract syntax trees.

While this use case is essentially covered by TreeCursor, visitors can provide an convenient method for traversing the whole tree structure. In addition, this PR would provide a native interface for resolving issues like #32 and #33.

Here is a summary of the changes implemented by this PR:

  • Add new module visitor which includes the implementation of ASTVisitor
  • ASTVisitor implement the visitor pattern based on the TreeCursor API
  • The visitor conveniently traverses the tree in pre-order traversal. The implementation has minimal overhead and is inline with previous implementation (e.g. the one found in this comment).
  • New visitor implementation can subclass ASTVisitor. Nodes are handled by defining new methods visit_[Node type] where Node type is the type of the node that should be handled.

This implementation has be proven to be really valuable in our recent research projects. Therefore, I like to contribute this to the official Tree-sitter Python API. In addition, I think that visitors could be valuable alternative for interfacing the parsed AST trees (especially when the complete tree has to be traversed).

cedricrupb avatar Feb 25 '22 13:02 cedricrupb

Thanks for the contribution!

  1. Would you be able to write a test for this?

  2. I'm not sure about the _node_equal method. Can you speak to cases where node equality fails? See also for example this issue I filed with tree sitter, it seems like it's possible for grammars to have zero width nodes, which implies that comparison by text position is wrong: https://github.com/tree-sitter/tree-sitter/issues/1647

lunixbochs avatar Feb 25 '22 20:02 lunixbochs

  1. The PR is now updated to include tests for the visitor's core functionalities.

  2. I found that the node equality check with the current cursor implementation is unnecessary. It seems that potential loops introduced by missing nodes are directly handled by TreeCursor. Therefore, I removed the related code.

Regarding the node equality (Only FYI, irrelevant now)

In an earlier implementation, the node equality check failed sometimes between an ERROR node and another node (albeit this behavior is inconsistent). Here an example from the log:

Node A, Node B, A == B
<Node kind="}", start_point=(2, 1), end_point=(2, 2)>, <Node kind="}", start_point=(2, 2), end_point=(2, 2)>,  False
<Node kind=ERROR, start_point=(1, 747), end_point=(2, 2)>, <Node kind="}", start_point=(2, 2), end_point=(2, 2)>, False
<Node kind="}", start_point=(2, 1), end_point=(2, 2)>, <Node kind="}", start_point=(2, 2), end_point=(2, 2)>, False
<Node kind=ERROR, start_point=(1, 747), end_point=(2, 2)>, <Node kind="}", start_point=(2, 2), end_point=(2, 2)>, True

However, I cannot reproduce this behavior with the current implementation (or current version of Tree-sitter).

cedricrupb avatar Feb 28 '22 18:02 cedricrupb

To me, this seems like an unnecessarily complicated API, for how little actual functionality is being added. I don't like that you need to use inheritance to use this API. I think that if we want to provide an abstraction around the depth-first traversal, I'd rather see that added in a simpler way, like a simple method on a TreeCursor that returns an iterator.

cursor = tree.walk()
for node in cursor.depth_first_nodes():
    print node

I think that is a bit simpler and less prescriptive. @lunixbochs @cedricrupb What do you think of that version?

maxbrunsfeld avatar Feb 28 '22 19:02 maxbrunsfeld

I've been trying to think about the API here.

I think using return False for preventing subtree traversal stops one of the useful cases for tree visitors, which is to construct another tree recursively by propagating the return values upward.

The Visitor class pattern does have precedent in Python's stdlib ast.NodeVisitor and popular libraries like Lark

cursor.depth_first_nodes() seems pretty convenient but doesn't allow the "skip this subtree" approach, and still doesn't make it very convenient to incrementally construct another tree in the process of visiting.

If users upgrade to python 3.10, they do get Structural Pattern Matching, which is probably better than the implicit getattr(visit_name) approach:

cursor = tree.walk()
for node in cursor.depth_first_nodes():
    match node.type:
        case "string":
            ...

I think there's probably room for additional convenience methods on cursor / tree.

lunixbochs avatar Feb 28 '22 20:02 lunixbochs

Generally, I like the idea of having an easy way to iterate through all nodes.

However, I think that a visitor pattern could add additional flexibility which cannot be easily achieved with a simple iterator.

For example:

  1. A visitor can carry state during traversal:

class WriteVariableFinder(ASTVisitor):
     """A very primitive implementation to find variables in write context"""

    def __init__(self):
        self._in_write_ctx = False
    
    def visit_assignment(self, assignment_node):
      self._in_write_ctx = True
      self.walk(assignment_node.child_by_field_name("left"))
      self._in_write_ctx = False
      return False
     
    def visit_identifier(self, id_node):
       if self._in_write_ctx:
         print(f"Maybe found variable in write context: {id_node}")
           
  1. A visitor can be used for a shallow search:
class FindFunctionNode(ASTVisitor):
  def __init__(self):
    self.functions = []
           
  def visit_function_definition(self, node):
    self.functions.append(node)
    return False # Stop here and do not traverse function

These are some use cases which made visitors really helpful in the past.

cedricrupb avatar Feb 28 '22 20:02 cedricrupb

I agree skipping subtrees is interesting. Let's assume there's a way to signal that, e.g. a cursor.skip_children() method. Now your Visitor class mostly reduces to something like this:

class Visitor:
    def visit_default(self, node):
        pass

    def walk(self, tree):
        cursor = tree.walk()
        for node in cursor.depth_first_nodes():
            visit = getattr(self, 'visit_' + node.type, self.visit_default)
            if visit(node) is False: cursor.skip_children()

and FindFunctionNode can be done like this:

cursor = tree.walk()
for node in cursor.depth_first_nodes():
    if node.type == "function_definition":
        functions.append(node)
        cursor.skip_children()

lunixbochs avatar Feb 28 '22 20:02 lunixbochs

I am not 100% sure how depth_first_nodes would operate.

Would it modify/operate on the state of the cursor? In this case, I could imagine that skip_children could be implemented by a combination of goto_next_sibling and goto_parent.

However, if the iterator is influenced by the state of the cursor, this could lead to unwanted side effects. For example:

for node in cursor.depth_first_nodes():
  cursor.goto_parent()

This would lead into an infinite loop.

cedricrupb avatar Feb 28 '22 21:02 cedricrupb

If the complexity of creating a visitor is the main point, I could alternatively imagine the integration of visitors directly in the tree / cursor.

For example, the tree could be visited by providing a callable to a visit method:

cursor = tree.walk()
cursor.visit(lambda node: print(f"Found {node.type}"))

Then we skip subtrees when the callable returns False.

This would:

  1. simplify the creation of an AST visitor in the style I proposed here and
  2. Add additional flexibility since the callable could be anything.

cedricrupb avatar Feb 28 '22 21:02 cedricrupb

~I'm not sure if this is needed, does the newly implemented LookaheadIterator API upstream provide more enhanced functionality this sought to bring?~

I was extremely tired when I wrote that the LookaheadIterator stuff is different than what this is doing - but I still don't particularly like this, I don't see the point. Is it just to auto-walk the entire tree calling callbacks to be invoked on each visit? I feel that this is too generic a solution for a problem that should be tailored more carefully on an individual basis

amaanq avatar Sep 04 '23 20:09 amaanq

We're not interested in implementing this API here. You can publish it as a separate (py-tree-sitter-visitors) package instead.

ObserverOfTime avatar Feb 26 '24 10:02 ObserverOfTime