tree-sitter-c icon indicating copy to clipboard operation
tree-sitter-c copied to clipboard

Cyclic path in the parse tree

Open casouri opened this issue 3 years ago • 6 comments

I encountered a case where a node's sibling's parent is the node itself. I don't know if this is a tree-sitter bug or a tree-sitter-c bug, but here it is:

For a program like this:

#ifdef A
struct a{
};

the parse tree is

(preproc_ifdef #ifdef name: (identifier)
 (struct_specifier struct name: (type_identifier)
  body: (field_declaration_list { }))
 ; #endif)

And if we go from the semicolon, first find its next sibling, which is the #endif, then the parent of #endif, and we are back at the semicolon.

To reproduce, try the following python script. (libtree-sitter-c.dylib needs to be in the same directory as the script.) I'm using tree-sitter-c built from latest master, and python's tree-sitter binding on PyPI, so 0.20.1. I first observed this in Emacs, and it uses tree-sitter v0.20.7.

from tree_sitter import Language, Parser
import tree_sitter

LANG = Language('libtree-sitter-c.dylib', 'c')
parser = Parser()
parser.set_language(LANG)

text = '''#ifdef A
struct a{
};
'''

tree = parser.parse(bytes(text, "utf8"))

root_node = tree.root_node

query = LANG.query('";" @cap')

captures = query.captures(tree.root_node)

semicolon = captures[0][0]

semicolon.next_sibling.parent == semicolon

which evaluates to True.

casouri avatar Dec 13 '22 23:12 casouri

I think this is a bug in the ts_node_parent method, that occurs when a node has zero extent (such as the missing #endif token in this tree). It's pretty bad; it'll require making ts_node_parent a bit more complex to fix.

For now, my suggested workaround is to not use the ts_node APIs that walk up the syntax tree (the sibling and parent APIs). They have never been very efficient anyway, and for code that needs to be performant, you should walk the tree using a tree cursor.

I've also considered simply removing the parent and sibling APIs, so that the tree cursor would be the only way to walk the tree in an upward fashion.

maxbrunsfeld avatar Dec 14 '22 01:12 maxbrunsfeld

Thanks. Pardon my ignorance, but I wonder how could you traverse the parse tree backwards with a cursor? That would be useful for, eg, searching for a function_definition node before the cursor.

BTW, just to make sure I'm not missing anything, if I want to traverse the parse tree and find a function_definition node, I would write something like this, right? Ie, you need to create a node with ts_tree_cursor_current_node in every step.

TSNode node;
...traverse in a loop {
  node = ts_tree_cursor_current_node(cursor)
  (check for node's type)
}

casouri avatar Dec 14 '22 06:12 casouri

Yes, exactly. The tree cursor lets you retrieve nodes as you walk the tree.

maxbrunsfeld avatar Dec 14 '22 07:12 maxbrunsfeld

Thanks, but how about traversing backwards in the tree with a cursor? That would be useful for, eg, searching for a function_definition node before the cursor.

casouri avatar Dec 14 '22 08:12 casouri

Ok, I looked at the source of ts_node_prev_sibling, it seems the only way to traverse backwards is to traverse forwards and stop at the node just before the starting node. I don't have any questions now. Thanks :-)

casouri avatar Dec 14 '22 20:12 casouri

In at least one context — emacs with python-ts-mode, navigating from a node at point up to root — ts_node_parent is quite a bit faster and scales much better (logarithmic with line number) than ts_tree_cursor_goto_parent, the latter growing in time used by over 100x from beginning to end of a large (>8k line) file.

Test results here. Perhaps the current emacs 29 implementation of node-parent using cursors is sub-optimal in some way?

jdtsmith avatar Aug 19 '23 14:08 jdtsmith