py-tree-sitter

Segfault when sorting captures by `start_point`

Open Rot127 opened this issue 7 months ago • 10 comments

It is possible to crash tree-sitter when many captures are sorted.

Valgrind output:

==7579== Invalid read of size 4
==7579==    at 0x75BA4D8: ts_query_end_byte_for_pattern (query.c:2887)
==7579==    by 0x75A54D4: query_end_byte_for_pattern (query.c:699)
==7579==    by 0x4A5B48D: method_vectorcall_VARARGS (descrobject.c:324)
==7579==    by 0x49A9DD6: UnknownInlinedFun (pycore_call.h:168)
==7579==    by 0x49A9DD6: PyObject_Vectorcall (call.c:327)
==7579==    by 0x49BA14E: _PyEval_EvalFrameDefault (generated_cases.c.h:1843)
==7579==    by 0x4A146B0: UnknownInlinedFun (pycore_ceval.h:119)
==7579==    by 0x4A146B0: UnknownInlinedFun (ceval.c:1816)
==7579==    by 0x4A146B0: UnknownInlinedFun (call.c:413)
==7579==    by 0x4A146B0: UnknownInlinedFun (pycore_call.h:168)
==7579==    by 0x4A146B0: method_vectorcall (classobject.c:62)
==7579==    by 0x4A99B37: UnknownInlinedFun (call.c:285)
==7579==    by 0x4A99B37: _PyObject_Call (call.c:348)
==7579==    by 0x49BE5F2: UnknownInlinedFun (call.c:373)
==7579==    by 0x49BE5F2: UnknownInlinedFun (call.c:381)
==7579==    by 0x49BE5F2: _PyEval_EvalFrameDefault (generated_cases.c.h:1355)
==7579==    by 0x4A8C03A: PyEval_EvalCode (ceval.c:604)
==7579==    by 0x4ACAD22: run_eval_code_obj (pythonrun.c:1381)
==7579==    by 0x4AC8342: run_mod (pythonrun.c:1466)
==7579==    by 0x4AC4DD5: pyrun_file (pythonrun.c:1295)
==7579==  Address 0x621e33c is 39,372 bytes inside an unallocated block of size 145,536 in arena "client"
==7579== 

The code which triggers it:

from tree_sitter import Node, Query


def query_captures_22_3(query: Query, node: Node) -> list[tuple[Node, str]]:
    result = list()
    captures = query.captures(node)
    captures_sorted = dict()

    # Commenting out these lines will prevent the segfault
    # FAULTY BEGIN
    nodes: list[Node]
    for name, nodes in captures.items():
        captures_sorted[name] = sorted(nodes, key=lambda n: n.start_point)
    # FAULTY END

    while len(captures_sorted) != 0:
        for name, nodes in captures_sorted.items():
            node = nodes.pop(0)
            result.append((node, name))
        captures_sorted = {k: l for k, l in captures_sorted.items() if len(l) != 0}
    return result

Rot127 avatar May 16 '25 16:05 Rot127

The crash is intermittent. It segfaults most of the time but not always, in roughly 8 out of 10 runs.

Rot127 avatar May 16 '25 17:05 Rot127

Up to 555 nodes are sorted in the captures dict. When it segfaults, it always does so after sorting the same captures and after the function has returned.

For whatever reason, I can't reproduce it when the script is run in PyCharm, only from the command line.

Rot127 avatar May 16 '25 17:05 Rot127

@ObserverOfTime Is there any chance you will find time to fix this soon? If not, that is OK, but I'd then like to plan for a workaround, because the v22 version no longer builds with Python 3.13.

Rot127 avatar Jun 03 '25 13:06 Rot127

That version is not supported. Does the crash occur in the latest version and/or master branch?

ObserverOfTime avatar Jun 03 '25 14:06 ObserverOfTime

> That version is not supported. Does the crash occur in the latest version and/or master branch?

The crash occurs with the latest release (24.0), running on Python ~~3.13~~ 3.12.

Rot127 avatar Jun 03 '25 15:06 Rot127

Can you share a file and query that results in the crash?

ObserverOfTime avatar Jun 04 '25 08:06 ObserverOfTime

Sorry, I should have given you a minimal reproducible example all along. It will take a little while to isolate the code from our tool. Will report back soon.

Rot127 avatar Jun 06 '25 11:06 Rot127

Done: https://github.com/Rot127/ts-py-debug

Note that it only crashes reliably on Python 3.12, not on 3.13 as I said before. I think I had the wrong venv enabled.

Rot127 avatar Jun 06 '25 16:06 Rot127

@ObserverOfTime have you had a chance to look at this issue?

notxvilka avatar Jun 19 '25 16:06 notxvilka

@Rot127 The issue is that you're using the byte offset instead of the pattern index. https://github.com/Rot127/ts-py-debug/blob/main/patches/StreamOperation.py#L118

Fixed the crash by raising an IndexError if the supplied number exceeds the pattern count.

ObserverOfTime avatar Jun 19 '25 17:06 ObserverOfTime

@ObserverOfTime Thanks a lot!

Rot127 avatar Jun 22 '25 12:06 Rot127
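
For readers hitting the same crash on a release without the fix: the resolution above (raise an IndexError when the supplied number exceeds the pattern count) can be approximated on the caller's side. The following is only a sketch, not the actual patch. `checked_end_byte` and `FakeQuery` are hypothetical names introduced here; the code assumes the query object exposes a `pattern_count` attribute alongside `end_byte_for_pattern`, which should be checked against the installed py-tree-sitter version.

```python
class FakeQuery:
    """Hypothetical stand-in for tree_sitter.Query, used only to keep this sketch self-contained."""

    def __init__(self, pattern_end_bytes):
        # One end byte per pattern; the list position is the pattern index.
        self._ends = list(pattern_end_bytes)

    @property
    def pattern_count(self):
        return len(self._ends)

    def end_byte_for_pattern(self, index):
        return self._ends[index]


def checked_end_byte(query, pattern_index):
    """Validate the pattern index before handing it to the query (hypothetical helper)."""
    count = query.pattern_count  # assumption: attribute exists on the real Query
    if not 0 <= pattern_index < count:
        raise IndexError(
            f"pattern index {pattern_index} out of range for {count} patterns"
        )
    return query.end_byte_for_pattern(pattern_index)


query = FakeQuery([10, 25])          # two patterns, ending at bytes 10 and 25
print(checked_end_byte(query, 1))    # valid pattern index, prints 25

try:
    checked_end_byte(query, 118)     # a byte offset passed by mistake
except IndexError as exc:
    print(f"rejected: {exc}")
```

With a guard like this, passing a byte offset where a pattern index is expected fails loudly in Python instead of reading out-of-bounds memory in the C library.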