cpython icon indicating copy to clipboard operation
cpython copied to clipboard

GC in free-threaded build has problems with `Py_INCREF()`/`Py_DECREF()` in `tp_traverse` handlers

Open colesbury opened this issue 1 year ago • 9 comments

Bug report

This came up in the context of nanobind. Nanobind implements (in a test) a traverse function:

int funcwrapper_tp_traverse(PyObject *self, visitproc visit, void *arg) {
    FuncWrapper *w = nb::inst_ptr<FuncWrapper>(self);

    nb::object f = nb::cast(w->f, nb::rv_policy::none);
    Py_VISIT(f.ptr());

    return 0;
};

The nb::object smart pointer is internally reference counted. In other words, the above is roughly equivalent to:

int funcwrapper_tp_traverse(PyObject *self, visitproc visit, void *arg) {
    PyObject *f = self->w->f;
    Py_INCREF(f);
    Py_VISIT(f);
    Py_DECREF(f);
    return 0;
};

This leads to a leak in the free-threaded GC for subtle reasons: when determining resurrected objects, the free-threaded GC uses ob_ref_local to compute the refcount - incoming references, which may be (temporarily) negative. In this case, Py_INCREF() adds 1 to the refcount, but by the time Py_DECREF() is called, the local refcount is -1 which makes the object appear immortal.

There are a number of limitations on the implementations of traverse functions, which are not well documented. For example, it's not safe to allocate, free, track, or untrack Python objects. It's unclear to me whether there are other issues with calling refcounting functions in traverse callbacks.

I think we can make handle_resurrected_objects more robust to this by splitting the first pass over state->unreachable into two passes.

See also: https://github.com/PyO3/pyo3/issues/3165, which was not related to the free-threaded build.

Linked PRs

  • gh-142232
  • gh-142271
  • gh-142272
  • gh-142422
  • gh-142423
  • gh-142567

colesbury avatar Aug 22 '24 19:08 colesbury

There are two things here to address:

  • Better documenting the current situation for all kinds of builds.
  • Try to be more resilient or maybe warn if any of this happens.

I can try to give a go to both of these. For (2) we can potentially "lock" the GC into a state that emits a warning of some sort if the GC heads are modified during traversals or if any GC function that we don't expect it's called. WDYT?

pablogsal avatar Aug 22 '24 20:08 pablogsal

I think we can make handle_resurrected_objects more robust to this by splitting the first pass over state->unreachable into two passes.

Can you elaborate on this to double check that we are in the same page?

pablogsal avatar Aug 22 '24 20:08 pablogsal

The pass over the unreachable (but possibly resurrected) objects computes the incoming references in ob_ref_local. The final value is always >= 0, but it can be temporarily negative. (Well, sort of "negative". It's a uint32_t so it wraps around.) This is related to the problem seen by nanobind because -1 (or UINT32_MAX) looks like "immortal".

https://github.com/python/cpython/blob/297f2e093ec95800ae2184330b8408c875523467/Python/gc_free_threading.c#L1004-L1013

We can change this so that the first pass initializes ob_ref_local to the current refcount (less one because the GC in this case holds a strong reference). The second pass than calls the traverse handlers. In this structure, the traverse handlers never see a ob_ref_local that looks like an immortal object.

I am still not convinced that it's safe (at least in the free-threaded build) to call Py_INCREF/DECREF during GC traversal, but this at least would fix the leak in nanobind. (I'd characterize it as more robust, rather than "safe".)

        ...
        // Subtract one to account for the reference from the worklist.
        op->ob_ref_local += (uint32_t)refcount - 1;
   }

   WORKSTACK_FOR_EACH(&state->unreachable, op) {
        traverseproc traverse = Py_TYPE(op)->tp_traverse;
        (void) traverse(op,
            (visitproc)visit_decref_unreachable,
            NULL);
    }

colesbury avatar Aug 22 '24 21:08 colesbury

I think for the general issue I would like to:

  • Raise or warn if any of the public GC APIs are used while we are calling the traverse functions.
  • (Maybe only in debug mode) use the double pass to detect extensions changing reference counts by comparing the gathered ref counts in the first pass against a mixed pass. If these two don't match then raise or warn.

What are your thoughts?

pablogsal avatar Aug 22 '24 21:08 pablogsal

Raise or warn if any of the public GC APIs are used while we are calling the traverse functions

To be clear; the problem in this instance was the use of Py_INCREF/Py_DECREF rather than things like PyObject_GC_Track, although those would definitely be problematic in a tp_traverse.

It seems fine to add extra checks in debug builds, but I'd be worried about introduce slow downs in optimized builds for many of these functions. We probably would also need to be careful about how we report warnings and errors. I think creating a Python warning or error object during a tp_traverse handler execution may cause the same sort of problems we're trying to avoid.

I'm also hesitant to rush any fixes. The change to nanobind seems rather straightforward. And @ngoldbaum has run into something that might be related in PyO3 -- I don't fully understand that problem yet.

colesbury avatar Aug 22 '24 22:08 colesbury

I'm also hesitant to rush any fixes

I share the same concerns so don't worry. In any case all of these proposal would be only on debug builds and is just a way to warn about the problem. I don't think we need to "fix" anything other than the docs to specify the contract. Any other thing is going to add slowdowns or extra complexity in release builds

pablogsal avatar Aug 22 '24 22:08 pablogsal

For what it's worth the PyO3 issue was indeed a refcounting op hidden behind a smart pointer (credit @colesbury for identifying that at the time). https://github.com/PyO3/pyo3/pull/4479

davidhewitt avatar Feb 06 '25 13:02 davidhewitt

This means that tp_traverse must not fail (because it can't set an exception), right?

Does tp_clear have similar requirements?

encukou avatar Dec 16 '25 16:12 encukou

I think "the traversal function must not have any side effects" means that tp_traverse cannot itself fail or set an exception. However, the visitor passed to tp_traverse can fail and that failure is passed up via Py_VISIT. For example, the gc.get_referrers visitor appends to a list, which can fail with an out of memory error.

In other words, the GC is in control of whether the tp_traverse call can fail and, in some places, the GC assumes the call must not fail. (This is true in both the FT and default build GCs -- look for (void) traverse)

tp_clear can return -1 and it can set an exception. However, the return code has always been ignored (even in Python 2.0). Any exception is printed as unraisable and cleared.

I think it generally makes more sense for the tp_clear implementation to handle it's own internal exceptions (like tp_dealloc implementations are required to), but it doesn't really matter.

colesbury avatar Dec 16 '25 17:12 colesbury

A topic for adding API that would be safe to use in tp_traverse: https://discuss.python.org/t/105331

encukou avatar Dec 17 '25 13:12 encukou

Related: #142981

davidhewitt avatar Dec 19 '25 15:12 davidhewitt