pybind11 [BUG]: `pybind11::gil_safe_call_once_and_store` is not safe for Python subinterpreters

Required prerequisites

[x] Make sure you've read the documentation. Your issue may be addressed there.
[x] Search the issue tracker and Discussions to verify that this hasn't already been reported. +1 or comment there if it has.
[ ] Consider asking first in the Gitter chat room or in a Discussion.

What version (or hash if on master) of pybind11 are you using?

3.0.1

Problem description

Overview

I am currently updating my C++ extension to support Python subinterpreters (PEP 734), using pybind11 3.x and Python 3.14+.

I have discovered that pybind11::gil_safe_call_once_and_store is fundamentally unsafe in a multi-interpreter environment. It relies on static (process-global) storage to cache Python objects. When the interpreter that initialized the static storage is destroyed, the cached pointers become invalid, leading to segmentation faults when subsequent interpreters attempt to access them.

The core issue is a lifetime mismatch between C++ static storage and Python interpreter contexts.

Technical Diagnosis

Module Imports: Imported modules (e.g., collections) are interpreter-dependent. Caching them statically means retaining a reference to a module object belonging to a specific interpreter.
Interned/Immutable Objects: Immutable objects such as float, int, and str can be shared between interpreters, but that does not make caching them safe. If the cached object (for example, an interned string from PyUnicode_InternFromString) is created by a subinterpreter and that subinterpreter is later destroyed, the static pointer stored by gil_safe_call_once_and_store becomes a dangling pointer.
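
The lifetime mismatch described above can be modeled without pybind11. The following is a standard-library-only analogy, not pybind11 code: FakeInterpreter, ProcessCache, and the helper functions are hypothetical names, and a std::weak_ptr stands in for the raw cached pointer so that the stale state can be observed safely instead of crashing.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for a (sub)interpreter: it owns its objects, and
// destroying the interpreter destroys them.
struct FakeInterpreter {
    std::shared_ptr<std::string> defaultdict_type =
        std::make_shared<std::string>("collections.defaultdict");
};

// Process-global cache, analogous to the static storage behind
// gil_safe_call_once_and_store. A weak_ptr replaces the raw pointer so the
// dangling state is observable rather than undefined behavior.
struct ProcessCache {
    bool initialized = false;
    std::weak_ptr<std::string> value;
};

inline ProcessCache &process_cache() {
    static ProcessCache cache;
    return cache;
}

// "Call once": only the first interpreter to arrive stores a result.
inline void call_once_and_store(const FakeInterpreter &interp) {
    assert(interp.defaultdict_type != nullptr);
    ProcessCache &cache = process_cache();
    if (!cache.initialized) {
        cache.value = interp.defaultdict_type;
        cache.initialized = true;
    }
}

// True while the interpreter that filled the cache is still alive.
inline bool cached_value_is_alive() { return !process_cache().value.expired(); }
```

If a short-lived FakeInterpreter fills the cache and is then destroyed, a later call from another interpreter is a no-op and cached_value_is_alive() reports false. With a raw pointer, that same state is the dangling pointer behind the segfault.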

Reproduction

The issue triggers a segmentation fault when a subinterpreter initializes the static cache and is then destroyed before another interpreter accesses it.

CI Failure/Core Dump: https://github.com/metaopt/optree/actions/runs/20019607592/job/57403715607?pr=245#step:18:266

Problematic C++ Pattern

Module imports are interpreter-dependent. The previous best-practice code is invalid in a multi-interpreter context. I need to re-fetch the object on every call instead of keeping a per-process static cache.

#if defined(MYPACKAGE_HAS_SUBINTERPRETER_SUPPORT)

inline py::object get_defaultdict() {
    return py::getattr(py::module_::import("collections"), "defaultdict");
}

#else

inline const py::object &get_defaultdict() {
    PYBIND11_CONSTINIT static py::gil_safe_call_once_and_store<py::object> storage;
    return storage
        .call_once_and_store_result([]() -> py::object {
            return py::getattr(py::module_::import("collections"), "defaultdict");
        })
        .get_stored();
}

#endif

Immutable objects such as float, int, and str can be shared between interpreters. However, PYBIND11_CONSTINIT static pybind11::gil_safe_call_once_and_store still causes a segmentation fault when the C++ static is initialized by a subinterpreter rather than the main interpreter: the stored result of call_once is created by that subinterpreter, which may be gone by the time another interpreter accesses the result.

The test case that triggers the issue:

def test_import_in_subinterpreter_before_main():
    """
    Triggers segfault by initializing the C++ static cache in a subinterpreter,
    destroying that interpreter, and then accessing the cache in the main interpreter.
    """
    script = textwrap.dedent("""
        import contextlib
        import gc
        from concurrent import interpreters

        # 1. Initialize library in a subinterpreter (sets the static C++ pointer)
        subinterpreter = None
        with contextlib.closing(interpreters.create()) as subinterpreter:
            subinterpreter.exec('import optree')

        # 2. Subinterpreter dies here. The cached object in C++ is now invalid.

        # 3. Import in main interpreter tries to read the invalid static pointer -> Segfault
        import optree

        del optree, subinterpreter
        for _ in range(10):
            gc.collect()
    """)

    check_script_in_subprocess(script, rerun=5)

def test_import_in_subinterpreters_concurrently():
    script = textwrap.dedent("""
        from concurrent.futures import InterpreterPoolExecutor, as_completed

        def check_import():
            import optree

        with InterpreterPoolExecutor(max_workers=32) as executor:
            futures = [executor.submit(check_import) for _ in range(128)]
            for future in as_completed(futures):
                future.result()
    """)
    check_script_in_subprocess(script, rerun=5)

I have resolved this in my project by disabling pybind11::gil_safe_call_once_and_store entirely when subinterpreter support is detected. Instead, I re-create objects every time they are needed to ensure they belong to the current interpreter context.

#if defined(MYPACKAGE_HAS_SUBINTERPRETER_SUPPORT)

#    define Py_Declare_ID(name)                                                                    \
        namespace {                                                                                \
        [[nodiscard]] inline PyObject *Py_ID_##name() {                                            \
            PyObject * const ptr = PyUnicode_InternFromString(#name);                              \
            if (ptr == nullptr) [[unlikely]] {                                                     \
                throw py::error_already_set();                                                     \
            }                                                                                      \
            return ptr;                                                                            \
        }                                                                                          \
        }  // namespace

#else

#    define Py_Declare_ID(name)                                                                    \
        namespace {                                                                                \
        [[nodiscard]] inline PyObject *Py_ID_##name() {                                            \
            PYBIND11_CONSTINIT static py::gil_safe_call_once_and_store<PyObject *> storage;        \
            return storage                                                                         \
                .call_once_and_store_result([]() -> PyObject * {                                   \
                    PyObject * const ptr = PyUnicode_InternFromString(#name);                      \
                    if (ptr == nullptr) [[unlikely]] {                                             \
                        throw py::error_already_set();                                             \
                    }                                                                              \
                    Py_INCREF(ptr); /* leak a reference on purpose */                              \
                    return ptr;                                                                    \
                })                                                                                 \
                .get_stored();                                                                     \
        }                                                                                          \
        }  // namespace

#endif

#define Py_Get_ID(name) (::Py_ID_##name())

Reproducible example code

Is this a regression? Put the last known working version here if it is.

Not a regression

Dec 08 '25 09:12 XuehaiPan

The custom exception translator is relying on pybind11::gil_safe_call_once_and_store:

https://github.com/pybind/pybind11/blob/1dc76208d5822e78fc8129552b4d622c78b7ce64/include/pybind11/pybind11.h#L3370-L3392

The translated Python exception type is created only once per process and attached to the first interpreter seen. That interpreter is not necessarily the main interpreter, because the module can be imported first in a subinterpreter. If it is a subinterpreter, it can also be destroyed manually. The other interpreters (the main one and any other subinterpreters) can then access an invalid object through the per-process C++ static.

In the first interpreter seen, the exception class is assigned to the parent scope as a member called name. That assignment does not happen in the other interpreters, so they can get AttributeError when accessing that member.

The solution is to make pybind11::gil_safe_call_once_and_store interpreter-dependent: store the result per interpreter instead of in per-process C++ static storage.

For example, the pybind11 internals are stored in the interpreter state dict.

https://github.com/pybind/pybind11/blob/228f56361016ab9e27d5ef21853542dab3e37693/include/pybind11/detail/internals.h#L565-L588

https://github.com/pybind/pybind11/blob/228f56361016ab9e27d5ef21853542dab3e37693/include/pybind11/detail/internals.h#L625-L648
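
As a rough illustration of the proposal, here is a standard-library-only model of a call-once result kept in per-interpreter state. It is a sketch under assumptions, not pybind11's implementation: FakeInterpreter, its state_dict member (standing in for the dict that CPython exposes through PyInterpreterState_GetDict), and this call_once_and_store signature are all hypothetical, and the model ignores the GIL and threading.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for an interpreter and its state dict. The dict is
// destroyed together with the interpreter, so nothing outlives its owner.
struct FakeInterpreter {
    std::unordered_map<std::string, std::string> state_dict;
};

// Runs `factory` at most once per interpreter and stores the result in that
// interpreter's own state dict rather than in a process-global static.
inline const std::string &call_once_and_store(FakeInterpreter &current,
                                               const std::string &key,
                                               const std::function<std::string()> &factory) {
    assert(!key.empty());
    auto it = current.state_dict.find(key);
    if (it == current.state_dict.end()) {
        it = current.state_dict.emplace(key, factory()).first;
    }
    return it->second;
}
```

In this model, each interpreter runs the factory once for itself, and destroying one interpreter cannot invalidate what another interpreter sees.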

Dec 08 '25 13:12 XuehaiPan

cc @b-pass @rwgk @henryiii

Dec 10 '25 05:12 XuehaiPan

cc @tkoeppe

Dec 10 '25 08:12 rwgk

This seems eminently reasonable: all erstwhile global state should become (dynamically allocated) state of the (sub)interpreter. State with "once" semantics can use something like std::call_once as a user-defined replacement for block-scope static variables.

Dec 10 '25 21:12 tkoeppe

@XuehaiPan Is there a chance that you can work on the fixes? @b-pass, could you help with guidance?

Dec 10 '25 22:12 rwgk

Yep, I suggest making it work the same as pybind11::detail::internals_pp_manager.

Keep a pointer to the value and a pointer to the owning interpreter, and if the current interpreter matches then just return the pointer (avoid re-doing the internals dict lookup)
Check get_num_interpreters_seen() to avoid doing extra work if there aren't multiple interpreters.
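
The two points above could look roughly like this in a standard-library-only model. This is a hedged sketch, not the code of pybind11::detail::internals_pp_manager: FakeInterpreter, PerInterpreterOnce, and this num_interpreters_seen() counter are hypothetical stand-ins, the model is single-threaded, and it omits invalidating the cached owner when that interpreter is destroyed.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical counter, modeled on the idea behind get_num_interpreters_seen().
inline int &num_interpreters_seen() {
    static int count = 0;
    return count;
}

// Hypothetical interpreter with a per-interpreter state dict.
struct FakeInterpreter {
    std::unordered_map<std::string, std::string> state_dict;
    FakeInterpreter() { ++num_interpreters_seen(); }
};

class PerInterpreterOnce {
public:
    explicit PerInterpreterOnce(std::string key) : key_(std::move(key)) {}

    const std::string &get(FakeInterpreter &current,
                           const std::function<std::string()> &factory) {
        // Fast path: with a single interpreter, or when the caller is the
        // interpreter that owns the cached value, skip the dict lookup.
        const bool single = num_interpreters_seen() <= 1;
        if (value_ != nullptr && (single || owner_ == &current)) {
            return *value_;
        }
        // Slow path: look up (or create) the value in the current
        // interpreter's state dict, then remember the owner.
        ++slow_lookups_;
        auto it = current.state_dict.find(key_);
        if (it == current.state_dict.end()) {
            it = current.state_dict.emplace(key_, factory()).first;
        }
        owner_ = &current;
        value_ = &it->second;  // element addresses stay stable across rehashing
        assert(value_ != nullptr);
        return *value_;
    }

    int slow_lookups() const { return slow_lookups_; }

private:
    std::string key_;
    FakeInterpreter *owner_ = nullptr;  // interpreter that owns value_
    std::string *value_ = nullptr;      // cached pointer into owner_'s dict
    int slow_lookups_ = 0;
};
```

Repeated calls from the same interpreter take the fast path. Switching interpreters costs one dict lookup but does not re-run the factory for an interpreter that already has a value.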

Dec 10 '25 23:12 b-pass

@XuehaiPan Is there a chance that you can work on the fixes?

I will work on the fixes this weekend.

Dec 11 '25 03:12 XuehaiPan

all erstwhile global state should become (dynamically allocated) state of the (sub)interpreter

I have a question about the life span of the call-once result.

Previously, the call-once result was a per-process static stored behind a raw pointer. It is intentionally leaked rather than managed by the Python GC, because calling a C++ destructor is not safe during program shutdown (the main interpreter may already be finalized). The OS frees the leaked memory at program shutdown anyway.

However, with the upcoming per-interpreter fix, the result is stored in per-interpreter storage, and a subinterpreter can be destroyed while the program is still running. Should we always leak the per-interpreter results (there will be multiple leaks), or free them using some approach (e.g., atexit)?

Dec 11 '25 03:12 XuehaiPan

It is intentionally leaked

I suppose the pointer isn't leaked, but the pointee is leaked. Static variables (including the (trivial) pointer) are destroyed at the end of the program. In the user-defined model using std::call_once you would generally want to set up an equivalent mechanism: during initialization, you should register a corresponding cleanup action somewhere, and on interpreter shutdown you should execute all those cleanup operations. (The same thing happens when a block-local static variable is initialized; this creates a dynamic entry in a finalizer list.)
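
A minimal sketch of that register-then-run-at-shutdown mechanism, using only the standard library. The names FakeInterpreter, at_shutdown, and emplace_for are hypothetical and not part of pybind11 or CPython. The sketch also assumes cleanup actions do not throw.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical interpreter that keeps a finalizer list, similar in spirit to
// the list the runtime keeps for block-scope statics.
struct FakeInterpreter {
    std::vector<std::function<void()>> finalizers;

    // Registers a cleanup action to run when this interpreter shuts down.
    void at_shutdown(std::function<void()> fn) {
        assert(fn != nullptr);
        finalizers.push_back(std::move(fn));
    }

    // Interpreter shutdown: run cleanup actions in reverse registration order.
    ~FakeInterpreter() {
        for (auto it = finalizers.rbegin(); it != finalizers.rend(); ++it) {
            (*it)();
        }
    }
};

// Creates a value on behalf of `interp` and registers its destruction, so the
// value lives exactly as long as the interpreter that created it.
template <typename T, typename... Args>
T *emplace_for(FakeInterpreter &interp, Args &&...args) {
    T *value = new T(std::forward<Args>(args)...);
    interp.at_shutdown([value] { delete value; });
    return value;
}
```

The pointer returned by emplace_for must not be used after its FakeInterpreter is destroyed, which is the same rule a per-interpreter call-once result would have to follow.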

What you choose to do with that logic is up to you, and you could indeed once again make a dynamic allocation during initialization that you never deallocate. However, this would seem like a very bad idea. The entire point of properly managing per-interpreter state locally is that interpreters can be brought up and down at will (almost "like normal code in a composable production ecosystem", one might glibly say; or others might say "like Lua"). Not being able to destroy state is a big problem, not just in that case, but in programming at production quality in general.

Perhaps the solution, as is often the case, is an extra level of indirection: if there is some global system that you cannot shut down, then don't initialize it from (the|any) Python interpreter directly, but instead make each interpreter initialize something that accesses some global singleton. Then each interpreter is conceptually self-contained, but uses a shared environment (similar to how each interpreter uses the same, global stdout, say). The global, shared system needs to support potentially concurrent use, and it should provide its own API to access its singleton instance, and the interpreter(s) can call that to obtain a reference to it.
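
One way to picture that indirection, as a standard-library-only sketch. GlobalService and ServiceHandle are hypothetical names chosen for illustration, and the ticket counter is just a placeholder for whatever the shared system actually does.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical process-wide system that is never shut down. It exposes its
// own accessor and tolerates concurrent use (an atomic counter here).
class GlobalService {
public:
    static GlobalService &instance() {
        static GlobalService service;  // initialization is thread-safe since C++11
        return service;
    }

    int next_ticket() { return counter_.fetch_add(1, std::memory_order_relaxed) + 1; }

private:
    GlobalService() = default;
    std::atomic<int> counter_{0};
};

// Per-interpreter object: created when an interpreter starts and destroyed
// with it. It holds only a reference to the singleton and owns no global state.
class ServiceHandle {
public:
    ServiceHandle() : service_(&GlobalService::instance()) {}

    int next_ticket() {
        assert(service_ != nullptr);
        return service_->next_ticket();
    }

private:
    GlobalService *service_;
};
```

Destroying a ServiceHandle leaves GlobalService untouched, so interpreters can come and go while the shared system keeps its state.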

Would that work?

Dec 11 '25 14:12 tkoeppe