cpython icon indicating copy to clipboard operation
cpython copied to clipboard

Python 3.14 stack overflow detection is incompatible with C++ Boost make_fcontext() coroutines

Open vstinner opened this issue 3 months ago • 29 comments

Python 3.14 introduced a new stack overflow detection mecanism: InternalDocs/stack_protection.md (#130396).

The KiCad application uses C++ Boost make_fcontext() coroutines which runs coroutine in their own stack.

Code example from fcontext doc:

// context-function
void f(intptr);

// creates a new stack
std::size_t size = 8192;
void* sp(std::malloc(size));

// context fc uses f() as context function
// fcontext_t is placed on top of context stack
// a pointer to fcontext_t is returned
fcontext_t fc(make_fcontext(sp,size,f));

_Py_InitializeRecursionLimits() is called in the main thread, whereas _Py_CheckRecursiveCall() is called for the first time in a coroutine (make_fcontext()).

Problem: Python detects a stack overflow because it's not aware that the stack base address and size changed when make_fcontext() was called.

pthread functions such as pthread_attr_getguardsize() are incompatible with make_fcontext().

cc @markshannon

Linked PRs

  • gh-139667
  • gh-139668
  • gh-141551
  • gh-141661
  • gh-141711
  • gh-141892
  • gh-141944

vstinner avatar Oct 06 '25 12:10 vstinner

@hugovk: I used the release-blocker label to raise awareness of this issue, but I'm not sure that it should hold Python 3.14.0 final release. We might be able to develop a workaround in Python 3.14.1.

vstinner avatar Oct 06 '25 12:10 vstinner

KiCad issue: https://gitlab.com/kicad/code/kicad/-/issues/21890#note_2800807662

vstinner avatar Oct 06 '25 12:10 vstinner

cc @markshannon

encukou avatar Oct 06 '25 13:10 encukou

To me the simplest partial fix would be to reread the stack limits before throwing the abort; that would stop spurious aborts (and be low cost), however there's also the opposite case, where if the new stack is higher than the old one, the check won't trigger at all.

penguin42 avatar Oct 06 '25 14:10 penguin42

Do we need a new (internal) API that can be called after jump_fcontext? Presumably it only occurs when user code explicitly requests it, which should mean that it can also call a CPython API after switching if needed, and it looks like the members of fcontext_t are public, so we can accept them as arguments.

zooba avatar Oct 06 '25 14:10 zooba

Do these coroutines each use their own Python thread state?

reread the stack limits before throwing the abort

Unfortunately on some platforms where we can't read stack limits, so we approximate stack top using the current address. In that case, stack limits will be different each time.

encukou avatar Oct 06 '25 14:10 encukou

We can (mostly) do this without a new API, using a variant of @penguin42's suggestion.

What we would do is:

  • Change the recursion check from here_addr < tstate->c_stack_soft_limit to here_addr - state->c_stack_window_base < WINDOW_SIZE
  • If the check fails, check if there is an actual stack overflow, then update the window if we are within the stack.

The actual window size doesn't matter, but it should be a constant to keep the cost low. It will a bit slower, but probably not measurably so.

Unfortunately on some platforms where we can't read stack limits, so we approximate stack top using the current address. In that case, stack limits will be different each time.

True, but this bug has only surfaced on linux (so far).

markshannon avatar Oct 06 '25 14:10 markshannon

Having said that, new unstable ABI to notify of stack changes is probably the best long term option.

markshannon avatar Oct 06 '25 15:10 markshannon

Do these coroutines each use their own Python thread state?

Coroutines reuse the existing Python thread state.

Do we need a new (internal) API that can be called after jump_fcontext?

It sounds like a good idea, but this function would need a stack base address and stack size parameters, since pthread_attr_getguardsize() doesn't work in this case. make_fcontext() has a known stack base address and size: its the first and second arguments.

vstinner avatar Oct 06 '25 15:10 vstinner

##139667: _Py_InitializeRecursionLimits could detect that the platform-specific API is “lying”, and fall back to here_addr. Then, users should be fine if they call _Py_InitializeRecursionLimits after each context switch to reset the stack limits.

encukou avatar Oct 06 '25 15:10 encukou

(edited: I meant _Py_InitializeRecursionLimits, not _Py_get_machine_stack_pointer)

encukou avatar Oct 06 '25 16:10 encukou

I created https://github.com/python/cpython/pull/139668 to add PyUnstable_ThreadState_SetStack() and PyUnstable_ThreadState_ResetStack() functions.

vstinner avatar Oct 06 '25 16:10 vstinner

@hugovk: I used the release-blocker label to raise awareness of this issue, but I'm not sure that it should hold Python 3.14.0 final release. We might be able to develop a workaround in Python 3.14.1.

Thanks, this can wait for 3.14.1.

hugovk avatar Oct 07 '25 08:10 hugovk

I have the same problem with codes that use Argobots, a C library to create co-routines in a way similar to Boost co-routines. If Python code executes in an another co-routine, _Py_CheckRecursiveCall wrongly detects a stack overflow.

mdorier avatar Oct 17 '25 16:10 mdorier

Do you think that you can test my pull request: build Python 3.15 with this change (from my branch) and adapt your code to call PyThreadState_SetStack()?

vstinner avatar Oct 17 '25 16:10 vstinner

We appear to be having the same issue over in PythonCall.jl, a Julia package for interoperating with Python. Similar to other reports, it seems to be happening when calling into Python from a task (Julia co-routine) that isn't the main task, which presumably has its own stack a long way from the main stack.

https://github.com/JuliaPy/PythonCall.jl/issues/694

I'll try out PyUnstable_ThreadState_SetStack but it's unclear what the arguments should be - such details about the stack will be implementation details and hard to get at.

cjdoris avatar Oct 26 '25 16:10 cjdoris

@markshannon Do you have an idea what should be done if it's not feasible to get the stack size?

I guess one can disable the protection by calling PyUnstable_ThreadState_SetStack with the entire address space...

encukou avatar Oct 27 '25 11:10 encukou

Adding onto this list of users with the same problem, we have the same issue in the Realm runtime for HPC that does its own user-level threading.

https://github.com/StanfordLegion/realm/issues/353

lightsighter avatar Nov 07 '25 08:11 lightsighter

I merged my PR as the change https://github.com/python/cpython/commit/b99db92dde38b17c3fba3b5db76a383ceddfce49 which adds PyUnstable_ThreadState_SetStackProtection(start_address, size) and PyUnstable_ThreadState_ResetStackProtection() functions.

Someone should now try these functions on KiCad, Argobots, PythonCall.jl, Realm runtime, and other affected projects.

We should decide if this API is relevant, usable on these affected projects, or if another approach should be tried. If the PyUnstable_ThreadState_SetStackProtection() way solve the issue for most affected projects, we can consider backporting the functions to the 3.14 branch.

From what I understood, another approach would be to switch to Python 3.13 counter approach. But I don't know if it can be implemented in practice and how.

vstinner avatar Nov 13 '25 16:11 vstinner

We should decide if this API is relevant, usable on these affected projects

Just adding my two cents: I think the API can be useful to the extent that it will allow us to confirm that this is indeed the issue we're dealing with and we can temporarily work around it, but I don't think it is a viable long-term solution. Consider for example the reproducer here. A Realm client is using an off-the-shelf library like pybind to call into Python from C++. None of the three parties in this program have all the knowledge necessary to correctly place the relevant API calls precisely in all contexts:

  • Realm knows when it is doing a green threading context switch, but has no idea which tasks (if any) are using Python (>99% of all Realm tasks do not use Python).
  • pybind has no knowledge of the context in which it is being used, in particular it doesn't know when it is used in the context of a system with green threads like Realm.
  • The client knows that Realm is using green threads and that they are calling into Python from C++, but have no knowledge of when context switches might have transpired. They might even transpire inside a Python library that calls back into C++ and back into Realm (e.g. cupynumeric).

Realm is not going to call these functions every time it does a context switch because that adds overhead to all the clients not using Python. The pybind library is probably going to be unwilling to call these functions defensively in all cases since they also do not want to add overheads unnecessary for the vast majority of their clients not using green threads. Clients are then left in the unenviable position of needing to navigate a potentially large and complicated software stack with many libraries and identify all the possible places where they might possibly need to add such calls and litter their code base with them just to get their code running.

It feels to me like Python is making an assumption about the relationship between (kernel) threads and stacks that is beyond its purview and in general not a safe thing to assume. While green threads are one way that this assumption is violated it's not the only way. Some threading libraries also use cactus stacks to manage light-weight thread forking and would probably also violate the linear stack assumption being made here. Given that the relationship between stacks and (kernel) threads is something that isn't controlled by Python (except when threads are made through the Python interpreter), I think it should avoid enforcing these checks by default on "external" threads. A better solution would be to have the checks be on by default for threads made through the Python interpreter and off by default for threads not created through the interpreter (which can be easily detected) and then have a way for clients to dynamically toggle whether the checks are on/off for their specific thread at their prerogative.

lightsighter avatar Nov 14 '25 05:11 lightsighter

I was going to write a similarly long analysis for Argobots, but it's essentially the same as Realm. The only way Argobots could use this is if it placed a call at every possible context-switching point (so when a ULT is created, when it acquires a lock/mutex/future, gets out of a barrier, yields, etc.). It would add a large overhead (context-switching is otherwise a few cycles in Argobots), just on the off-chance that Python might be called from the ULT. Stack overflow detection is also not needed from Python as Argobots has its own opt-in mechanism (based on mprotect) to detect stack overflows in ULTs.

mdorier avatar Nov 14 '25 09:11 mdorier

Very similar story here for my Julia package PythonCall - its going to be practically impossible to know where to put these. The Julia task manager doesn't know if a task (which has its own stack) might call into Python, and some Python code in Julia will be written agnostic to which task it occurs in. So nobody has the information to insert these calls.

The best I could do in PythonCall would be to call this function at initialisation time with the entire address space to effectively disable these checks.

As with other comments, Julia has its own stack protection mechanisms too.

Is it feasible to assume that if the stack pointer has moved waaaayyy outside the expected range (more than a megabyte say) that we are now on a different stack, and so the limits should be recomputed? Or go back to the old counting method?

cjdoris avatar Nov 14 '25 10:11 cjdoris

Is it feasible to assume that if the stack pointer has moved waaaayyy outside the expected range (more than a megabyte say) that we are now on a different stack, and so the limits should be recomputed? Or go back to the old counting method?

I don't think it's reasonable. In Argobots we may have many tasks, each with a small stack, and these stacks may have been pre-allocated from a larger memory pool, so you could jump less than a MB and be in a different stack.

mdorier avatar Nov 14 '25 11:11 mdorier

Well, Mark is silent, and I don't see a good solution. I started a Discuss topic for more visibility.

@hugovk As RM you'll probably want to read that. Sorry :(

encukou avatar Nov 14 '25 17:11 encukou

Are we ready to close this issue?

hugovk avatar Nov 28 '25 16:11 hugovk

Ok, let's close the issue.

vstinner avatar Nov 28 '25 17:11 vstinner

I'm still waiting for @markshannon's answer regarding platforms without stack limits API: https://discuss.python.org/t/104880/28 How do we avoid crashes there?

encukou avatar Nov 28 '25 17:11 encukou

Sorry newbie question - is there a build I can use containing the above PR fix or do I need to build Python myself? This is so I can test it against my Julia package PythonCall.jl.

cjdoris avatar Nov 28 '25 17:11 cjdoris

@cjdoris Right now, you'll need to build it yourself; see the devguide for info. The next release of 3.14 is planned for Tuesday (see PEP 745) and the next 3.15 for two weeks after that (see PEP 790).

hugovk avatar Nov 28 '25 17:11 hugovk