Stack overflow collecting PGO data on Windows
Bug report
Bug description:
During a PGO build on Windows on main (471aa75), a test in test_functools fails with a stack overflow.
```
PCbuild\build.bat --pgo
```

```
0:02:50 load avg: 0.35 [19/44] test_functools
Windows fatal exception: stack overflow

Thread 0x00002b38 (most recent call first):
  File "C:\actions-runner\_work\benchmarking\benchmarking\cpython\Lib\test\libregrtest\win_utils.py", line 43 in _update_load

Current thread 0x00000e3c (most recent call first):
  File "C:\actions-runner\_work\benchmarking\benchmarking\cpython\Lib\test\test_functools.py", line 1875 in fib
  File "C:\actions-runner\_work\benchmarking\benchmarking\cpython\Lib\test\test_functools.py", line 1875 in fib
  ...

C:\actions-runner\_work\benchmarking\benchmarking\cpython>"C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Current\Bin\msbuild.exe" "C:\actions-runner\_work\benchmarking\benchmarking\cpython\PCbuild\\pythoncore.vcxproj" /t:KillPython /nologo /v:m /clp:summary /p:Configuration=PGInstrument /p:Platform=x64 /p:KillPython=true
Killing any running python.exe instances...
C:\actions-runner\_work\benchmarking\benchmarking\cpython\PCbuild\pyproject.props(184,5): error : PGO run did not succeed (no python313!*.pgc files) and there is no data to merge [C:\actions-runner\_work\benchmarking\benchmarking\cpython\PCbuild\pythoncore.vcxproj]

Build FAILED.
```
CPython versions tested on:
CPython main branch
Operating systems tested on:
Windows
Linked PRs
- gh-113668
- gh-113701
- gh-113944
- gh-114148
- gh-114263
- gh-114896
- gh-115770
- gh-124264
Trying to repro, I can't even manage to complete a PGO build (first step above).
It sounds like you are repro'ing (I probably wasn't clear enough). It fails during the PGO build while running the tests that generate PGO data.
git bisect tells me the first bad commit is 45e09f921be55e23bed19b5db4c95ce7bd7aad6b, which increased the recursion limit -- so it at least seems related.
It looks like 45e09f921be55e23bed19b5db4c95ce7bd7aad6b introduced the failing test, so it could just be that the test is incorrect (rather than a test going from passing to failing).
Ah, you're right; output similar to yours had scrolled off my screen. :-(
You should be able to experiment with some of the numbers changed by the PR to see which one affects the Windows PGO build. The stack limit is definitely an ongoing discussion.
Windows release buildbots are similarly failing: https://buildbot.python.org/all/#/builders/914/builds/3274/steps/4/logs/stdio
So it is crashing in this addition to test_functools.py:
```python
@support.skip_on_s390x
@unittest.skipIf(support.is_wasi, "WASI has limited C stack")
def test_lru_recursion(self):

    @self.module.lru_cache
    def fib(n):
        if n <= 1:
            return n
        return fib(n-1) + fib(n-2)

    if not support.Py_DEBUG:
        with support.infinite_recursion():
            fib(2500)
```
My conclusion is that the increased default C recursion limit is too large for the platform and we hit a stack overflow. Maybe try setting Py_C_RECURSION_LIMIT to something smaller than 8000? (I am trying now with 4000.)
> My conclusion is that the increased default C recursion limit is too large for the platform and we hit a stack overflow. Maybe try setting Py_C_RECURSION_LIMIT to something smaller than 8000? (I am trying now with 4000.)
Yes -- my experimentation arrived at that too. 5,000 is too large, but 4,000 seems to work. See #113668.
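For illustration, the kind of platform-conditional cap being arrived at might look like the sketch below. The header it would live in and the non-Windows value are assumptions, not the actual change in #113668; the grounded part is that 4,000 works on Windows where 8,000 overflows. With the ~2 MB stack reservation mentioned later in this thread, 8,000 C frames budgets only about 260 bytes of stack per frame, which the current build evidently exceeds.

```c
/* Illustrative sketch only: cap C recursion lower on Windows, where the
 * default stack reservation (~2 MB) is much smaller than the typical
 * 8 MB on Linux. The non-Windows value is an assumption. */
#if defined(_WIN32)
#  define Py_C_RECURSION_LIMIT 4000   /* 8000 overflowed the 2 MB stack */
#else
#  define Py_C_RECURSION_LIMIT 8000
#endif
```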
Also, this docstring (deleted by that PR) suggests a possible cause:
"""Set a lower limit for tests that interact with infinite recursions
(e.g test_ast.ASTHelpers_Test.test_recursion_direct) since on some
debug windows builds, due to not enough functions being inlined the
stack size might not handle the default recursion limit (1000). See
bpo-11105 for details."""
Maybe we are running into the inlining problem even in non-debug mode?
Let's put this on tomorrow's agenda.
Since Windows 8, we've had GetCurrentThreadStackLimits, which should be able to give us a better estimate of how many native frames will fit in various configurations. We can also pass in the initial stack size as a preprocessor variable, though it's very possible to create new threads with different sizes, and I believe we might use a smaller stack for new threads already?
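A minimal standalone sketch of querying that API (assumes a Windows 8+ SDK; this is just the raw call, not how CPython would integrate it):

```c
#define _WIN32_WINNT 0x0602   /* GetCurrentThreadStackLimits is Windows 8+ */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    ULONG_PTR low, high;

    /* Fills in the lowest and highest addresses of the calling
     * thread's stack reservation. */
    GetCurrentThreadStackLimits(&low, &high);
    printf("reserved stack: %zu KiB\n", (size_t)(high - low) / 1024);
    return 0;
}
```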
If we have any references to stack variables that are passed between frames (e.g. a pointer to a local in EvalFrame that's available in the next EvalFrame) then we can get the size of each recursion from the current build. Of course, that'll vary based on how it recurses (and how many of our own native APIs it goes through), but it's likely better than guessing. At the very least, it might be helpful for some assertions to detect when the guesses are invalidated.
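A rough standalone illustration of that measurement (all names here are made up; in CPython the two frames would be successive entries into the eval loop, and inlining can skew the number):

```c
#include <stddef.h>
#include <stdio.h>

/* Distance between the same local in two adjacent recursion levels:
 * an estimate of the native stack cost of one recursion in this build. */
static ptrdiff_t frame_cost;

static void fake_eval_frame(char *callers_marker, int depth)
{
    char marker;   /* pinned to this frame; its address escapes below */

    if (callers_marker != NULL) {
        frame_cost = callers_marker - &marker;   /* stack grows downward */
    }
    if (depth > 0) {
        fake_eval_frame(&marker, depth - 1);
    }
}

int main(void)
{
    fake_eval_frame(NULL, 2);
    printf("approx. stack per recursion: %ld bytes\n", (long)frame_cost);
    return 0;
}
```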
A true solution would be to use structured exception handling and catch the stack overflow in EvalFrame. That's going to leave things in a pretty messy state though (leaked references at a minimum, and likely worse), so I think it's better to just crash.
> Since Windows 8, we've had GetCurrentThreadStackLimits
We could also just check this against the address of a local in a function and raise if we're within some distance from the extent of the stack. It won't be a predictable number of recursions, and it may still be possible to use native recursion to cause a crash, but it should be fairly simple to implement and isn't really any worse than the alternatives.
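A sketch of that check, under the same Windows 8+ assumption; the margin and the function name are illustrative, and a real integration point (e.g. the eval loop's recursion check) would raise RecursionError when this returns false:

```c
#define _WIN32_WINNT 0x0602
#include <windows.h>

/* Illustrative safety margin; a real value would need tuning. */
#define STACK_HEADROOM (64 * 1024)

/* Nonzero while it is still safe to recurse: compares the address of a
 * local (an approximation of the current stack pointer) against the low
 * end of this thread's stack reservation. */
static int
stack_has_headroom(void)
{
    ULONG_PTR low, high;
    char marker;

    GetCurrentThreadStackLimits(&low, &high);
    return ((ULONG_PTR)&marker - low) > STACK_HEADROOM;
}
```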
@markshannon It seems that even with #113944 merged, this bug still persists -- PGO build log. It's this line:
File "C:\actions-runner\_work\benchmarking\benchmarking\cpython\Lib\test\test_functools.py", line 1875 in fib
Ditto for the (test) release build: https://dev.azure.com/Python/cpython/_build/results?buildId=147745&view=logs&j=b9a7e24a-1d9b-5019-7703-8e6075dba299&t=e637a57b-456d-5614-0397-39924ec65ef2
Should we perhaps set the default limit based on "reasonable stack depth without users being aware that they'll need more" rather than trying to eke out as many frames as possible based on pre-compile time heuristics?
Can we detect when profiling is being run? If so, we could skip the offending test, test_lru_recursion, during profiling.
~~Yeah, Lib/test/libregrtest/pgo.py has the set of implied command line options, so it can be added to that.~~ My bad, exclusions have already been separated out. But support.PGO should be True, so we can skip on that.
My main concern though was whether we think this code path is important to profile? Are we likely to trace through it another way?
~~Even if we skip the test for profiling (I agree it probably doesn't make PGO better), wouldn't it just fail on full test suite runs elsewhere (and indicate a bug)?~~
EDIT: I see the further discussion in #114148 now -- seems like just disabling in PGO isn't the plan now anyway.
We shouldn't have bigger stack frames in a PGO build than in a Release build, though that is what I'll be testing next once my build is done. It's just the instrumented build that is bigger than predicted.
Turns out we do still crash - PGO must decide to use more locals, or inline something with locals. I'll try and figure out the stack size we need and we can just boost up the memory usage until we get a better way to prevent overflow.
Looks like an extra 40KB is enough, but I'll add on a whole MB to be safe.
The final change in #114148 adds an extra MB of stack space for PGO builds, which I'd rather not keep in there. Once the next alpha is out (today or tomorrow - currently blocked on this issue) then we can take that out and find a better way to prevent overflows without consuming extra memory for every thread.
@zooba Does this still deserve to be labeled deferred-release-blocker?
Unless there's been another fix that isn't linked to this issue, then yes. Right now we can only release because I have a workaround in the release pipeline (50% more memory used for every thread) - hence deferred - but there's still a fix required.
Should I look into this? Currently I have no idea what's going on, but I can dive in.
I think this is resolved and should have been closed. Are you still able to reproduce it, @encukou?
> I think this is resolved and should have been closed.
Where's the fix?
There's a workaround in the current build which significantly overallocates memory, otherwise we wouldn't be able to release. Build with %UseExtraStackReserve%=false to disable this and try it. (you'll find the use of this in python.vcxproj)
The fix is to reduce Py_C_RECURSION_LIMIT and increase the stack size (StackReserveSize), no?
> There's a workaround in the current build which significantly overallocates memory
Is it a real issue to allocate 8 MB stack?
> Is it a real issue to allocate 8 MB stack?
Depends how many threads you create. IIRC, all the physical memory is reserved to ensure it will be contiguous, so you will run out of RAM 4x quicker than with a 2 MB stack.
I'm still not sure I see how this is a blocker bug. What's being blocked? The issue is that PGO builds need more stack memory than usual, right? Is this a (RAM) optimization issue?
The final release build requires more stack memory than usual because PGO was enabled, yes (or no, if you were thinking something different).
It's a C stack optimization issue. For whatever reason, due to how the stack ends up looking after PGO is applied (presumably to the main interpreter loop), we need to allocate more memory for every thread to account for the expected number of C recursions. Rather than revert the entire complex change that led to that regression, I chose to temporarily expand the memory used and left it marked as a deferred blocker. Without that workaround, we would've reverted a significant change (improvement, on the whole) to recursion tracking so we could release.
It would be great to handle the recursion count better so that we don't have to experimentally reduce the limit each time something changes in the interpreter loop structure. But there's a whole paid team working on that, so I was hoping that by leaving it assigned to them, they'd do it. I'm not happy with saying "let's charge every user 1MB/thread because our team couldn't fix this".