cpython icon indicating copy to clipboard operation
cpython copied to clipboard

Stack overflow collecting PGO data on Windows

Open mdboom opened this issue 2 years ago • 43 comments

Bug report

Bug description:

During a PGO build on Windows on main (471aa75), a test in test_functools fails with a stack overflow.

Full log of build

PCbuild\build.bat --pgo
0:02:50 load avg: 0.35 [19/44] test_functools
Windows fatal exception: stack overflow

Thread 0x00002b38 (most recent call first):
  File "C:\actions-runner\_work\benchmarking\benchmarking\cpython\Lib\test\libregrtest\win_utils.py", line 43 in _update_load

Current thread 0x00000e3c (most recent call first):
  File "C:\actions-runner\_work\benchmarking\benchmarking\cpython\Lib\test\test_functools.py", line 1[875](https://github.com/faster-cpython/benchmarking/actions/runs/7367240496/job/20050307979#step:10:876) in fib
  File "C:\actions-runner\_work\benchmarking\benchmarking\cpython\Lib\test\test_functools.py", line 1875 in fib
  ...

C:\actions-runner\_work\benchmarking\benchmarking\cpython>"C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Current\Bin\msbuild.exe" "C:\actions-runner\_work\benchmarking\benchmarking\cpython\PCbuild\\pythoncore.vcxproj" /t:KillPython /nologo /v:m /clp:summary /p:Configuration=PGInstrument /p:Platform=x64 /p:KillPython=true 

  Killing any running python.exe instances...

C:\actions-runner\_work\benchmarking\benchmarking\cpython\PCbuild\pyproject.props(184,5): error : PGO run did not succeed (no python313!*.pgc files) and there is no data to merge [C:\actions-runner\_work\benchmarking\benchmarking\cpython\PCbuild\pythoncore.vcxproj]

Build FAILED.

CPython versions tested on:

CPython main branch

Operating systems tested on:

Windows

Linked PRs

  • gh-113668
  • gh-113701
  • gh-113944
  • gh-114148
  • gh-114263
  • gh-114896
  • gh-115770
  • gh-124264

mdboom avatar Jan 02 '24 14:01 mdboom

Trying to repro, I can't even manage to complete a PGO build (first step above).

gvanrossum avatar Jan 02 '24 17:01 gvanrossum

It sounds like you are repro'ing (I probably wasn't clear enough). It fails during the PGO build while running the tests that generate PGO data.

git bisect tells me the first bad commit is: 45e09f921be55e23bed19b5db4c95ce7bd7aad6b (which increased the recursion limit, which at least seems related).

mdboom avatar Jan 02 '24 18:01 mdboom

It looks like 45e09f921be55e23bed19b5db4c95ce7bd7aad6b introduced the failing test, so it could just be that the test is incorrect (rather than a test going from passing to failing).

mdboom avatar Jan 02 '24 18:01 mdboom

Ah, you're right, the output similar to yours scrolled off my screen. :-(

You should be able to experiment with some of the numbers changed by the PR to see which one affects the Windows PGO build. The stack limit is definitely an ongoing discussion.

gvanrossum avatar Jan 02 '24 19:01 gvanrossum

Windows release buildbots are similarly failing: https://buildbot.python.org/all/#/builders/914/builds/3274/steps/4/logs/stdio

mdboom avatar Jan 02 '24 20:01 mdboom

So it is crashing in this addition to test_functools.py:

    @support.skip_on_s390x
    @unittest.skipIf(support.is_wasi, "WASI has limited C stack")
    def test_lru_recursion(self):

        @self.module.lru_cache
        def fib(n):
            if n <= 1:
                return n
            return fib(n-1) + fib(n-2)

        if not support.Py_DEBUG:
            with support.infinite_recursion():
                fib(2500)

My conclusion is that the increased default C recursion limit is too large for the platform and we hit a stack overflow. Maybe try setting Py_C_RECURSION_LIMIT to something smaller than 8000? (I am trying now with 4000.)

gvanrossum avatar Jan 02 '24 21:01 gvanrossum

My conclusion is that the increased default C recursion limit is too large for the platform and we hit a stack overflow. Maybe try setting Py_C_RECURSION_LIMIT to something smaller than 8000? (I am trying now with 4000.)

Yes -- my experimentation arrived at that too. 5,000 is too large, but 4,000 seems to work. See #113668.

mdboom avatar Jan 02 '24 21:01 mdboom

Also, this docstring (deleted by that PR) suggests a possible cause:

    """Set a lower limit for tests that interact with infinite recursions
    (e.g test_ast.ASTHelpers_Test.test_recursion_direct) since on some
    debug windows builds, due to not enough functions being inlined the
    stack size might not handle the default recursion limit (1000). See
    bpo-11105 for details."""

Maybe we are running into the inlining problem even in non-debug mode?

gvanrossum avatar Jan 02 '24 21:01 gvanrossum

Let's put this on tomorrow's agenda.

gvanrossum avatar Jan 02 '24 23:01 gvanrossum

Since Windows 8, we've had GetCurrentThreadStackLimits, which should be able to give us a better estimate of how many native frames will fit in various configurations. We can also pass in the initial stack size as a preprocessor variable, though it's very possible to create new threads with different sizes, and I believe we might use a smaller stack for new threads already?

If we have any references to stack variables that are passed between frames (e.g. a pointer to a local in EvalFrame that's available in the next EvalFrame) then we can get the size of each recursion from the current build. Of course, that'll vary based on how it recurses (and how many of our own native APIs it goes through), but it's likely better than guessing. At the very least, it might be helpful for some assertions to detect when the guesses are invalidated.

A true solution would be to use structured exception handling and catch the stack overflow in EvalFrame. That's going to leave things in a pretty messy state though (leaked references at a minimum, and likely worse), so I think it's better to just crash.

zooba avatar Jan 05 '24 15:01 zooba

Since Windows 8, we've had GetCurrentThreadStackLimits

We could also just check this against the address of a local in a function and raise if we're within some distance from the extent of the stack. It won't be a predictable number of recursions, and it may still be possible to use native recursion to cause a crash, but it should be fairly simple to implement and isn't really any worse than the alternatives.

zooba avatar Jan 05 '24 15:01 zooba

@markshannon It seems that even with #113944 merged, this bug still persists -- PGO build log. It's this line:

File "C:\actions-runner\_work\benchmarking\benchmarking\cpython\Lib\test\test_functools.py", line 1875 in fib

mdboom avatar Jan 16 '24 17:01 mdboom

Ditto for the (test) release build: https://dev.azure.com/Python/cpython/_build/results?buildId=147745&view=logs&j=b9a7e24a-1d9b-5019-7703-8e6075dba299&t=e637a57b-456d-5614-0397-39924ec65ef2

Should we perhaps set the default limit based on "reasonable stack depth without users being aware that they'll need more" rather than trying to eke out as many frames as possible based on pre-compile time heuristics?

zooba avatar Jan 16 '24 17:01 zooba

Can we detect when profiling is being run? If so, we could skip the offending test, test_lru_recursion, during profiling.

markshannon avatar Jan 16 '24 18:01 markshannon

~~Yeah, Lib/test/libregrtest/pgo.py has the set of implied command line options, so it can be added to that.~~ My bad, exclusions have already been separated out. But support.PGO should be True so can skip on that.

zooba avatar Jan 16 '24 19:01 zooba

My main concern though was whether we think this code path is important to profile? Are we likely to trace through it another way?

zooba avatar Jan 16 '24 19:01 zooba

~Even if we skip the test for profiling (I agree it probably doesn't make PGO better), wouldn't it just fail on full test suite runs elsewhere (and indicate a bug)?~

EDIT: I see the further discussion in #114148 now -- seems like just disabling in PGO isn't the plan now anyway.

mdboom avatar Jan 16 '24 20:01 mdboom

We shouldn't have bigger stack frames in PGO build than a Release build, though that is what I'll be testing next once my build is done. It's just the instrumented build that is bigger than predicted.

zooba avatar Jan 16 '24 20:01 zooba

Turns out we do still crash - PGO must decide to use more locals, or inline something with locals. I'll try and figure out the stack size we need and we can just boost up the memory usage until we get a better way to prevent overflow.

zooba avatar Jan 16 '24 21:01 zooba

Looks like an extra 40KB is enough, but I'll add on a whole MB to be safe.

zooba avatar Jan 16 '24 21:01 zooba

The final change in #114148 adds an extra MB of stack space for PGO builds, which I'd rather not keep in there. Once the next alpha is out (today or tomorrow - currently blocked on this issue) then we can take that out and find a better way to prevent overflows without consuming extra memory for every thread.

zooba avatar Jan 16 '24 21:01 zooba

@zooba Does this still deserve to be labeled deferred-release-blocker?

gvanrossum avatar May 07 '24 00:05 gvanrossum

Unless there's been another fix that isn't linked to this issue, then yes. Right now we can only release because I have a workaround in the release pipeline (50% more memory used for every thread) - hence deferred - but there's still a fix required.

zooba avatar May 07 '24 14:05 zooba

Should I look into this? Currently I have no idea what's going on, but I can dive in.

encukou avatar Jan 13 '25 12:01 encukou

I think this is resolved and should have been closed. Are you still able to reproduce it, @encukou?

mdboom avatar Jan 13 '25 13:01 mdboom

I think this is resolved and should have been closed.

Where's the fix?

There's a workaround in the current build which significantly overallocates memory, otherwise we wouldn't be able to release. Build with %UseExtraStackReserve%=false to disable this and try it. (you'll find the use of this in python.vcxproj)

zooba avatar Jan 14 '25 00:01 zooba

The fix is to reduce Py_C_RECURSION_LIMIT and increase the stack size (StackReserveSize), no?

There's a workaround in the current build which significantly overallocates memory

Is it a real issue to allocate 8 MB stack?

vstinner avatar Jan 14 '25 07:01 vstinner

Is it a real issue to allocate 8 MB stack?

Depends how many threads you create. IIRC, all the physical memory is reserved to ensure it will be contiguous, so you will run out of RAM 4x quicker than a 2MB stack.

zooba avatar Jan 17 '25 13:01 zooba

I'm still not sure I see how this is a blocker bug. What's being blocked? The issue is that PGO builds need more stack memory than usual, right? Is this a (RAM) optimization issue?

encukou avatar Jan 20 '25 15:01 encukou

The final release build requires more stack memory than usual because PGO was enabled, yes (or no, if you were thinking something different).

It's a C stack optimization issue. For whatever reason, due to how the stack ends up looking after PGO is applied (presumably to the main interpreter loop), we need to allocate more memory for every thread to account for the expected number of C recursions. Rather than revert the entire complex change that led to that regression, I chose to temporarily expand the memory used and left it marked as a deferred blocker. Without that workaround, we would've reverted a significant change (improvement, on the whole) to recursion tracking so we could release.

It would be great to handle the recursion count better so that we don't have to experimentally reduce the limit each time something changes in the interpreter loop structure. But there's a whole paid team working on that, so I was hoping that by leaving it assigned to them, they'd do it. I'm not happy with saying "let's charge every user 1MB/thread because our team couldn't fix this".

zooba avatar Jan 20 '25 16:01 zooba