Stack overflow collecting PGO data on Windows
Bug report
Bug description:
During a PGO build on Windows on main (471aa75), a test in test_functools fails with a stack overflow.
```
PCbuild\build.bat --pgo
```

```
0:02:50 load avg: 0.35 [19/44] test_functools
Windows fatal exception: stack overflow

Thread 0x00002b38 (most recent call first):
  File "C:\actions-runner\_work\benchmarking\benchmarking\cpython\Lib\test\libregrtest\win_utils.py", line 43 in _update_load

Current thread 0x00000e3c (most recent call first):
  File "C:\actions-runner\_work\benchmarking\benchmarking\cpython\Lib\test\test_functools.py", line 1875 in fib
  File "C:\actions-runner\_work\benchmarking\benchmarking\cpython\Lib\test\test_functools.py", line 1875 in fib
  ...

C:\actions-runner\_work\benchmarking\benchmarking\cpython>"C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Current\Bin\msbuild.exe" "C:\actions-runner\_work\benchmarking\benchmarking\cpython\PCbuild\\pythoncore.vcxproj" /t:KillPython /nologo /v:m /clp:summary /p:Configuration=PGInstrument /p:Platform=x64 /p:KillPython=true
Killing any running python.exe instances...
C:\actions-runner\_work\benchmarking\benchmarking\cpython\PCbuild\pyproject.props(184,5): error : PGO run did not succeed (no python313!*.pgc files) and there is no data to merge [C:\actions-runner\_work\benchmarking\benchmarking\cpython\PCbuild\pythoncore.vcxproj]

Build FAILED.
```
CPython versions tested on:
CPython main branch
Operating systems tested on:
Windows
Linked PRs
- gh-113668
- gh-113701
- gh-113944
- gh-114148
- gh-114263
- gh-114896
- gh-115770
- gh-124264
Trying to repro, I can't even manage to complete a PGO build (first step above).
It sounds like you are repro'ing (I probably wasn't clear enough). It fails during the PGO build while running the tests that generate PGO data.
git bisect tells me the first bad commit is 45e09f921be55e23bed19b5db4c95ce7bd7aad6b, which increased the recursion limit -- so it at least seems related.
It looks like 45e09f921be55e23bed19b5db4c95ce7bd7aad6b introduced the failing test, so it could just be that the test is incorrect (rather than a test going from passing to failing).
Ah, you're right; output similar to yours had scrolled off my screen. :-(
You should be able to experiment with some of the numbers changed by the PR to see which one affects the Windows PGO build. The stack limit is definitely an ongoing discussion.
Windows release buildbots are similarly failing: https://buildbot.python.org/all/#/builders/914/builds/3274/steps/4/logs/stdio
So it is crashing in this addition to test_functools.py:
```python
@support.skip_on_s390x
@unittest.skipIf(support.is_wasi, "WASI has limited C stack")
def test_lru_recursion(self):

    @self.module.lru_cache
    def fib(n):
        if n <= 1:
            return n
        return fib(n-1) + fib(n-2)

    if not support.Py_DEBUG:
        with support.infinite_recursion():
            fib(2500)
```
My conclusion is that the increased default C recursion limit is too large for the platform and we hit a stack overflow. Maybe try setting Py_C_RECURSION_LIMIT to something smaller than 8000? (I am trying now with 4000.)
> My conclusion is that the increased default C recursion limit is too large for the platform and we hit a stack overflow. Maybe try setting Py_C_RECURSION_LIMIT to something smaller than 8000? (I am trying now with 4000.)
Yes -- my experimentation arrived at that too. 5,000 is too large, but 4,000 seems to work. See #113668.
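For illustration, the kind of platform-conditional cap being arrived at might look like the sketch below. The header it would live in and the non-Windows value are assumptions, not the actual change in #113668; the grounded part is that 4,000 works on Windows where 8,000 overflows. With the ~2 MB stack reservation mentioned later in this thread, 8,000 C frames budgets only about 260 bytes of stack per frame, which the current build evidently exceeds.

```c
/* Illustrative sketch only: cap C recursion lower on Windows, where the
 * default stack reservation (~2 MB) is much smaller than the typical
 * 8 MB on Linux. The non-Windows value is an assumption. */
#if defined(_WIN32)
#  define Py_C_RECURSION_LIMIT 4000   /* 8000 overflowed the 2 MB stack */
#else
#  define Py_C_RECURSION_LIMIT 8000
#endif
```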
Also, this docstring (deleted by that PR) suggests a possible cause:
"""Set a lower limit for tests that interact with infinite recursions
(e.g test_ast.ASTHelpers_Test.test_recursion_direct) since on some
debug windows builds, due to not enough functions being inlined the
stack size might not handle the default recursion limit (1000). See
bpo-11105 for details."""
Maybe we are running into the inlining problem even in non-debug mode?
Let's put this on tomorrow's agenda.
Since Windows 8, we've had GetCurrentThreadStackLimits, which should be able to give us a better estimate of how many native frames will fit in various configurations. We can also pass in the initial stack size as a preprocessor variable, though it's very possible to create new threads with different sizes, and I believe we might use a smaller stack for new threads already?
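A minimal standalone sketch of querying that API (assumes a Windows 8+ SDK; this is just the raw call, not how CPython would integrate it):

```c
#define _WIN32_WINNT 0x0602   /* GetCurrentThreadStackLimits is Windows 8+ */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    ULONG_PTR low, high;

    /* Fills in the lowest and highest addresses of the calling
     * thread's stack reservation. */
    GetCurrentThreadStackLimits(&low, &high);
    printf("reserved stack: %zu KiB\n", (size_t)(high - low) / 1024);
    return 0;
}
```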
If we have any references to stack variables that are passed between frames (e.g. a pointer to a local in EvalFrame that's available in the next EvalFrame) then we can get the size of each recursion from the current build. Of course, that'll vary based on how it recurses (and how many of our own native APIs it goes through), but it's likely better than guessing. At the very least, it might be helpful for some assertions to detect when the guesses are invalidated.
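A rough standalone illustration of that measurement (all names here are made up; in CPython the two frames would be successive entries into the eval loop, and inlining can skew the number):

```c
#include <stddef.h>
#include <stdio.h>

/* Distance between the same local in two adjacent recursion levels:
 * an estimate of the native stack cost of one recursion in this build. */
static ptrdiff_t frame_cost;

static void fake_eval_frame(char *callers_marker, int depth)
{
    char marker;   /* pinned to this frame; its address escapes below */

    if (callers_marker != NULL) {
        frame_cost = callers_marker - &marker;   /* stack grows downward */
    }
    if (depth > 0) {
        fake_eval_frame(&marker, depth - 1);
    }
}

int main(void)
{
    fake_eval_frame(NULL, 2);
    printf("approx. stack per recursion: %ld bytes\n", (long)frame_cost);
    return 0;
}
```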
A true solution would be to use structured exception handling and catch the stack overflow in EvalFrame. That's going to leave things in a pretty messy state though (leaked references at a minimum, and likely worse), so I think it's better to just crash.
> Since Windows 8, we've had GetCurrentThreadStackLimits
We could also just check this against the address of a local in a function and raise if we're within some distance from the extent of the stack. It won't be a predictable number of recursions, and it may still be possible to use native recursion to cause a crash, but it should be fairly simple to implement and isn't really any worse than the alternatives.
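A sketch of that check, under the same Windows 8+ assumption; the margin and the function name are illustrative, and a real integration point (e.g. the eval loop's recursion check) would raise RecursionError when this returns false:

```c
#define _WIN32_WINNT 0x0602
#include <windows.h>

/* Illustrative safety margin; a real value would need tuning. */
#define STACK_HEADROOM (64 * 1024)

/* Nonzero while it is still safe to recurse: compares the address of a
 * local (an approximation of the current stack pointer) against the low
 * end of this thread's stack reservation. */
static int
stack_has_headroom(void)
{
    ULONG_PTR low, high;
    char marker;

    GetCurrentThreadStackLimits(&low, &high);
    return ((ULONG_PTR)&marker - low) > STACK_HEADROOM;
}
```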
@markshannon It seems that even with #113944 merged, this bug still persists -- PGO build log. It's this line:
File "C:\actions-runner\_work\benchmarking\benchmarking\cpython\Lib\test\test_functools.py", line 1875 in fib
Ditto for the (test) release build: https://dev.azure.com/Python/cpython/_build/results?buildId=147745&view=logs&j=b9a7e24a-1d9b-5019-7703-8e6075dba299&t=e637a57b-456d-5614-0397-39924ec65ef2
Should we perhaps set the default limit based on "reasonable stack depth without users being aware that they'll need more" rather than trying to eke out as many frames as possible based on pre-compile time heuristics?
Can we detect when profiling is being run? If so, we could skip the offending test, test_lru_recursion, during profiling.
~~Yeah, Lib/test/libregrtest/pgo.py has the set of implied command line options, so it can be added to that.~~ My bad, exclusions have already been separated out. But support.PGO should be True, so we can skip on that.
My main concern though was whether we think this code path is important to profile? Are we likely to trace through it another way?
~~Even if we skip the test for profiling (I agree it probably doesn't make PGO better), wouldn't it just fail on full test suite runs elsewhere (and indicate a bug)?~~
EDIT: I see the further discussion in #114148 now -- seems like just disabling in PGO isn't the plan now anyway.
We shouldn't have bigger stack frames in a PGO build than in a Release build, though that is what I'll be testing next once my build is done. It's just the instrumented build that is bigger than predicted.
Turns out we do still crash - PGO must decide to use more locals, or inline something with locals. I'll try and figure out the stack size we need and we can just boost up the memory usage until we get a better way to prevent overflow.
Looks like an extra 40KB is enough, but I'll add on a whole MB to be safe.
The final change in #114148 adds an extra MB of stack space for PGO builds, which I'd rather not keep in there. Once the next alpha is out (today or tomorrow - currently blocked on this issue) then we can take that out and find a better way to prevent overflows without consuming extra memory for every thread.
@zooba Does this still deserve to be labeled deferred-release-blocker?
Unless there's been another fix that isn't linked to this issue, then yes. Right now we can only release because I have a workaround in the release pipeline (50% more memory used for every thread) - hence deferred - but there's still a fix required.
Should I look into this? Currently I have no idea what's going on, but I can dive in.
I think this is resolved and should have been closed. Are you still able to reproduce it, @encukou?
> I think this is resolved and should have been closed.
Where's the fix?
There's a workaround in the current build which significantly overallocates memory, otherwise we wouldn't be able to release. Build with %UseExtraStackReserve%=false to disable this and try it. (you'll find the use of this in python.vcxproj)
The fix is to reduce Py_C_RECURSION_LIMIT and increase the stack size (StackReserveSize), no?
> There's a workaround in the current build which significantly overallocates memory
Is it a real issue to allocate 8 MB stack?
> Is it a real issue to allocate 8 MB stack?
Depends how many threads you create. IIRC, all the physical memory is reserved to ensure it will be contiguous, so you will run out of RAM 4x quicker than with a 2 MB stack.
I'm still not sure I see how this is a blocker bug. What's being blocked? The issue is that PGO builds need more stack memory than usual, right? Is this a (RAM) optimization issue?
The final release build requires more stack memory than usual because PGO was enabled, yes (or no, if you were thinking something different).
It's a C stack optimization issue. For whatever reason, due to how the stack ends up looking after PGO is applied (presumably to the main interpreter loop), we need to allocate more memory for every thread to account for the expected number of C recursions. Rather than revert the entire complex change that led to that regression, I chose to temporarily expand the memory used and left it marked as a deferred blocker. Without that workaround, we would've reverted a significant change (improvement, on the whole) to recursion tracking so we could release.
It would be great to handle the recursion count better so that we don't have to experimentally reduce the limit each time something changes in the interpreter loop structure. But there's a whole paid team working on that, so I was hoping that by leaving it assigned to them, they'd do it. I'm not happy with saying "let's charge every user 1MB/thread because our team couldn't fix this".