Tests crashing in CI with no dump: exit code 137 means SIGKILL Killed
Build Information
Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=527171
Build error leg or test failing: System.Text.Json.Tests
Pull request: https://github.com/dotnet/runtime/pull/96894
Error Message
Fill the error message using step by step known issues guidance.
{
"ErrorMessage": "exit code 137 means SIGKILL Killed",
"ErrorPattern": "",
"BuildRetry": false,
"ExcludeConsoleLog": false
}
Known issue validation
Build: :mag_right: https://dev.azure.com/dnceng-public/public/_build/results?buildId=527171
Error message validated: exit code 137 means SIGKILL Killed
Result validation: :white_check_mark: Known issue matched with the provided build.
Validation performed at: 1/25/2024 7:09:47 PM UTC
Report
Summary
| 24-Hour Hit Count | 7-Day Hit Count | 1-Month Count |
|---|---|---|
| 45 | 369 | 1030 |
Tagging subscribers to this area: @dotnet/area-infrastructure-libraries See info in area-owners.md if you want to be subscribed.
Issue Details
Build Information
Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=527171
Build error leg or test failing: System.IO.Tests.File_GetSetTimes_SafeFileHandle.WritingShouldUpdateWriteTime_After_SetLastAccessTime
Pull request: https://github.com/dotnet/runtime/pull/96894
Error Message
Fill the error message using step by step known issues guidance.
{
"ErrorMessage": "exit code 137 means SIGKILL Killed eg by kill",
"ErrorPattern": "",
"BuildRetry": false,
"ExcludeConsoleLog": false
}
| Author: | jkotas |
|---|---|
| Assignees: | - |
| Labels: | |
| Milestone: | - |
Discovering: System.Text.Json.Tests (method display = ClassAndMethod, method display options = None)
Discovered: System.Text.Json.Tests (found 7275 of 7334 test cases)
Starting: System.Text.Json.Tests (parallel test collections = on, max threads = 6)
./RunTests.sh: line 180: 5589 Killed: 9 "$RUNTIME_PATH/dotnet" exec --runtimeconfig System.Text.Json.Tests.runtimeconfig.json --depsfile System.Text.Json.Tests.deps.json xunit.console.dll System.Text.Json.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing $RSP_FILE
/private/tmp/helix/working/B0B70943/w/A9E50974/e
----- end Mon Jan 15 01:12:38 PST 2024 ----- exit code 137 ----------------------------------------------------------
exit code 137 means SIGKILL Killed eg by kill
We need dumps to make this diagnosable.
SIGKILL is a pretty unusual way to take down the process... do we know if there's anything in the infra which can produce a SIGKILL?
Exit code 137 can be caused by OOM.
It doesn't look like we have a mechanism to grab dumps if it is OOM, though: https://github.com/dotnet/runtime/issues/52521
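For readers unfamiliar with the convention behind "exit code 137": shells report a process killed by a signal as 128 plus the signal number, and 137 - 128 = 9, which is SIGKILL. A minimal sketch of that decoding follows; `describe_exit_code` is a hypothetical helper name, not part of any test infrastructure. Note that the exit code alone does not say who sent the signal, so it cannot by itself confirm an OOM kill.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a shell-style exit code (hypothetical helper for illustration).

    By shell convention, codes above 128 mean the process was terminated
    by signal number (code - 128).
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"unknown signal {signum}"
        return f"killed by {name} (signal {signum})"
    return f"exited normally with status {code}"


# On POSIX systems prints: killed by SIGKILL (signal 9)
print(describe_exit_code(137))
```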
This seems to fail consistently on all PRs
I was able to catch a live local repro and attach a debugger to it. There is one runaway thread with an extremely deep stack trace. All other threads are waiting for the GC suspension to finish.
The runaway thread keeps allocating memory at a very fast pace. You can see that by running the top command in a second shell. Once it has allocated about 100GB, the process gets killed.
The repro is sensitive to timing. It stopped reproing for me if I added any kind of verbose logging.
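As a scripted alternative to watching top by hand, the memory growth can be sampled with `ps`. This is only a sketch under assumptions: the helper names are hypothetical, it relies on `ps -o rss= -p <pid>` reporting resident set size in KiB (as it does on macOS and Linux), and resident size may understate what the process has reserved. It is also external polling, so it should not perturb the timing of the test process the way in-process logging did.

```python
import subprocess
import time


def parse_rss_kib(ps_output: str) -> int:
    """Parse the output of `ps -o rss= -p <pid>` (resident set size in KiB)."""
    return int(ps_output.strip())


def growth_kib_per_sec(samples):
    """Average growth rate over a list of (timestamp_seconds, rss_kib) samples."""
    if len(samples) < 2:
        return 0.0
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    if t1 == t0:
        return 0.0
    return (r1 - r0) / (t1 - t0)


def sample_rss(pid: int, count: int = 5, interval: float = 1.0):
    """Poll ps for the given pid; stops early if the process has gone away."""
    samples = []
    for _ in range(count):
        result = subprocess.run(
            ["ps", "-o", "rss=", "-p", str(pid)],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0 or not result.stdout.strip():
            break  # process exited (or was killed) between samples
        samples.append((time.monotonic(), parse_rss_kib(result.stdout)))
        time.sleep(interval)
    return samples
```

Usage would be along the lines of `growth_kib_per_sec(sample_rss(5589))`, with the pid of the `dotnet exec` test process substituted in.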
(lldb) bt
* thread #20
* frame #0: 0x00007fff205944a8 libsystem_pthread.dylib`___chkstk_darwin + 96
frame #1: 0x0000000000000020
frame #2: 0x000000010a0370c8 libcoreclr.dylib`SEHProcessException(PAL_SEHException*) + 408
frame #3: 0x000000010a0a0def libcoreclr.dylib`PAL_DispatchExceptionInner(_CONTEXT*, _EXCEPTION_RECORD*) + 191
frame #4: 0x000000010a0a0cf8 libcoreclr.dylib`PAL_DispatchException + 72
frame #5: 0x000000010a0a0694 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 10
frame #6: 0x00007fff205944a8 libsystem_pthread.dylib`___chkstk_darwin + 96
frame #7: 0x000000010a0370c8 libcoreclr.dylib`SEHProcessException(PAL_SEHException*) + 408
frame #8: 0x000000010a0a0def libcoreclr.dylib`PAL_DispatchExceptionInner(_CONTEXT*, _EXCEPTION_RECORD*) + 191
frame #9: 0x000000010a0a0cf8 libcoreclr.dylib`PAL_DispatchException + 72
frame #10: 0x000000010a0a0694 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 10
frame #11: 0x00007fff205944a8 libsystem_pthread.dylib`___chkstk_darwin + 96
frame #12: 0x000000010a0370c8 libcoreclr.dylib`SEHProcessException(PAL_SEHException*) + 408
frame #13: 0x000000010a0a0def libcoreclr.dylib`PAL_DispatchExceptionInner(_CONTEXT*, _EXCEPTION_RECORD*) + 191
frame #14: 0x000000010a0a0cf8 libcoreclr.dylib`PAL_DispatchException + 72
frame #15: 0x000000010a0a0694 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 10
frame #16: 0x00007fff205944a8 libsystem_pthread.dylib`___chkstk_darwin + 96
frame #17: 0x000000010a0370c8 libcoreclr.dylib`SEHProcessException(PAL_SEHException*) + 408
frame #18: 0x000000010a0a0def libcoreclr.dylib`PAL_DispatchExceptionInner(_CONTEXT*, _EXCEPTION_RECORD*) + 191
frame #19: 0x000000010a0a0cf8 libcoreclr.dylib`PAL_DispatchException + 72
frame #20: 0x000000010a0a0694 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 10
frame #21: 0x00007fff205944a8 libsystem_pthread.dylib`___chkstk_darwin + 96
frame #22: 0x000000010a0370c8 libcoreclr.dylib`SEHProcessException(PAL_SEHException*) + 408
frame #23: 0x000000010a0a0def libcoreclr.dylib`PAL_DispatchExceptionInner(_CONTEXT*, _EXCEPTION_RECORD*) + 191
frame #24: 0x000000010a0a0cf8 libcoreclr.dylib`PAL_DispatchException + 72
frame #25: 0x000000010a0a0694 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 10
...
frame #299994: 0x000000010a0a0cf8 libcoreclr.dylib`PAL_DispatchException + 72
frame #299995: 0x000000010a0a0694 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 10
frame #299996: 0x00007fff205944a8 libsystem_pthread.dylib`___chkstk_darwin + 96
frame #299997: 0x000000010a0370c8 libcoreclr.dylib`SEHProcessException(PAL_SEHException*) + 408
frame #299998: 0x000000010a0a0def libcoreclr.dylib`PAL_DispatchExceptionInner(_CONTEXT*, _EXCEPTION_RECORD*) + 191
frame #299999: 0x000000010a0a0cf8 libcoreclr.dylib`PAL_DispatchException + 72
(lldb)
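The backtrace above repeats the same group of frames over and over. A rough sketch of how that repetition could be confirmed mechanically from saved `bt` output is below. The helper names are hypothetical, and the parsing assumes each frame line has the form shown above (module, backtick, symbol, ` + offset`); lines without a symbol, such as the bare address in frame #1, are skipped.

```python
import re

# Symbol name sits between the backtick and the trailing " + <offset>".
SYMBOL_RE = re.compile(r"`(.+) \+ \d+$")

SAMPLE = """
frame #6: 0x00007fff205944a8 libsystem_pthread.dylib`___chkstk_darwin + 96
frame #7: 0x000000010a0370c8 libcoreclr.dylib`SEHProcessException(PAL_SEHException*) + 408
frame #8: 0x000000010a0a0def libcoreclr.dylib`PAL_DispatchExceptionInner(_CONTEXT*, _EXCEPTION_RECORD*) + 191
frame #9: 0x000000010a0a0cf8 libcoreclr.dylib`PAL_DispatchException + 72
frame #10: 0x000000010a0a0694 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 10
frame #11: 0x00007fff205944a8 libsystem_pthread.dylib`___chkstk_darwin + 96
frame #12: 0x000000010a0370c8 libcoreclr.dylib`SEHProcessException(PAL_SEHException*) + 408
frame #13: 0x000000010a0a0def libcoreclr.dylib`PAL_DispatchExceptionInner(_CONTEXT*, _EXCEPTION_RECORD*) + 191
frame #14: 0x000000010a0a0cf8 libcoreclr.dylib`PAL_DispatchException + 72
frame #15: 0x000000010a0a0694 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 10
"""


def frame_symbols(bt_text: str):
    """Extract the symbol name from each lldb frame line that has one."""
    symbols = []
    for line in bt_text.splitlines():
        match = SYMBOL_RE.search(line.strip())
        if match:
            symbols.append(match.group(1))
    return symbols


def smallest_period(items):
    """Return the smallest p such that items[i] == items[i + p] for all i, or None."""
    n = len(items)
    for p in range(1, n // 2 + 1):
        if all(items[i] == items[i + p] for i in range(n - p)):
            return p
    return None


symbols = frame_symbols(SAMPLE)
period = smallest_period(symbols)
print(period, symbols[:period] if period else symbols)
```

On the sample frames this reports a five-frame cycle starting at `___chkstk_darwin`, which matches the pattern visible by eye: the stack probe faults, the fault is dispatched as an exception, and the dispatch path faults in the stack probe again.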
@janvorli Could you please take a look? It is hit by nearly all CI jobs and it looks related to your EH refactoring.
I will take a look.
@jkotas do you happen to know which of the tests in the suite was failing when you were able to repro it? I am currently trying to run the System.Text.Json.Tests in a loop locally on the current main.
System.Text.Json.Tests, debug build of libraries, checked build of the runtime, native x64 macOS. I was not able to repro it with the emulator on M1.
There are multiple issues:
- This type of crash does not produce dumps to allow diagnosing it (known infra gap)
- Stack overflow handling on macOS goes into an infinite memory allocation loop. @janvorli is looking into it. I have opened #98477 on it.
- DeepNestedJsonFileTest in System.Text.Json.Tests consumes a lot of stack; the stack trace is several thousand frames deep. It hits a stack overflow some of the time. The non-deterministic stack consumption of tiered compilation and GC makes the repro non-deterministic. (cc @dotnet/area-system-text-json for awareness). This problem happened to be fixed a few days ago by #98007, which made the stack size on macOS larger.
- Unrelated crashes have sneaked in in the meantime because they had the same generic "exit code 137 means SIGKILL Killed" error message. New issues should be created for these as appropriate.
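To illustrate why unrelated crashes ended up under this issue: assuming the known-issue matcher does a plain substring check of `ErrorMessage` against the console log (an assumption about the infra, not something verified here), any process killed with SIGKILL matches, whatever the root cause. In the sketch below, `matches_known_issue` is a hypothetical helper and the second log is made up for illustration.

```python
ERROR_MESSAGE = "exit code 137 means SIGKILL Killed"


def matches_known_issue(console_log: str, error_message: str) -> bool:
    """Assumed matching rule: case-sensitive substring search (hypothetical)."""
    return error_message in console_log


# Tail of the System.Text.Json.Tests log quoted earlier in this issue.
json_tests_log = (
    "Starting: System.Text.Json.Tests (parallel test collections = on, max threads = 6)\n"
    "exit code 137 means SIGKILL Killed eg by kill\n"
)

# Hypothetical log from an unrelated test suite killed for a different reason.
unrelated_log = (
    "Starting: Some.Other.Tests (hypothetical suite name)\n"
    "exit code 137 means SIGKILL Killed eg by kill\n"
)

for name, log in [("json", json_tests_log), ("unrelated", unrelated_log)]:
    print(name, matches_known_issue(log, ERROR_MESSAGE))
```

Both logs match, which is why a more specific message or a separate issue is needed to tell the failures apart.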