runtime icon indicating copy to clipboard operation
runtime copied to clipboard

Tests crashing in CI with no dump: exit code 137 means SIGKILL Killed

Open jkotas opened this issue 1 year ago • 6 comments

Build Information

Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=527171 Build error leg or test failing: System.Text.Json.Tests Pull request: https://github.com/dotnet/runtime/pull/96894

Error Message

Fill the error message using step by step known issues guidance.

{
  "ErrorMessage": "exit code 137 means SIGKILL Killed",
  "ErrorPattern": "",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}

Known issue validation

Build: :mag_right: https://dev.azure.com/dnceng-public/public/_build/results?buildId=527171 Error message validated: exit code 137 means SIGKILL Killed Result validation: :white_check_mark: Known issue matched with the provided build. Validation performed at: 1/25/2024 7:09:47 PM UTC

Report

Build Definition Test Pull Request
562164 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98261
557869 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98134
562928 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98299
562011 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#96370
562893 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98303
562843 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98117
562846 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98281
562835 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98302
562830 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98301
562799 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98298
562788 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98199
562784 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98299
562774 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#97877
562752 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#97537
562684 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#97644
562536 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98277
562528 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98296
562540 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#97508
562456 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98226
562378 dotnet/runtime System.Runtime.Tests.WorkItemExecution
562382 dotnet/runtime System.Runtime.Tests.WorkItemExecution
562384 dotnet/runtime System.Runtime.Tests.WorkItemExecution
562379 dotnet/runtime System.Runtime.Tests.WorkItemExecution
562375 dotnet/runtime System.Runtime.Tests.WorkItemExecution
562424 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98294
562428 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#95001
562381 dotnet/runtime System.Runtime.Tests.WorkItemExecution
562376 dotnet/runtime System.Runtime.Tests.WorkItemExecution
562372 dotnet/runtime System.Runtime.Tests.WorkItemExecution
562412 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98293
562367 dotnet/runtime System.Runtime.Tests.WorkItemExecution
562374 dotnet/runtime System.Runtime.Tests.WorkItemExecution
562368 dotnet/runtime System.Runtime.Tests.WorkItemExecution
562380 dotnet/runtime System.Runtime.Tests.WorkItemExecution
562371 dotnet/runtime System.Runtime.Tests.WorkItemExecution
562395 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98087
562365 dotnet/runtime System.Runtime.Tests.WorkItemExecution
562392 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98291
562353 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98289
562319 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#96650
562305 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#97079
562250 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98152
562222 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#97976
562110 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98284
562076 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98283
562028 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98281
562017 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98263
562020 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98273
562005 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98278
561996 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#97537
561892 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#97508
561889 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98277
561839 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561838 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561834 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561851 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561855 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561836 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561837 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561854 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561848 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561847 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561842 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561841 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561844 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561840 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561792 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98187
561770 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98273
561747 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98275
561698 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98272
561682 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98207
561670 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#95001
561661 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98271
561624 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98207
561619 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98238
561609 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98196
554493 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#97415
561527 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98264
561530 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#97508
561474 dotnet/runtime System.Runtime.Numerics.Tests.WorkItemExecution
561461 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561458 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561457 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561476 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561464 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561469 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561477 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561459 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561456 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561466 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561467 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561462 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561471 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561405 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98233
561402 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98263
561350 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#97738
561326 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98261
561303 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98238
561296 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98048
561292 dotnet/runtime System.Text.Json.Tests.WorkItemExecution dotnet/runtime#98207
Displaying 100 of 1030 results

Summary

24-Hour Hit Count 7-Day Hit Count 1-Month Count
45 369 1030

jkotas avatar Jan 16 '24 21:01 jkotas

Tagging subscribers to this area: @dotnet/area-infrastructure-libraries See info in area-owners.md if you want to be subscribed.

Issue Details

Build Information

Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=527171 Build error leg or test failing: System.IO.Tests.File_GetSetTimes_SafeFileHandle.WritingShouldUpdateWriteTime_After_SetLastAccessTime Pull request: https://github.com/dotnet/runtime/pull/96894

Error Message

Fill the error message using step by step known issues guidance.

{
  "ErrorMessage": "exit code 137 means SIGKILL Killed eg by kill",
  "ErrorPattern": "",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}
Author: jkotas
Assignees: -
Labels:

area-Infrastructure-libraries, blocking-clean-ci, Known Build Error

Milestone: -

ghost avatar Jan 16 '24 21:01 ghost

  Discovering: System.Text.Json.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Text.Json.Tests (found 7275 of 7334 test cases)
  Starting:    System.Text.Json.Tests (parallel test collections = on, max threads = 6)
./RunTests.sh: line 180:  5589 Killed: 9               "$RUNTIME_PATH/dotnet" exec --runtimeconfig System.Text.Json.Tests.runtimeconfig.json --depsfile System.Text.Json.Tests.deps.json xunit.console.dll System.Text.Json.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing $RSP_FILE
/private/tmp/helix/working/B0B70943/w/A9E50974/e
----- end Mon Jan 15 01:12:38 PST 2024 ----- exit code 137 ----------------------------------------------------------
exit code 137 means SIGKILL Killed eg by kill

We need dumps to make this diagnosable.

jkotas avatar Jan 16 '24 21:01 jkotas

SIGKILL is a pretty unusual way to take down the process... do we know if there's anything in the infra which can produce a SIGKILL?

agocke avatar Jan 16 '24 21:01 agocke

Exit code 137 can be caused by OOM.

jkotas avatar Jan 16 '24 21:01 jkotas

It doesn't look like we have a mechanism to grab dumps if it is OOM, though: https://github.com/dotnet/runtime/issues/52521

agocke avatar Jan 16 '24 21:01 agocke

This seems to fail consistently on all PRs

EgorBo avatar Feb 07 '24 17:02 EgorBo

I was able to catch a live local repro and attach debugger to it. There is one run away thread with extremely deep stack trace. All other threads are waiting for the GC suspension to finish.

The run-away thread keeps allocating memory at very fast pace. You can see that by running top command in a second shell. Once it allocates about 100GB, the process gets killed.

The repro is sensitive to timing. It stopped reproing for me if I added any kind of verbose logging.

(lldb) bt
* thread #20
  * frame #0: 0x00007fff205944a8 libsystem_pthread.dylib`___chkstk_darwin + 96
    frame #1: 0x0000000000000020  * frame #0: 0x00007fff205944a8 libsystem_pthread.dylib`___chkstk_darwin + 96
    frame #1: 0x0000000000000020
    frame #2: 0x000000010a0370c8 libcoreclr.dylib`SEHProcessException(PAL_SEHException*) + 408
    frame #3: 0x000000010a0a0def libcoreclr.dylib`PAL_DispatchExceptionInner(_CONTEXT*, _EXCEPTION_RECORD*) + 191
    frame #4: 0x000000010a0a0cf8 libcoreclr.dylib`PAL_DispatchException + 72
    frame #5: 0x000000010a0a0694 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 10
    frame #6: 0x00007fff205944a8 libsystem_pthread.dylib`___chkstk_darwin + 96
    frame #7: 0x000000010a0370c8 libcoreclr.dylib`SEHProcessException(PAL_SEHException*) + 408
    frame #8: 0x000000010a0a0def libcoreclr.dylib`PAL_DispatchExceptionInner(_CONTEXT*, _EXCEPTION_RECORD*) + 191
    frame #9: 0x000000010a0a0cf8 libcoreclr.dylib`PAL_DispatchException + 72
    frame #10: 0x000000010a0a0694 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 10
    frame #11: 0x00007fff205944a8 libsystem_pthread.dylib`___chkstk_darwin + 96
    frame #12: 0x000000010a0370c8 libcoreclr.dylib`SEHProcessException(PAL_SEHException*) + 408
    frame #13: 0x000000010a0a0def libcoreclr.dylib`PAL_DispatchExceptionInner(_CONTEXT*, _EXCEPTION_RECORD*) + 191
    frame #14: 0x000000010a0a0cf8 libcoreclr.dylib`PAL_DispatchException + 72
    frame #15: 0x000000010a0a0694 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 10
    frame #16: 0x00007fff205944a8 libsystem_pthread.dylib`___chkstk_darwin + 96
    frame #17: 0x000000010a0370c8 libcoreclr.dylib`SEHProcessException(PAL_SEHException*) + 408
    frame #18: 0x000000010a0a0def libcoreclr.dylib`PAL_DispatchExceptionInner(_CONTEXT*, _EXCEPTION_RECORD*) + 191
    frame #19: 0x000000010a0a0cf8 libcoreclr.dylib`PAL_DispatchException + 72
    frame #20: 0x000000010a0a0694 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 10
    frame #21: 0x00007fff205944a8 libsystem_pthread.dylib`___chkstk_darwin + 96
    frame #22: 0x000000010a0370c8 libcoreclr.dylib`SEHProcessException(PAL_SEHException*) + 408
    frame #23: 0x000000010a0a0def libcoreclr.dylib`PAL_DispatchExceptionInner(_CONTEXT*, _EXCEPTION_RECORD*) + 191
    frame #24: 0x000000010a0a0cf8 libcoreclr.dylib`PAL_DispatchException + 72
    frame #25: 0x000000010a0a0694 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 10
...
    frame #299994: 0x000000010a0a0cf8 libcoreclr.dylib`PAL_DispatchException + 72
    frame #299995: 0x000000010a0a0694 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 10
    frame #299996: 0x00007fff205944a8 libsystem_pthread.dylib`___chkstk_darwin + 96
    frame #299997: 0x000000010a0370c8 libcoreclr.dylib`SEHProcessException(PAL_SEHException*) + 408
    frame #299998: 0x000000010a0a0def libcoreclr.dylib`PAL_DispatchExceptionInner(_CONTEXT*, _EXCEPTION_RECORD*) + 191
    frame #299999: 0x000000010a0a0cf8 libcoreclr.dylib`PAL_DispatchException + 72
(lldb)

jkotas avatar Feb 13 '24 17:02 jkotas

@janvorli Could you please take a look? It is hit by nearly all CI jobs and it looks related to your EH refactoring.

jkotas avatar Feb 13 '24 17:02 jkotas

I will take a look.

janvorli avatar Feb 13 '24 17:02 janvorli

@jkotas do you happen to know which of the tests in the suite was failing when you were able to repro it? I am currently trying to run the System.Text.Json.Tests in a loop locally on the current main.

janvorli avatar Feb 13 '24 20:02 janvorli

System.Text.Json.Tests, debug build of libraries, checked build of the runtime, native x64 macOS. I was not able to repro it with emulator on M1.

jkotas avatar Feb 13 '24 21:02 jkotas

There are multiple issues:

  • This type of crash does not produce dumps to allow diagnosing it (known infra gap)
  • Stackoverflow handling on macOS goes into infinite memory allocation loop. @janvorli is looking into it. I have opened #98477 on it.
  • DeepNestedJsonFileTest in System.Text.Json.Tests consumes a lot of stack, the stacktrace is several thousands frames deep. It hits stackoverflow some of the time. The non-deterministic stack consumption of tiered compilation and GC makes the repro non-deterministic. (cc @dotnet/area-system-text-json for awareness). This problem happened to be fixed by #98007 a few days ago that made the stack size on macOS larger.
  • Unrelated crashes have sneaked in the meantime because they had the same generic "exit code 137 means SIGKILL Killed" error message. New issues should be created for these as appropriate.

jkotas avatar Feb 15 '24 07:02 jkotas