runtime icon indicating copy to clipboard operation
runtime copied to clipboard

System.IO.Net5Compat.Tests and System.IO.Tests suddenly exiting with error 137

Open carlossanlop opened this issue 1 year ago • 3 comments
trafficstars

The System.IO.Net5Compat.Tests and the System.IO.Tests test processes are intermittengly getting killed on Linux shortly after starting, and the exit code is 137.

Build Information

Build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=627407 Build error leg or test failing: System.IO.Net5Compat.Tests

Error Message

{
  "ErrorPattern": ["Starting:    System\\.IO\\.(Net5Compat\\.)?Tests", "exit code 137"],
  "BuildRetry" : true,
  "ExcludeConsoleLog" : false
}

System.IO.Net5Compat.Tests example

  • PR: https://github.com/dotnet/runtime/pull/100545
  • Queue: Libraries Test Run release coreclr Linux_musl x64 Debug
  • Job result: https://dev.azure.com/dnceng-public/public/_build/results?buildId=627407&view=logs&j=73b7db51-6fbc-594d-25c9-15b2737b0bf6
  • Log file: https://helixre107v0xd1eu3ibi6ka.blob.core.windows.net/dotnet-runtime-refs-pull-100545-merge-f07349d435f3464e8b/System.IO.Net5Compat.Tests/1/console.71f42a07.log?helixlogtype=result
  • Output:
===========================================================================================================
/root/helix/work/workitem/e /root/helix/work/workitem/e
  Discovering: System.IO.Net5Compat.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.IO.Net5Compat.Tests (found 679 of 685 test cases)
  Starting:    System.IO.Net5Compat.Tests (parallel test collections = on, max threads = 2)
./RunTests.sh: line 162:    25 Killed                  "$RUNTIME_PATH/dotnet" exec --runtimeconfig System.IO.Net5Compat.Tests.runtimeconfig.json --depsfile System.IO.Net5Compat.Tests.deps.json xunit.console.dll System.IO.Net5Compat.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing $RSP_FILE
/root/helix/work/workitem/e
----- end Tue Apr 2 20:20:02 UTC 2024 ----- exit code 137 ----------------------------------------------------------

System.IO.Test example

  • PR: https://github.com/dotnet/runtime/pull/100545
  • Queue: Libraries Test Run release mono Linux x64 Debug
  • Job result: https://dev.azure.com/dnceng-public/public/_build/results?buildId=627407&view=logs&j=b4344b0d-0f92-5d69-ccaf-e0b24fbf14a2&t=65a58faa-a542-5344-b5c8-897139f242cd&l=888
  • Log file: https://helixre107v0xd1eu3ibi6ka.blob.core.windows.net/dotnet-runtime-refs-pull-100545-merge-f07349d435f3464e8b/System.IO.Tests/1/console.effe990e.log?helixlogtype=result
  • Output:
===========================================================================================================
/root/helix/work/workitem/e /root/helix/work/workitem/e
  Discovering: System.IO.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.IO.Tests (found 679 of 685 test cases)
  Starting:    System.IO.Tests (parallel test collections = on, max threads = 2)
./RunTests.sh: line 162:    25 Killed                  "$RUNTIME_PATH/dotnet" exec --runtimeconfig System.IO.Tests.runtimeconfig.json --depsfile System.IO.Tests.deps.json xunit.console.dll System.IO.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing $RSP_FILE
/root/helix/work/workitem/e
----- end Tue Apr 2 20:20:10 UTC 2024 ----- exit code 137 ----------------------------------------------------------

Known issue validation

Build: :mag_right: https://dev.azure.com/dnceng-public/public/_build/results?buildId=627407 Error message validated: [Starting: System\.IO\.(Net5Compat\.)?Tests exit code 137] Result validation: :white_check_mark: Known issue matched with the provided build. Validation performed at: 4/2/2024 11:08:28 PM UTC

Report

Build Definition Test Pull Request
724949 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103635
724846 dotnet/runtime System.IO.Tests.WorkItemExecution
724783 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103837
724760 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#104170
724639 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103317
724600 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103444
724483 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#104177
723328 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103937
724348 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#102705
724264 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#104065
724203 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#104103
724194 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103752
724191 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#99871
724008 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#99212
723885 dotnet/runtime System.IO.Tests.WorkItemExecution
723907 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103889
722386 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#104078
723595 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#104130
723068 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#104118
722821 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#104104
722771 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103144
721977 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#104066
721370 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#104039
721001 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#104017
720992 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103412
720979 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#104025
720956 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#104024
720898 dotnet/runtime System.IO.Tests.WorkItemExecution
720889 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103989
720862 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#104016
720564 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103935
720276 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103935
719981 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103895
719460 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103953
719275 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#102704
719238 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#102403
719025 dotnet/runtime System.IO.Net5Compat.Tests.WorkItemExecution dotnet/runtime#103935
718783 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103739
718528 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#100745
718427 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103567
718400 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103769
718321 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103820
718293 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103766
718230 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103801
718204 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103890
718172 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#102739
718165 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103860
718038 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103635
717916 dotnet/runtime System.IO.Net5Compat.Tests.WorkItemExecution dotnet/runtime#103895
717900 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103871
717903 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#102948
717815 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103890
717806 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103889
717653 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103852
717475 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103635
717434 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103868
717123 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103778
717075 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103857
717021 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103836
716966 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103216
716753 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103701
716749 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103852
716644 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103837
716544 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103848
716302 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103838
716259 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103361
716215 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103655
716194 dotnet/runtime System.IO.Tests.WorkItemExecution
716206 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103769
716128 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103835
716140 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103668
716062 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103663
716050 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103673
716155 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103827
715906 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#100379
715864 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103567
715775 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103755
715765 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103574
715780 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#100334
715431 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103805
715411 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103701
715323 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103794
715173 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103681
715115 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103791
715076 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103790
714003 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103412
713594 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103737
713565 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103728
713355 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103724
713550 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103735
713553 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103673
713121 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103710
713111 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103673
713005 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#102739
713022 dotnet/runtime System.IO.Tests.WorkItemExecution
712742 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#103697
712674 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#102248
712534 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#100334
712474 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#102144
712354 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#85694
Displaying 100 of 298 results

Summary

24-Hour Hit Count 7-Day Hit Count 1-Month Count
9 60 298

carlossanlop avatar Apr 02 '24 22:04 carlossanlop

Tagging subscribers to this area: @dotnet/area-system-io See info in area-owners.md if you want to be subscribed.

@dotnet/area-system-io there are a lot of hits on this and relatively recent. It seems to me to be happening across many configurations. I think it's worth having a look.

ericstj avatar Apr 12 '24 01:04 ericstj

The System.IO.Net5Compat.Tests and the System.IO.Tests test processes are intermittengly getting killed on Linux shortly after starting, and the exit code is 137.

137 means out of memory. We have not made any changes to 6.0 in System.IO, so I expect that either there was some infra change (like less memory available) or a bug was introduced in the product itself. The bug would be specific to Linux.

@carlossanlop is it possible to perform some kind of binary search based on the merged PRs and when it started to fail?

adamsitnik avatar Apr 12 '24 10:04 adamsitnik

@adamsitnik @Jozkee This is one of the most impactful failures in servicing. It only affects System.IO.Tests and System.IO.Net5Compat.Tests. Any chance you can take a look soon?

carlossanlop avatar Jul 16 '24 00:07 carlossanlop

@carlossanlop sure, but could you please answer the question I've asked in https://github.com/dotnet/runtime/issues/100558#issuecomment-2051457144 ?

adamsitnik avatar Jul 16 '24 07:07 adamsitnik

Sorry, I missed that question. Yes, you can use Kusto. David has used it many times in the past.

carlossanlop avatar Jul 16 '24 17:07 carlossanlop

This is the super basic kusto query you can execute if looking via issue:

TestKnownIssues
| union KnownIssues
| where IssueId == ""

This database stores data from the last 4 months so hopefully there's still info from April.

This is the cluster where you would look for that info: https://dataexplorer.azure.com/clusters/dotnetperf.westus/databases/PerformanceData

Unfortunately it seems that failure data is not stored if it's not linked to an issue.

Thanks @AlitzelMendez for the above info.

carlossanlop avatar Jul 16 '24 18:07 carlossanlop

This test is failing a lot with 33 hits over the past 24 hours. We need to bring this back into 9.0.0, get it resolved, and plan to backport whatever change we make to the release/9.0 branch to clean up the failures there.

jeffhandley avatar Aug 21 '24 15:08 jeffhandley

I suspect it is this test

https://github.com/dotnet/runtime/blob/2694613ae7463000797c8229d004fc8172e19147/src/libraries/System.Runtime/tests/System.IO.Tests/MemoryStream/MemoryStreamTests.cs#L103-L107

It is already disabled and noted to be problematic in certain environments. I don't know how much memory the ADO containers have, but this test does a couple of 2GB allocations.

I suspect you are just hitting the CoreCLR version of this Mono failure. https://github.com/dotnet/runtime/issues/100225

vcsjones avatar Aug 29 '24 21:08 vcsjones

That is the only test in System.IO.Tests that does any significant memory allocation that I was able to observe.

vcsjones avatar Aug 29 '24 21:08 vcsjones

this test does a couple of 2GB allocations.

It's most likely one of the tests that causes the OOM 👍

But I am not sure that it's the only one:

  • https://github.com/dotnet/runtime/pull/88572 which added it was merged for .NET 9, while other branches including 6.0 fail too
  • this should manifest as a managed OOM that does not take the testing app down?

adamsitnik avatar Aug 30 '24 09:08 adamsitnik

this should manifest as a managed OOM that does not take the testing app down?

This test failure looks like the Linux OOM killer. The .NET process was able to allocate memory, but Linux shortly later ran out of memory. When that happens, Linux runs the OOM killer to start taking processes.

See https://www.kernel.org/doc/gorman/html/understand/understand016.html for more information.

The OOM killer decided that the .NET process was the right one to take down.

vcsjones avatar Aug 30 '24 10:08 vcsjones

@vcsjones thanks, I was not aware of that! (BTW it sucks as in a way it hides quite important information like stacktrace of the method that caused OOM)

adamsitnik avatar Aug 30 '24 10:08 adamsitnik

I think that test was contributing to the problem. The issue is still occurring in the release/9.0 branch which is why the 24-hour cell is not zero.

vcsjones avatar Sep 01 '24 02:09 vcsjones