runtime
runtime copied to clipboard
System.IO.Net5Compat.Tests and System.IO.Tests suddenly exiting with error 137
The System.IO.Net5Compat.Tests and the System.IO.Tests test processes are intermittengly getting killed on Linux shortly after starting, and the exit code is 137.
Build Information
Build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=627407 Build error leg or test failing: System.IO.Net5Compat.Tests
Error Message
{
"ErrorPattern": ["Starting: System\\.IO\\.(Net5Compat\\.)?Tests", "exit code 137"],
"BuildRetry" : true,
"ExcludeConsoleLog" : false
}
System.IO.Net5Compat.Tests example
- PR: https://github.com/dotnet/runtime/pull/100545
- Queue:
Libraries Test Run release coreclr Linux_musl x64 Debug - Job result: https://dev.azure.com/dnceng-public/public/_build/results?buildId=627407&view=logs&j=73b7db51-6fbc-594d-25c9-15b2737b0bf6
- Log file: https://helixre107v0xd1eu3ibi6ka.blob.core.windows.net/dotnet-runtime-refs-pull-100545-merge-f07349d435f3464e8b/System.IO.Net5Compat.Tests/1/console.71f42a07.log?helixlogtype=result
- Output:
===========================================================================================================
/root/helix/work/workitem/e /root/helix/work/workitem/e
Discovering: System.IO.Net5Compat.Tests (method display = ClassAndMethod, method display options = None)
Discovered: System.IO.Net5Compat.Tests (found 679 of 685 test cases)
Starting: System.IO.Net5Compat.Tests (parallel test collections = on, max threads = 2)
./RunTests.sh: line 162: 25 Killed "$RUNTIME_PATH/dotnet" exec --runtimeconfig System.IO.Net5Compat.Tests.runtimeconfig.json --depsfile System.IO.Net5Compat.Tests.deps.json xunit.console.dll System.IO.Net5Compat.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing $RSP_FILE
/root/helix/work/workitem/e
----- end Tue Apr 2 20:20:02 UTC 2024 ----- exit code 137 ----------------------------------------------------------
System.IO.Test example
- PR: https://github.com/dotnet/runtime/pull/100545
- Queue:
Libraries Test Run release mono Linux x64 Debug - Job result: https://dev.azure.com/dnceng-public/public/_build/results?buildId=627407&view=logs&j=b4344b0d-0f92-5d69-ccaf-e0b24fbf14a2&t=65a58faa-a542-5344-b5c8-897139f242cd&l=888
- Log file: https://helixre107v0xd1eu3ibi6ka.blob.core.windows.net/dotnet-runtime-refs-pull-100545-merge-f07349d435f3464e8b/System.IO.Tests/1/console.effe990e.log?helixlogtype=result
- Output:
===========================================================================================================
/root/helix/work/workitem/e /root/helix/work/workitem/e
Discovering: System.IO.Tests (method display = ClassAndMethod, method display options = None)
Discovered: System.IO.Tests (found 679 of 685 test cases)
Starting: System.IO.Tests (parallel test collections = on, max threads = 2)
./RunTests.sh: line 162: 25 Killed "$RUNTIME_PATH/dotnet" exec --runtimeconfig System.IO.Tests.runtimeconfig.json --depsfile System.IO.Tests.deps.json xunit.console.dll System.IO.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing $RSP_FILE
/root/helix/work/workitem/e
----- end Tue Apr 2 20:20:10 UTC 2024 ----- exit code 137 ----------------------------------------------------------
Known issue validation
Build: :mag_right: https://dev.azure.com/dnceng-public/public/_build/results?buildId=627407
Error message validated: [Starting: System\.IO\.(Net5Compat\.)?Tests exit code 137]
Result validation: :white_check_mark: Known issue matched with the provided build.
Validation performed at: 4/2/2024 11:08:28 PM UTC
Report
Summary
| 24-Hour Hit Count | 7-Day Hit Count | 1-Month Count |
|---|---|---|
| 9 | 60 | 298 |
Tagging subscribers to this area: @dotnet/area-system-io See info in area-owners.md if you want to be subscribed.
@dotnet/area-system-io there are a lot of hits on this and relatively recent. It seems to me to be happening across many configurations. I think it's worth having a look.
The System.IO.Net5Compat.Tests and the System.IO.Tests test processes are intermittengly getting killed on Linux shortly after starting, and the exit code is 137.
137 means out of memory. We have not made any changes to 6.0 in System.IO, so I expect that either there was some infra change (like less memory available) or a bug was introduced in the product itself. The bug would be specific to Linux.
@carlossanlop is it possible to perform some kind of binary search based on the merged PRs and when it started to fail?
@adamsitnik @Jozkee This is one of the most impactful failures in servicing. It only affects System.IO.Tests and System.IO.Net5Compat.Tests. Any chance you can take a look soon?
@carlossanlop sure, but could you please answer the question I've asked in https://github.com/dotnet/runtime/issues/100558#issuecomment-2051457144 ?
Sorry, I missed that question. Yes, you can use Kusto. David has used it many times in the past.
This is the super basic kusto query you can execute if looking via issue:
TestKnownIssues
| union KnownIssues
| where IssueId == ""
This database stores data from the last 4 months so hopefully there's still info from April.
This is the cluster where you would look for that info: https://dataexplorer.azure.com/clusters/dotnetperf.westus/databases/PerformanceData
Unfortunately it seems that failure data is not stored if it's not linked to an issue.
Thanks @AlitzelMendez for the above info.
This test is failing a lot with 33 hits over the past 24 hours. We need to bring this back into 9.0.0, get it resolved, and plan to backport whatever change we make to the release/9.0 branch to clean up the failures there.
I suspect it is this test
https://github.com/dotnet/runtime/blob/2694613ae7463000797c8229d004fc8172e19147/src/libraries/System.Runtime/tests/System.IO.Tests/MemoryStream/MemoryStreamTests.cs#L103-L107
It is already disabled and noted to be problematic in certain environments. I don't know how much memory the ADO containers have, but this test does a couple of 2GB allocations.
I suspect you are just hitting the CoreCLR version of this Mono failure. https://github.com/dotnet/runtime/issues/100225
That is the only test in System.IO.Tests that does any significant memory allocation that I was able to observe.
this test does a couple of 2GB allocations.
It's most likely one of the tests that causes the OOM 👍
But I am not sure that it's the only one:
- https://github.com/dotnet/runtime/pull/88572 which added it was merged for .NET 9, while other branches including 6.0 fail too
- this should manifest as a managed OOM that does not take the testing app down?
this should manifest as a managed OOM that does not take the testing app down?
This test failure looks like the Linux OOM killer. The .NET process was able to allocate memory, but Linux shortly later ran out of memory. When that happens, Linux runs the OOM killer to start taking processes.
See https://www.kernel.org/doc/gorman/html/understand/understand016.html for more information.
The OOM killer decided that the .NET process was the right one to take down.
@vcsjones thanks, I was not aware of that! (BTW it sucks as in a way it hides quite important information like stacktrace of the method that caused OOM)
I think that test was contributing to the problem. The issue is still occurring in the release/9.0 branch which is why the 24-hour cell is not zero.