
Insufficient memory of docker containers on CI

Open fanyang-mono opened this issue 2 years ago • 23 comments

Build

https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=351450

Build leg reported

Build / browser-wasm linux Release LibraryTests / Build product

Pull Request

https://github.com/dotnet/runtime/pull/89217

Known issue core information

Fill out the known issue JSON section by following the step-by-step documentation on how to create a known issue

 {
    "ErrorMessage" : "[error]Exit code 137 returned from process: file name '/usr/bin/docker'",
    "BuildRetry": false,
    "ErrorPattern": "",
    "ExcludeConsoleLog": false
 }

@dotnet/dnceng

Release Note Category

  • [ ] Feature changes/additions
  • [ ] Bug fixes
  • [ ] Internal Infrastructure Improvements

Release Note Description

Additional information about the issue reported

No response

Report

Summary

24-Hour Hit Count    7-Day Hit Count    1-Month Count
0                    0                  0

Known issue validation

Build: :mag_right: https://dev.azure.com/dnceng-public/public/_build/results?buildId=351450
Error message validated: [error]Exit code 137 returned from process: file name '/usr/bin/docker'
Result validation: :white_check_mark: Known issue matched with the provided build.
Validation performed at: 7/26/2023 2:43:39 PM UTC

fanyang-mono avatar Jul 25 '23 17:07 fanyang-mono

Hello @fanyang-mono, could you please update the "ErrorMessage": "" field by following the step-by-step documentation on how to create a known issue?

andriipatsula avatar Jul 26 '23 08:07 andriipatsula

Updated.

fanyang-mono avatar Jul 26 '23 14:07 fanyang-mono

It's likely your process is using too much memory. Check to see when this started and if there were code changes around that time that could have caused this to occur.

https://www.airplane.dev/blog/exit-code-137

missymessa avatar Jul 27 '23 17:07 missymessa
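
One way to check when this started on a given agent: Docker records whether the kernel OOM-killed a container, and the kernel log names the processes it killed. A minimal sketch, assuming you can shell into the agent while or after the leg runs (the container name below is a placeholder):

    # Placeholder container name; substitute the actual CI container ID.
    CONTAINER=ci-build-container

    # True if the container's main process was OOM-killed, plus its exit code.
    # Note: OOMKilled only reflects the main process; kills of child processes
    # show up in the kernel log instead.
    docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' "$CONTAINER"

    # Kernel OOM-killer entries name the process that was killed.
    dmesg | grep -i -E 'killed process|out of memory'

    # Live per-container memory usage while the build is running.
    docker stats --no-stream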

@fanyang-mono, is this an infra issue? It looks like the errors are isolated to Runtime.

missymessa avatar Jul 27 '23 18:07 missymessa

@lewing Could you please confirm that this is a wasm build issue? This is the direct link to the build log https://dev.azure.com/dnceng-public/public/_build/results?buildId=351450&view=logs&j=d4e38924-13a0-58bd-9074-6a4810543e7c&t=102a6595-1420-53fc-8f17-b0a3f4b1242a&l=5722

fanyang-mono avatar Jul 27 '23 18:07 fanyang-mono

https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_apis/build/builds/352553/logs/541 is definitely not a wasm build issue

lewing avatar Jul 27 '23 20:07 lewing

Exit code 137 typically means the process was sent a SIGKILL: 128 + 9 = 137. Given that this is happening inside Docker containers, it is likely because they are hitting resource limits.

lewing avatar Jul 27 '23 20:07 lewing
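
For reference, the exit-code arithmetic above can be reproduced locally; both of these print 137, the first from an explicit SIGKILL and the second from the kernel OOM-killing a memory-limited container (the alpine image and 64m limit are only illustrative):

    # A process terminated by SIGKILL (signal 9) reports status 128 + 9 = 137.
    sh -c 'kill -KILL $$'; echo "exit status: $?"

    # The same status appears when a container exceeds its memory limit and
    # the kernel OOM-kills the process.
    docker run --rm --memory=64m alpine sh -c 'tail /dev/zero'; echo "exit status: $?"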

what are the limits on the cloudtest containers?

lewing avatar Jul 27 '23 20:07 lewing

Based on the tracking, we're seeing failures across multiple unrelated lanes (although they tend to be LLVM-related lanes). This is going to continue to cause pain unless we can get some idea of which processes are using memory at the point the container is killed.

lewing avatar Aug 12 '23 22:08 lewing

@missymessa It would be very helpful to know what the limits are on the container. We might be running too close to the limits, in which case it would be helpful to have those bumped up.

radical avatar Aug 12 '23 23:08 radical
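
If nobody knows the configured limit offhand, it can be read directly: from inside the container via the cgroup files, or from the host via docker inspect. The paths below depend on whether the host uses cgroup v1 or v2:

    # Inside the container, cgroup v2 ("max" means unlimited):
    cat /sys/fs/cgroup/memory.max 2>/dev/null

    # Inside the container, cgroup v1:
    cat /sys/fs/cgroup/memory/memory.limit_in_bytes 2>/dev/null

    # From the host, the limit configured on the container (0 means unlimited):
    docker inspect --format '{{.HostConfig.Memory}}' <container-id>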

@dotnet/dnceng this is causing considerable pain; how should we escalate it? We can't diagnose the failures across multiple lanes and different runtimes without more detail.

lewing avatar Aug 13 '23 02:08 lewing

previous teams dealing w/ exit code 137 have worked w/ people on the runtime team to collect crash dumps and determine the root cause. it's also likely something changed in the runtime repo about a month ago that led to this issue.

dougbu avatar Aug 13 '23 02:08 dougbu

@dougbu the failures here are fairly random and span very different runtimes so a crash dump isn't likely to be deterministic. I would love to see the state of the container at shutdown time.

lewing avatar Aug 13 '23 02:08 lewing
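
A rough way to get that state without a dump is to sample per-container memory usage for the life of the build and look at the last samples before the 137 exit. A sketch, assuming the sampler can be started alongside the build step (interval and output path are arbitrary):

    # Log per-container memory usage every 10 seconds until the job ends; the
    # final entries show what was consuming memory when the container was killed.
    while true; do
      date -u +%FT%TZ
      docker stats --no-stream --format 'table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}'
      sleep 10
    done > container-mem.log 2>&1 &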

cc @agocke for the nativeAOT failures

lewing avatar Aug 13 '23 02:08 lewing

@dougbu or edit the core information to retry, I can't

lewing avatar Aug 13 '23 02:08 lewing

also https://github.com/dotnet/runtime/issues/89402

lewing avatar Aug 13 '23 03:08 lewing

@lewing we don't have much to go on here. for one thing, we don't mess w/ "limits" in the Helix queues other than the file count maximum.

suggest you use the helix-repro-vms DevTest Labs to create a VM matching the queue used in your tests. then, do whatever you can to run the tests on that VM in a way that captures a dump. the dump should at least indicate what is causing the exit code. note the core dump should be created in the main process, not w/in the Docker container. I believe @agocke has experience using dumps to debug occasional build and test strangenesses.

we can increase whatever limit appears to be the problem, within limits.

dougbu avatar Aug 14 '23 23:08 dougbu
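
For the dump-capture step on a repro VM, one option is the runtime's built-in crash-dump environment variables, or dotnet-dump against a live process. The output paths and dump type below are illustrative, and note that a process killed outright by the OOM killer (SIGKILL) gets no chance to write a dump, so this is mainly useful when reproducing under a raised limit or catching the memory growth before the kill:

    # Have any .NET process launched from this shell write a full dump on crash.
    export DOTNET_DbgEnableMiniDump=1
    export DOTNET_DbgMiniDumpType=4            # 4 = full dump
    export DOTNET_DbgMiniDumpName=/dumps/coredump.%p.dmp

    # Or capture a dump of an already-running process by PID.
    dotnet-dump collect -p <pid> --type Full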

on test retries, please consider changing your eng/test-configuration.json file. that's documented in https://github.com/dotnet/arcade/blob/d3b8861e20aaf0179034c6076d156e2442b26f9b/src/Microsoft.DotNet.Helix/Sdk/Readme.md#test-retry and dotnet/runtime's file already automatically retries based on a handful of error messages

dougbu avatar Aug 14 '23 23:08 dougbu
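
For what it's worth, a minimal sketch of what such a retry rule looks like per the linked readme (the failure-message text is illustrative; check dotnet/runtime's existing eng/test-configuration.json for the exact schema and the rules already in place):

    {
      "version": 1,
      "defaultOnFailure": "fail",
      "retryOnRules": [
        { "failureMessage": { "contains": "Exit code 137" } }
      ],
      "failOnRules": [],
      "retryCount": 2
    }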

oh, btw, if it's a true memory restriction as dotnet/runtime#89402 was, we might be able to bump things up. however there might not be budget and the problem certainly isn't related to a decrease in anything on our side. more likely the test count or memory footprint went up before this issue was observed. if that's the case, the most straightforward fix would be to split a large test project in two

dougbu avatar Aug 14 '23 23:08 dougbu

According to the table, the linux-x64 Mono LLVMFullAot RuntimeTests lane also frequently runs out of memory in its Docker container during AOT compilation.

fanyang-mono avatar Aug 24 '23 15:08 fanyang-mono

Should this be moved to the runtime repo since it only affects that repo, especially since we're waiting for additional information while they check a repro VM?

riarenas avatar Aug 24 '23 15:08 riarenas

@lewing please move this to the runtime repo (and, perhaps, work using the helix-repro-vms to narrow the issue down). when you've found a specific action to take, please describe it in the First Responders channel. we may have a way to bump limits but it's more likely the runtime team will need to reduce or simplify something to resolve this issue.

dougbu avatar Nov 23 '23 00:11 dougbu

ping @lewing. we're still hitting this problem occasionally but I'm not seeing anything outside runtime builds. there might be some change we could make but we don't have any information on our side. if you have a suggestion…

dougbu avatar Jan 08 '24 21:01 dougbu