Increase Staging RAM 1 → 2.5 GiB to avoid OOMs
The most recent set of stemcells (Jammy 1.351+) has introduced intermittent OOM (out-of-memory) failures when staging Cloud Foundry applications. We believe the error is caused by a poor interaction between the Linux 6.5 kernel and the cgroups (control groups) v1 memory controller. Specifically, the kernel is not properly reclaiming memory.
To mitigate this, we're increasing the staging RAM limit, which greatly reduces the occurrence of the OOMs. One particular app, for example, would OOM 75% of the time during staging. After we increased the RAM limit, staging the app succeeded in all thirty attempts.
We doubt increasing the staging RAM limit will have a negative impact unless the user is in the habit of restaging all their applications at the same time. The staging cycle is short-lived: although staging an app will reserve a greater amount of RAM, that RAM is released when the staging cycle completes, which happens within minutes.
There is an open GitHub issue. [0]
This staging RAM limit has not been updated in at least eight years. [1]
[0] https://github.com/cloudfoundry/bosh-linux-stemcell-builder/issues/318
[1] https://github.com/cloudfoundry/cloud_controller_ng/blame/e8fb8f31d0687e17a71833dd685d50a3929c77e6/bosh/jobs/cloud_controller_ng/spec#L637
Please take a moment to review the questions before submitting the PR
🚫 We only accept PRs to develop branch. If this is an exception, please specify why 🚫
WHAT is this change about?
We want to reduce the disruption caused by OOM (out-of-memory) errors during the staging of applications.
What customer problem is being addressed? Use customer persona to define the problem e.g. Alana is unable to...
With the advent of Jammy stemcell 1.351, users have begun experiencing OOM errors during the staging of apps. The staging fails, disrupting the user's workflow.
Please provide any contextual information.
- GitHub issue: https://github.com/cloudfoundry/bosh-linux-stemcell-builder/issues/318
Has a cf-deployment including this change passed cf-acceptance-tests?
- [ ] YES
- [X] NO
Does this PR introduce a breaking change? Please take a moment to read through the examples before answering the question.
- [X] YES - please choose the category from below. Feel free to provide additional details.
- [ ] NO
- increases VM footprint of cf-deployment - e.g. new jobs, new add ons, increases # of instances etc.
It increases the memory footprint of apps being staged. There is a corner case in which it may introduce errors: if the user restages all their applications at the same time and the Diego cells are memory-constrained, they may experience an "insufficient resources" error.
- changes the values
It increases the staging RAM limit from 1 GiB to 2.5 GiB.
How should this change be described in cf-deployment release notes?
To address out-of-memory (OOM) errors during the staging process of apps on Jammy stemcells 1.351+, the staging RAM limit has been increased from 1 GiB to 2.5 GiB.
Does this PR introduce a new BOSH release into the base cf-deployment.yml manifest or any ops-files?
- [ ] YES - please specify
- [X] NO
Does this PR make a change to an experimental or GA'd feature/component?
- [ ] experimental feature/component
- [X] GA'd feature/component
Please provide Acceptance Criteria for this change?
A cf push does not fail during the staging process.
What is the level of urgency for publishing this change?
- [ ] Urgent - unblocks current or future work
- [X] Slightly Less than Urgent
Tag your pair, your PM, and/or team!
@jpalermo @aramprice
I'll check this proposal together with the capi developers. Update: We'll discuss in today's TOC meeting how to proceed.
This is a rather drastic memory regression. Are only staging containers affected, or will this also lead to OOMs for running apps (that ran stably before)?
I will raise this topic in the TOC meeting today. Maybe there are other solutions:
- Provide an operations file instead of changing the default so that operators can (and have to) opt-in.
- Is going back to LTS Kernel 5.15 an option (supported for 22.04 according to https://ubuntu.com/kernel/lifecycle)?
@jochenehret :
We should provide a new ops file as opt-in solution for the increased memory limit.
Does "We" mean "me", as in, "Brian, please re-submit a PR using an opt-in solution using an ops file for the increased memory limit"?
Or does "We" mean "we-the-cf-deployment-approvers will take it from here; you (Brian) don't need to do anything"?
With "we" I meant the whole CF community ;-)
If you have time to update this PR with a separate ops file that would of course be great. Note that if you introduce a new ops file, you have to add a new unit test here (should be trivial): https://github.com/cloudfoundry/cf-deployment/blob/main/units/tests/standard_test/operations.yml
In case you absolutely don't have time to change the PR, I could also do it :-)
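For reference, the opt-in ops file discussed above might look roughly like the sketch below. It rests on assumptions that should be checked against the cloud_controller_ng job spec linked in [1]: that the limit is controlled by the `cc.staging.minimum_staging_memory_mb` property, that the value is expressed in MiB (2.5 GiB = 2560), and that setting it on the `cloud_controller_ng` job in the `api` instance group is sufficient. The file name is hypothetical.

```yaml
# operations/increase-staging-memory-limit.yml  (hypothetical file name)
# Opt-in: raises the staging RAM limit from the 1 GiB default to 2.5 GiB.
# The property name and the instance group/job path are assumptions;
# verify them against the cloud_controller_ng job spec before use.
- type: replace
  path: /instance_groups/name=api/jobs/name=cloud_controller_ng/properties/cc/staging?/minimum_staging_memory_mb
  value: 2560  # 2.5 GiB expressed in MiB (2.5 * 1024)
```

Operators would apply it with `-o` at deploy time, so deployments that do not include the file keep the existing 1 GiB default.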
The OOM regression should be fixed with the Ubuntu Jammy stemcell v1.404 release (https://github.com/cloudfoundry/bosh-linux-stemcell-builder/releases/tag/ubuntu-jammy%2Fv1.404). This stemcell has been integrated into cf-deployment v39.5.0 (https://github.com/cloudfoundry/cf-deployment/releases/tag/v39.5.0). We've already reverted the R buildpack test memory limit and didn't see any issues. Can we close this PR?
Closing PR as a fixed stemcell is available.