Increase Staging RAM 1 → 2.5 GiB to avoid OOMs

Open cunnie opened this issue 1 year ago • 5 comments

The most recent stemcells (Jammy 1.351+) have introduced intermittent OOM (out-of-memory) failures when staging Cloud Foundry applications. We believe that the error is caused by a poor interaction between the Linux 6.5 kernel and the v1 control groups (cgroups) memory controller. Specifically, the kernel is not properly reclaiming memory.

To mitigate this, we're increasing the staging RAM, which greatly reduces the occurrence of the OOMs. One particular app, for example, would OOM 75% of the time during staging. After increasing the RAM limit, staging the app succeeded on all thirty attempts.
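As a rough sanity check on the thirty-for-thirty result, the sketch below computes how likely that streak would be if the 75% failure rate had not changed. It assumes each staging attempt is independent with a fixed failure probability, which is a simplification; the function name is ours, for illustration only.

```python
def prob_all_succeed(failure_rate: float, attempts: int) -> float:
    """Probability that every attempt succeeds.

    Assumption: attempts are independent and share one fixed failure rate.
    """
    return (1.0 - failure_rate) ** attempts


if __name__ == "__main__":
    # Figures from the description above: 75% failure rate, 30 attempts.
    p = prob_all_succeed(0.75, 30)
    print(f"Chance of 30 straight successes at a 75% failure rate: {p:.3e}")
```

Under that assumption the probability is 0.25^30, roughly 1 in 10^18, so the streak is very unlikely to be luck.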

We doubt that increasing the staging RAM limit will have a negative impact unless the user is in the habit of restaging all their applications at the same time. The staging cycle is short-lived, and though staging an app will reserve a greater amount of RAM, that RAM is released when the staging cycle completes, which happens within minutes.

There is an open GitHub issue. [0]

This staging RAM limit has not been updated in at least eight years. [1]

[0] https://github.com/cloudfoundry/bosh-linux-stemcell-builder/issues/318

[1] https://github.com/cloudfoundry/cloud_controller_ng/blame/e8fb8f31d0687e17a71833dd685d50a3929c77e6/bosh/jobs/cloud_controller_ng/spec#L637

Please take a moment to review the questions before submitting the PR

🚫 We only accept PRs to develop branch. If this is an exception, please specify why 🚫

WHAT is this change about?

We want to reduce the disruption caused by OOM (out-of-memory) errors during the staging of applications.

What customer problem is being addressed? Use customer persona to define the problem e.g. Alana is unable to...

With the advent of Jammy stemcell 1.351, users have begun experiencing OOMs during the staging of apps. The staging fails, disrupting the user's workflow.

Please provide any contextual information.

  • GitHub issue: https://github.com/cloudfoundry/bosh-linux-stemcell-builder/issues/318

Has a cf-deployment including this change passed cf-acceptance-tests?

  • [ ] YES
  • [X] NO

Does this PR introduce a breaking change? Please take a moment to read through the examples before answering the question.

  • [X] YES - please choose the category from below. Feel free to provide additional details.
  • [ ] NO
  1. increases VM footprint of cf-deployment - e.g. new jobs, new add ons, increases # of instances etc.

It increases the memory footprint of apps being staged. There is a corner case where it may introduce errors: if the user restages all their applications at the same time, and the Diego cells are memory-constrained, they may experience an "insufficient resources" error.

  1. changes the values

It increases the staging RAM limit from 1 GiB to 2.5 GiB.
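To put a number on the restage-everything corner case, here is a back-of-the-envelope sketch of how many staging tasks fit into a cell's free memory at the old and new limits. The 16 GiB of free memory is a hypothetical figure, not taken from any real deployment, and the floor division ignores everything else Diego considers when placing tasks (disk, running app instances, overcommit settings).

```python
MIB_PER_GIB = 1024

OLD_STAGING_LIMIT_MIB = 1 * MIB_PER_GIB         # 1 GiB
NEW_STAGING_LIMIT_MIB = int(2.5 * MIB_PER_GIB)  # 2.5 GiB = 2560 MiB


def max_concurrent_stagings(free_mib: int, staging_limit_mib: int) -> int:
    """Number of staging tasks whose memory reservations fit in free_mib."""
    return free_mib // staging_limit_mib


if __name__ == "__main__":
    free_mib = 16 * MIB_PER_GIB  # hypothetical free memory on one Diego cell
    print(max_concurrent_stagings(free_mib, OLD_STAGING_LIMIT_MIB))  # 16
    print(max_concurrent_stagings(free_mib, NEW_STAGING_LIMIT_MIB))  # 6
```

With these assumed numbers the same cell holds 6 concurrent stagings instead of 16, so a mass restage on a tight foundation is more likely to hit "insufficient resources".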

How should this change be described in cf-deployment release notes?

To address out-of-memory (OOM) errors during the staging process of apps on Jammy stemcells 1.351+, the staging RAM limit has been increased from 1 GiB to 2.5 GiB.

Does this PR introduce a new BOSH release into the base cf-deployment.yml manifest or any ops-files?

  • [ ] YES - please specify
  • [X] NO

Does this PR make a change to an experimental or GA'd feature/component?

  • [ ] experimental feature/component
  • [X] GA'd feature/component

Please provide Acceptance Criteria for this change?

A cf push does not fail during the staging process.

What is the level of urgency for publishing this change?

  • [ ] Urgent - unblocks current or future work
  • [X] Slightly Less than Urgent

Tag your pair, your PM, and/or team!

@jpalermo @aramprice

cunnie avatar Mar 04 '24 19:03 cunnie

I'll check this proposal together with the CAPI developers. Update: We'll discuss in today's TOC meeting how to proceed.

jochenehret avatar Mar 05 '24 07:03 jochenehret

This is a rather drastic memory regression. Are only staging containers affected, or will this also lead to OOMs for running apps (that ran stably before)?

I will raise this topic in the TOC meeting today. Maybe there are other solutions:

  • Provide an operations file instead of changing the default so that operators can (and have to) opt-in.
  • Is going back to LTS Kernel 5.15 an option (supported for 22.04 according to https://ubuntu.com/kernel/lifecycle)?

stephanme avatar Mar 05 '24 10:03 stephanme

@jochenehret :

We should provide a new ops file as opt-in solution for the increased memory limit.

Does "We" mean "me", as in, "Brian, please re-submit a PR using an opt-in solution using an ops file for the increased memory limit"?

Or does "We" mean "we-the-cf-deployment-approvers will take it from here; you (Brian) don't need to do anything"?

cunnie avatar Mar 06 '24 14:03 cunnie

With "we" I meant the whole CF community ;-)

If you have time to update this PR with a separate ops file that would of course be great. Note that if you introduce a new ops file, you have to add a new unit test here (should be trivial): https://github.com/cloudfoundry/cf-deployment/blob/main/units/tests/standard_test/operations.yml

In case you absolutely don't have time to change the PR, I could also do it :-)

jochenehret avatar Mar 06 '24 15:03 jochenehret

The OOM regression should be fixed with the Ubuntu Jammy stemcell v1.404 release: https://github.com/cloudfoundry/bosh-linux-stemcell-builder/releases/tag/ubuntu-jammy%2Fv1.404

This stemcell has been integrated into cf-deployment v39.5.0: https://github.com/cloudfoundry/cf-deployment/releases/tag/v39.5.0

We've already reverted the R buildpack test memory limit and didn't see any issues. Can we close this PR?
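For operators who want to check whether a given Jammy stemcell falls in the suspect range, a minimal sketch follows. It takes 1.351 as the first affected version and 1.404 as the first fixed one, as reported in this thread. Treating every version in between as affected is an assumption, and the helper names are hypothetical.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '1.351' into (1, 351) for comparison."""
    return tuple(int(part) for part in version.split("."))


def in_suspect_range(
    version: str,
    first_affected: str = "1.351",  # from the PR description
    first_fixed: str = "1.404",     # from the comment above
) -> bool:
    """True if first_affected <= version < first_fixed (assumed range)."""
    return (
        parse_version(first_affected)
        <= parse_version(version)
        < parse_version(first_fixed)
    )


if __name__ == "__main__":
    for v in ("1.340", "1.351", "1.390", "1.404"):
        print(v, in_suspect_range(v))
```

Comparing parsed integer tuples rather than raw strings avoids ordering mistakes such as "1.99" sorting after "1.351".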

jochenehret avatar Mar 25 '24 08:03 jochenehret

Closing PR as a fixed stemcell is available.

jochenehret avatar Jun 04 '24 06:06 jochenehret