[InPlacePodVerticalScaling] Safer memory limit decrease

Open natasha41575 opened this issue 3 weeks ago • 4 comments

Today, when memory limits are decreased, the kubelet performs best-effort OOM-kill prevention by first checking whether the new limit is below the current usage and blocking the resize if so. This is subject to a TOCTOU race condition.

We can reduce the TOCTOU race window by moving the check into the runtime.
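Roughly, the existing kubelet-side guard amounts to the following (a simplified sketch, not the actual kubelet code; the function and variable names are illustrative):

```go
package sketch

import (
	"errors"
	"fmt"
)

// readMemoryUsage stands in for however the kubelet obtains current usage
// (e.g. runtime/cAdvisor stats); the lookup is elided in this sketch.
func readMemoryUsage(containerID string) (uint64, error) {
	return 0, errors.New("stats lookup elided in this sketch")
}

// canDecreaseMemoryLimit is the best-effort guard: it only reflects usage at
// the moment of the check, so usage can still climb past newLimitBytes before
// the runtime applies the new cgroup limit; that gap is the TOCTOU window.
func canDecreaseMemoryLimit(containerID string, newLimitBytes uint64) (bool, error) {
	usage, err := readMemoryUsage(containerID)
	if err != nil {
		return false, fmt.Errorf("reading memory usage: %w", err)
	}
	if usage >= newLimitBytes {
		return false, nil // block the resize: the new limit would be below current usage
	}
	return true, nil
}
```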

See https://github.com/kubernetes/enhancements/pull/5368 for further discussion.

/sig node /kind feature

cc @tallclair

natasha41575 avatar Dec 08 '25 20:12 natasha41575

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Dec 08 '25 20:12 k8s-ci-robot

Hi @natasha41575 , I'd like to work on this.

I see that the resize logic is handled in doPodResizeAction (in pkg/kubelet/kuberuntime/kuberuntime_container.go).

I can work on removing the best-effort memory check there to mitigate the TOCTOU race condition.

/assign

AbhinavPInamdar avatar Dec 10 '25 05:12 AbhinavPInamdar

Hi @natasha41575,

Regarding the approach to 'move the check into the runtime', I see two potential paths:

  1. Minimize the Race: Move the memory validation in Kubelet to immediately before the CRI call (reducing the window to microseconds).

  2. Delegate Completely: Remove the Kubelet-side check entirely and rely on the container runtime (and underlying cgroups) to reject the update if memory is insufficient.

Is the goal to strictly delegate this responsibility to the runtime, or is Kubelet expected to retain some validation logic?

AbhinavPInamdar avatar Dec 10 '25 09:12 AbhinavPInamdar

@AbhinavPInamdar We'd want to move the check into the container runtime completely. There is more discussion here:

  • https://github.com/kubernetes/enhancements/pull/5368#discussion_r2136425441
  • https://github.com/kubernetes/enhancements/pull/5368#discussion_r2152916541
  • https://github.com/kubernetes/enhancements/pull/5368#issuecomment-2992831679

This will require a more thorough design (i.e. a KEP), to decide what the changes to the CRI look like, with agreement from containerd and crio maintainers.

natasha41575 avatar Dec 10 '25 17:12 natasha41575

I understand that a KEP is required to formalize the CRI changes and coordinate with containerd/CRI-O maintainers.

My initial technical plan involves:

Atomic Validation: Standardizing a requirement in the CRI for runtimes to check usage < limit immediately before the cgroup write (shrinking the race window from milliseconds to microseconds).

Typed Errors: Defining a specific gRPC error code (e.g., codes.FailedPrecondition) for this failure mode, allowing Kubelet to distinguish usage violations from generic runtime errors and set the PodResizeInfeasible condition correctly.

Leveraging Existing Infrastructure: The OCI cgroups library already has MemoryCheckBeforeUpdate functionality; we just need to standardize its use in the CRI and wire up proper error handling.
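To make the idea concrete, a rough sketch of what the runtime-side check and error contract could look like, assuming cgroup v2 and standard gRPC status codes (the actual CRI shape and field names would be settled in the KEP):

```go
package runtimesketch

import (
	"os"
	"strconv"
	"strings"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// currentUsage reads the cgroup v2 memory.current value for the container.
func currentUsage(cgroupPath string) (uint64, error) {
	raw, err := os.ReadFile(cgroupPath + "/memory.current")
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
}

// applyMemoryLimit re-checks usage as close to the cgroup write as possible,
// and surfaces a typed gRPC error the kubelet could map to a resize condition.
func applyMemoryLimit(cgroupPath string, newLimit uint64) error {
	usage, err := currentUsage(cgroupPath)
	if err != nil {
		return status.Errorf(codes.Internal, "reading memory.current: %v", err)
	}
	if usage >= newLimit {
		// Distinguishable from generic runtime failures on the kubelet side.
		return status.Errorf(codes.FailedPrecondition,
			"memory usage %d >= requested limit %d", usage, newLimit)
	}
	return os.WriteFile(cgroupPath+"/memory.max",
		[]byte(strconv.FormatUint(newLimit, 10)), 0o644)
}
```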

If you or @tallclair or other sig-node maintainers are willing to sponsor/shepherd this KEP, I can start with a detailed proposal that includes:

  • CRI specification changes
  • Reference implementation guidance for containerd/CRI-O
  • Kubelet error handling updates
  • Comprehensive test plan

I believe this addresses the core concern raised by @haircommander about moving validation closer to the kernel while building incrementally on the v1.34 implementation.

AbhinavPInamdar avatar Dec 13 '25 12:12 AbhinavPInamdar

An alternative to consider (for the KEP): phased shrinking.

e.g. reduce the memory limit in n steps (maximum 64), checking at each step that current usage is below the next step's limit.

On failure (usage is above the next step's limit), revert to the pre-resize limit and reject it.

The repeated checking and phased shrink could be in addition to the change where we delegate to a container runtime.
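A minimal sketch of the idea, written directly against cgroup v2 files with illustrative names; a real implementation would live in the runtime and need a proper revert/timeout policy:

```go
package phasedsketch

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func readUint(path string) (uint64, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
}

func writeUint(path string, v uint64) error {
	return os.WriteFile(path, []byte(strconv.FormatUint(v, 10)), 0o644)
}

// shrinkInSteps walks memory.max from oldLimit down to newLimit in at most
// maxSteps steps, checking memory.current before each step and reverting to
// the pre-resize limit if usage is above the next step's target.
func shrinkInSteps(cgroupPath string, oldLimit, newLimit uint64, maxSteps int) error {
	if maxSteps < 1 {
		maxSteps = 1
	}
	step := (oldLimit - newLimit) / uint64(maxSteps)
	if step == 0 {
		step = oldLimit - newLimit // fewer bytes to shrink than steps
	}
	for target := oldLimit; target > newLimit; {
		if target-newLimit <= step {
			target = newLimit
		} else {
			target -= step
		}
		usage, err := readUint(cgroupPath + "/memory.current")
		if err != nil {
			return err
		}
		if usage >= target {
			_ = writeUint(cgroupPath+"/memory.max", oldLimit) // revert to pre-resize limit
			return fmt.Errorf("usage %d is above next target %d; rejecting resize", usage, target)
		}
		if err := writeUint(cgroupPath+"/memory.max", target); err != nil {
			return err
		}
	}
	return nil
}
```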


Bear in mind that some runtimes may not be able to guarantee low-latency checks on current memory use, and that we shouldn't assume that Pods being resized are running Linux.

lmktfy avatar Dec 13 '25 21:12 lmktfy

I agree that phased shrinking is a valuable strategy for inducing memory reclaim gracefully. It would definitely improve the safety profile, especially on non-Linux platforms where atomic checks might not be feasible.

However, as you noted, this would be additive. My concern is that without a fundamental atomic check at the runtime level, phased shrinking still leaves a race window at the final step of the reduction. If the workload spikes during that last increment, we still hit the TOCTOU issue.

I plan to scope this KEP to establishing the FailedPrecondition error contract and the atomic check mechanism first. I can add 'Phased Shrinking' to the 'Future Work' or 'Alternatives' section of the KEP as a complementary safety layer. Does that scoping make sense?

AbhinavPInamdar avatar Dec 14 '25 08:12 AbhinavPInamdar

That's exactly what I'd hoped to suggest: something to list as one of the alternatives that we had considered.

Writing down what we thought about, but ruled out, is useful and a helpful part of enhancement design.

lmktfy avatar Dec 14 '25 09:12 lmktfy

Thanks @lmktfy. I will ensure the alternatives section of the KEP explicitly details the phased shrinking approach and the trade-offs that led us to prioritize the atomic runtime check.

I'll get to work on the KEP draft now and share a Google Doc link here soon for early feedback.

AbhinavPInamdar avatar Dec 14 '25 17:12 AbhinavPInamdar

@AbhinavPInamdar please correct if I'm mistaken, but checking your github history you seem very new to the kubernetes community. This issue is a pretty involved and advanced one (one that I was considering working on myself in the next couple of releases), and I think it could be a better experience and perhaps more helpful for you to start with something smaller and more accessible for new contributors. These are typically marked with "good first issue" and/or "help wanted" labels (see https://github.com/kubernetes/kubernetes/issues?q=state%3Aopen%20label%3A%22good%20first%20issue%22 for example). WDYT?

/unassign @AbhinavPInamdar /assign

natasha41575 avatar Dec 15 '25 14:12 natasha41575

Hi @natasha41575,

You're right, I am new to the Kubernetes org, and I respect your decision to prioritize stability on such a complex issue. I definitely don't want to become a bottleneck for the release.

That said, I've spent the last few days analyzing the TOCTOU race mechanism and drafting a KEP structure. I would also be very keen to shadow this effort, perhaps by handling specific tasks like test coverage, documentation, or the kubelet-side error mapping under your guidance? It would be a great way for me to learn the ropes on a substantial issue without taking it on alone.

AbhinavPInamdar avatar Dec 15 '25 14:12 AbhinavPInamdar

I'm happy to move this check closer to the runtime, although I wonder if we'd want to plumb it all the way down into runc in that case?

When we were originally exploring this, there were some questions about whether we could leverage memory.high to eliminate the race. Basically:

  1. Check usage, if usage > new limit, abort.
  2. Set memory.high = new limit.
  3. Check usage again:
    • usage >= new limit: revert memory.high && abort
    • usage < new limit: set memory.max, then revert memory.high

Setting memory.high will throttle allocation if it's over the limit, but if we revert it quickly that's probably preferable to an OOMkill. However, I remember reading somewhere that memory.high and memory.max are not intended to be used together like this, so more investigation is needed.
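For concreteness, the sequence above as a cgroup v2 sketch (illustrative only, and subject to the open question about combining memory.high with memory.max; all names are placeholders):

```go
package highsketch

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func readUint(path string) (uint64, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
}

func write(path, val string) error {
	return os.WriteFile(path, []byte(val), 0o644)
}

// decreaseLimitWithHigh follows the three steps above: check, throttle via
// memory.high, re-check, then either commit memory.max or revert and abort.
func decreaseLimitWithHigh(cgroupPath string, newLimit uint64) error {
	usage, err := readUint(cgroupPath + "/memory.current")
	if err != nil {
		return err
	}
	if usage > newLimit {
		return fmt.Errorf("usage %d already above new limit %d; aborting", usage, newLimit)
	}
	limit := strconv.FormatUint(newLimit, 10)
	// Step 2: throttle allocations above the new limit instead of OOM-killing.
	if err := write(cgroupPath+"/memory.high", limit); err != nil {
		return err
	}
	// Step 3: re-check with the throttle in place.
	usage, err = readUint(cgroupPath + "/memory.current")
	if err != nil || usage >= newLimit {
		_ = write(cgroupPath+"/memory.high", "max") // revert the throttle
		if err != nil {
			return err
		}
		return fmt.Errorf("usage %d reached new limit %d; aborting", usage, newLimit)
	}
	if err := write(cgroupPath+"/memory.max", limit); err != nil {
		return err
	}
	return write(cgroupPath+"/memory.high", "max") // revert memory.high
}
```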


Regarding phased shrinking, that's the road we'll need to take if we want to induce memory pressure to trigger reclaim (setting memory.reclaim first). This can be used to "squeeze" an application's usage down to the new limit. The incremental shrinking is important to avoid long kernel pauses and timeouts (which can trigger an OOM kill).
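For reference, that "squeeze" step amounts to writing a byte count to cgroup v2's memory.reclaim before stepping the limit down; a minimal illustrative helper (name and error policy are placeholders):

```go
package reclaimsketch

import (
	"os"
	"strconv"
)

// requestReclaim asks the kernel to proactively reclaim up to `bytes` from
// the cgroup by writing to memory.reclaim; the write returns an error if the
// kernel could not reclaim the requested amount.
func requestReclaim(cgroupPath string, bytes uint64) error {
	return os.WriteFile(cgroupPath+"/memory.reclaim",
		[]byte(strconv.FormatUint(bytes, 10)), 0o644)
}
```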

However, I think memory.current, which we read for usage, includes reclaimable memory. So, I don't see an advantage to using incremental shrinking above the memory.current value?

/cc @roycaihw

tallclair avatar Dec 17 '25 00:12 tallclair