Support in-place Pod vertical scaling in VPA
Hey everyone,
as I gather, the VPA currently cannot update pods without recreating them:
"Once restart free ("in-place") update of pod requests is available" (from the README)
and neither can the GKE vertical scaler:
"Due to Kubernetes limitations, the only way to modify the resource requests of a running Pod is to recreate the Pod" (from https://cloud.google.com/kubernetes-engine/docs/concepts/verticalpodautoscaler#vertical_pod_autoscaling_in_auto_mode)
Unfortunately, I was unable to learn from this what the specific limitation is (other than the mere absence of such a feature?), nor a timeline for this to appear in VPA, nor how to contribute to it if possible. Could you please outline what is missing in VPA for this to be implemented?
Best regards, Raffael
It would be nice to have more details on the status of this feature. I would guess it's a limitation in Kubernetes, or at a lower level like containerd or the kernel?
At the moment this is a Kubernetes limitation (the kernel and container runtimes already support resizing containers). Work is needed in the scheduler, the kubelet, and the core API, so it's a pretty cross-cutting problem. Also, a lot of systems have assumed for a long time that pod sizes are immutable, so those need to be untangled as well.
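(For context on "the kernel and container runtimes already support resizing": at the lowest level, a running container's limits are just cgroup files that can be rewritten. A minimal, purely illustrative Go sketch; the cgroup path is made up, and a real resize goes through the container runtime, not a direct file write.)

```go
// Illustration only: on cgroup v2, a running container's memory limit can be
// changed by rewriting memory.max. The path below is hypothetical; in a real
// cluster the kubelet/container runtime own the cgroup hierarchy.
package main

import (
	"fmt"
	"os"
)

func main() {
	// Hypothetical cgroup path for one container of one pod.
	cgroupFile := "/sys/fs/cgroup/kubepods.slice/example-pod/example-container/memory.max"
	newLimit := []byte("268435456") // 256Mi, in bytes

	if err := os.WriteFile(cgroupFile, newLimit, 0o644); err != nil {
		fmt.Println("resize failed:", err)
		return
	}
	fmt.Println("memory limit updated without restarting the container")
}
```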
There is ongoing work in Kubernetes to provide in-place pod resizes (example: https://github.com/kubernetes/enhancements/pull/1883). Once that work completes, VPA will be able to take advantage of it.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
https://github.com/kubernetes/kubernetes/pull/102884
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle rotten
/remove-lifecycle stale
/remove-lifecycle stale
/remove-lifecycle stale
Support for in-place updates didn't make it into K8s 1.25, but it is aiming for 1.26.
/remove-lifecycle stale
The feature didn't make it into 1.26, but is now targeted for 1.27 ;)
This issue seems to be a duplicate of https://github.com/kubernetes/autoscaler/issues/5046. Shouldn't we close one of those two issues?
https://github.com/kubernetes/kubernetes/pull/102884/ merged today. I will resume working on using that in VPA
@wangchen615 @voelzmo FYI
/retitle "Support in-place Pod vertical scaling in VPA"
argh, don't take the quotes too literally, dear bot! 🙈
/retitle Support in-place Pod vertical scaling in VPA
Also see https://github.com/kubernetes/kubernetes/issues/116214
/kind feature
(from https://github.com/kubernetes/autoscaler/issues/5046)
Describe the solution you'd like: In the VPA updater, whenever the existing logic decides to evict the pod, we should add a check on the pod spec to determine if the NoRestart policy is enabled. If so, a patch request should be sent via the updater directly to the pod, without evicting it.
We'll need to decide how that might play out if there's an app container set to RestartNotRequired for memory and a sidecar container set to Restart for memory, and no special config for CPU.
Imagine that a vertical pod autoscaler decides to assign less memory to the sidecar, and that triggers a restart for the app, which isn't what the developer intended. My imaginary developer was hoping that only the sidecar would get restarted when scaling down its memory request.
I think that any logic here needs to look at the container level, not just the pod.
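To make the container-level idea a bit more concrete, here's a rough Go sketch. It assumes the API shape from the in-place resize KEP (a per-container ResizePolicy list with NotRequired / RestartContainer per resource); the helper names and the `changed` map are made up for illustration and are not actual VPA code.

```go
// Sketch: decide per container whether applying a new recommendation for a
// given resource is declared to need a container restart. Assumes the
// v1 ContainerResizePolicy API from the in-place resize KEP; helper names
// are hypothetical.
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// resizeRequiresRestart returns true if resizing `resource` on `container`
// is declared to require a container restart.
func resizeRequiresRestart(container corev1.Container, resource corev1.ResourceName) bool {
	for _, p := range container.ResizePolicy {
		if p.ResourceName == resource {
			return p.RestartPolicy == corev1.RestartContainer
		}
	}
	// No explicit policy for this resource: the KEP's default is NotRequired.
	return false
}

// anyContainerRestarts checks only the containers (and resources) the updater
// actually wants to change, so resizing a restart-free sidecar isn't treated
// as disruptive just because some other container would need a restart.
func anyContainerRestarts(pod *corev1.Pod, changed map[string][]corev1.ResourceName) bool {
	for _, c := range pod.Spec.Containers {
		for _, res := range changed[c.Name] {
			if resizeRequiresRestart(c, res) {
				return true
			}
		}
	}
	return false
}
```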
I don't think the VPA should look at the ResizePolicy field in PodSpec.containers at all. If I understand the KEP correctly, it is only meant as a hint to the kubelet about what to do with the containers when an update to a container's resources needs to be applied.
I think VPA only needs to understand whether the InPlacePodVerticalScaling feature gate is enabled and whether VPA should make use of it. If the feature is enabled and should be used, send a patch; otherwise, evict as is currently done. So the updater probably needs a flag to turn in-place Pod vertical scaling on or off, defaulting to off.
As @jbartosik summarized, there is potentially also a need to configure this on the VPA level, as special workloads may opt for their own special treatment.
What happens then on a node is the kubelet's job: restart a container, don't restart it, defer the update, etc.
Does that make sense?
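For illustration, the flag-gated decision described above could be wired up roughly like this (the flag semantics, interface, and fallback behavior are assumptions, not the actual updater code):

```go
// Sketch of a flag-gated "patch instead of evict" decision in the updater.
// All names here are hypothetical.
package sketch

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
)

// podResizer abstracts the two ways of applying a recommendation: the
// existing eviction path and a (to-be-written) in-place patch path.
type podResizer interface {
	PatchResources(ctx context.Context, pod *corev1.Pod) error
	Evict(ctx context.Context, pod *corev1.Pod) error
}

// useInPlace would come from a new updater flag, defaulting to false, and is
// only meaningful when the cluster runs with the InPlacePodVerticalScaling
// feature gate enabled.
func applyRecommendation(ctx context.Context, r podResizer, pod *corev1.Pod, useInPlace bool) error {
	if useInPlace {
		err := r.PatchResources(ctx, pod)
		if err == nil {
			return nil
		}
		// If the in-place resize is rejected or never completes, fall back
		// to today's behavior: evict and let the admission-controller set
		// resources on the replacement pod.
		log.Printf("in-place resize of %s/%s failed: %v; falling back to eviction",
			pod.Namespace, pod.Name, err)
	}
	return r.Evict(ctx, pod)
}
```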
Adding this as context, so we don't forget about it when implementing the feature: if we don't want to change existing behavior with injected sidecars, we need to find a way to achieve something similar to what the admission-controller currently does to ignore injected sidecars when using in-place updates.
There are some open issues related to the feature: https://github.com/kubernetes/kubernetes/issues?q=is%3Aissue+is%3Aopen+%5BFG%3AInPlacePodVerticalScaling%5D
The most relevant seem to be:
- https://github.com/kubernetes/kubernetes/issues/114203 - until this is resolved, VPA needs to decide that an in-place update will not succeed after some time (maybe we need to do that even after this is resolved, since we're not evicting other pods to make space for the one we want to scale up)
- https://github.com/kubernetes/kubernetes/issues/112264 - we can't give up on in-place updates too quickly; even successful ones take at least a minute or so in my experience
- https://github.com/kubernetes/kubernetes/issues/109553 - with this we can patch only the subresource
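For illustration, the "send a patch" side could look roughly like this with client-go. The payload shape, and whether it must go through a dedicated pod subresource, depend on how the issues above get resolved, so treat this as a sketch rather than a settled VPA code path.

```go
// Sketch: patching a container's resource requests in place via client-go.
// The patch payload and the (optional) subresource are assumptions based on
// the KEP and the open issues linked above, not a verified VPA code path.
package sketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

func patchPodResources(ctx context.Context, c kubernetes.Interface, namespace, pod, container, cpu, memory string) error {
	// Strategic merge patch: containers are merged by name, so only the
	// named container's requests are changed.
	patch := []byte(fmt.Sprintf(`{
	  "spec": {
	    "containers": [
	      {"name": %q, "resources": {"requests": {"cpu": %q, "memory": %q}}}
	    ]
	  }
	}`, container, cpu, memory))

	// If resizes end up being exposed only via a dedicated subresource
	// (see kubernetes#109553), the call would need to name it here as an
	// extra argument; without it this patches the pod spec directly.
	_, err := c.CoreV1().Pods(namespace).Patch(ctx, pod, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}
```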
I don't think the VPA should look at the ResizePolicy field in PodSpec.containers at all.
The API currently is limited and does not support the notion of "apply changes if possible without a restart, and do not apply otherwise". This may impact PDB. I don't know how the autoscaler deals with PDB today, but if there will be higher-frequency autoscaling with in-place updates hoping for a non-disruptive change, this will not work. In other words, we either need a new API to resize ONLY without the restart, or we treat a resize as a disruption affecting PDB.
@SergeyKanzhelev thanks for joining the discussion!
I don't know how [vertical pod] autoscaler deals with PDB today
Today, VPA uses the eviction API, which respects PDB.
we either need a new API to resize ONLY without the restart or treat a resize as a disruption affecting PDB.
I'm not sure which component the "we" in this sentence refers to, but in general I tend to agree with the need for an API that respects PDB. If the kubelet needs to restart the Pod to apply a resource change, this should count towards the PDB. However, I think this shouldn't be a concern that VPA has to deal with. Similarly to eviction, VPA should just use an API that respects PDB, if we consider this relevant for the restart case as well.
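(For readers less familiar with that part of VPA: "uses the eviction API" means the updater creates Eviction objects rather than deleting pods directly, and the API server rejects those when a PodDisruptionBudget would be violated. A simplified sketch, not the actual updater code:)

```go
// Simplified sketch of eviction via the Eviction API, which the API server
// only admits if the pod's PodDisruptionBudget allows the disruption.
package sketch

import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func evictPod(ctx context.Context, c kubernetes.Interface, namespace, name string) error {
	eviction := &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
	}
	// The API server returns 429 Too Many Requests if this eviction would
	// violate a PDB, so the caller can retry later.
	return c.CoreV1().Pods(namespace).EvictV1(ctx, eviction)
}
```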
Regarding my statement from above
I don't think the VPA should look at the ResizePolicy field in PodSpec.containers at all.
This is no longer correct, as @jbartosik opted for a more informed approach in the enhancement proposal. Currently, VPA implements some constraints to ensure resource updates don't happen too frequently (for example, by requiring a minimum absolute/relative change for Pods which have been running for less than 12 hours). The proposal contains the idea to change these constraints if a Container has ResizePolicy: NotRequired.
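A hypothetical sketch of what "change these constraints" could mean in practice; the threshold values and helper are invented for illustration and are not part of the proposal:

```go
// Sketch: pick a smaller minimum relative change for resources whose resize
// is declared restart-free (NotRequired), since applying it is cheap.
// Threshold values are made-up examples.
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

func minRelativeChange(container corev1.Container, resource corev1.ResourceName) float64 {
	for _, p := range container.ResizePolicy {
		if p.ResourceName == resource && p.RestartPolicy == corev1.RestartContainer {
			// Disruptive resize: only act on a fairly large difference.
			return 0.10
		}
	}
	// Restart-free resize (the KEP default): even small corrections are worth applying.
	return 0.02
}
```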
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/lifecycle frozen
Hi folks, could someone share a summary of what is blocking this feature please? +1 that this would be really useful to reduce workload evictions. Thank you!