
Support in-place Pod vertical scaling in VPA

Open noBlubb opened this issue 4 years ago • 34 comments

Hey everyone,

as I gather, the VPA currently cannot update pods without recreating them:

"Once restart free ("in-place") update of pod requests is available" (from the README)

and neither can the GKE vertical scaler:

"Due to Kubernetes limitations, the only way to modify the resource requests of a running Pod is to recreate the Pod" (from https://cloud.google.com/kubernetes-engine/docs/concepts/verticalpodautoscaler#vertical_pod_autoscaling_in_auto_mode)

Unfortunately, I was unable to learn the specific limitation from this (other than the mere absence of any such feature?), nor a timeline for this to appear in VPA, or how to contribute to this if possible. Could you please outline what is missing in VPA for this to be implemented?

Best regards, Raffael

noBlubb avatar Apr 15 '21 12:04 noBlubb

It would be nice to have more details on the status of this feature. I would guess it's a limitation in Kubernetes, or at a lower level like containerd or the kernel?

morganchristiansson avatar Apr 28 '21 08:04 morganchristiansson

At the moment this is a Kubernetes limitation (the kernel and container runtimes already support resizing containers). There is work needed in the scheduler, kubelet, and core API, so it's a pretty cross-cutting problem. Also, a lot of systems have long assumed that pod sizes are immutable, so those assumptions need to be untangled as well.

There is ongoing work in Kubernetes to provide in-place pod resizes (example: https://github.com/kubernetes/enhancements/pull/1883). Once that work completes, VPA will be able to take advantage of it.

bskiba avatar Apr 28 '21 08:04 bskiba

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jul 27 '21 11:07 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Aug 26 '21 12:08 k8s-triage-robot

/remove-lifecycle rotten

Jeffwan avatar Aug 27 '21 05:08 Jeffwan

https://github.com/kubernetes/kubernetes/pull/102884

jmo-qap avatar Sep 15 '21 05:09 jmo-qap

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 27 '22 10:02 k8s-triage-robot

/remove-lifecycle rotten

jbartosik avatar Feb 28 '22 10:02 jbartosik

/remove-lifecycle stale

jbartosik avatar Feb 28 '22 10:02 jbartosik

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar May 29 '22 11:05 k8s-triage-robot

/remove-lifecycle stale

jbartosik avatar May 31 '22 13:05 jbartosik

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 29 '22 13:08 k8s-triage-robot

/remove-lifecycle stale

Support for in-place updates didn't make it into K8s 1.25, but it is aiming for 1.26.

jbartosik avatar Sep 02 '22 11:09 jbartosik

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 01 '22 12:12 k8s-triage-robot

/remove-lifecycle stale

The feature didn't make it into 1.26, but is now targeted for 1.27 ;)

voelzmo avatar Dec 02 '22 09:12 voelzmo

This issue seems to be a duplicate of https://github.com/kubernetes/autoscaler/issues/5046. Shouldn't we close one of these two issues?

frivoire avatar Jan 05 '23 11:01 frivoire

https://github.com/kubernetes/kubernetes/pull/102884/ merged today. I will resume working on using that in VPA.

@wangchen615 @voelzmo FYI

jbartosik avatar Feb 28 '23 10:02 jbartosik

/retitle "Support in-place Pod vertical scaling in VPA"

voelzmo avatar Mar 02 '23 11:03 voelzmo

Argh, don't take the quotes too literally, dear bot! 🙈

/retitle Support in-place Pod vertical scaling in VPA

voelzmo avatar Mar 02 '23 11:03 voelzmo

Also see https://github.com/kubernetes/kubernetes/issues/116214

sftim avatar Mar 02 '23 18:03 sftim

/kind feature

sftim avatar Mar 02 '23 18:03 sftim

(from https://github.com/kubernetes/autoscaler/issues/5046)

Describe the solution you'd like: In the VPA updater, whenever the existing logic decides to evict the pod, we should add a check on the pod spec to determine whether the NoRestart policy is enabled. If so, a patch request should be sent by the updater directly to the pod without evicting it.

We'll need to decide how that might play out if there's an app container set to RestartNotRequired for memory and a sidecar container set to Restart for memory, and no special config for CPU.

Imagine that a vertical pod autoscaler decides to assign less memory to the sidecar, and that triggers a restart for the app - which isn't what the developer intended. My imaginary developer was hoping that only the sidecar would get restarted when scaling down its memory request.

I think that any logic here needs to look at the container level, not just the pod.
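
For illustration, a container-level check could look roughly like the sketch below. This is not VPA code: resourcesToChange is a hypothetical input describing which resources a recommendation would touch per container, and the policy names follow the Go API that eventually merged (corev1.NotRequired / corev1.RestartContainer), which differ slightly from the draft names used in the comment above.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// resizeRequiresRestart reports whether applying the recommended changes
// would restart at least one container, by inspecting each container's
// resizePolicy individually rather than treating the pod as a single unit.
// resourcesToChange (hypothetical) maps container name -> resources the
// recommendation would modify for that container.
func resizeRequiresRestart(pod *corev1.Pod, resourcesToChange map[string][]corev1.ResourceName) bool {
	for _, c := range pod.Spec.Containers {
		for _, res := range resourcesToChange[c.Name] {
			// When no policy is specified for a resource, the default
			// restart policy is NotRequired.
			policy := corev1.NotRequired
			for _, rp := range c.ResizePolicy {
				if rp.ResourceName == res {
					policy = rp.RestartPolicy
				}
			}
			if policy == corev1.RestartContainer {
				return true // resizing this resource restarts this container
			}
		}
	}
	return false
}
```

Checking per container and per resource is what would let an updater tell a restart-free recommendation apart from a disruptive one.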

sftim avatar Mar 02 '23 18:03 sftim

I don't think the VPA should look at the ResizePolicy field in PodSpec.containers at all. If I understand the KEP correctly, it is only meant as a hint to the kubelet about what to do with the containers when an update to a container's resources needs to be applied.

I think VPA only needs to understand whether the InPlacePodVerticalScaling feature gate is enabled and whether VPA should make use of it. If the feature is enabled and should be used, send a patch – otherwise evict as is currently done. So the updater probably needs a flag to turn in-place Pod vertical scaling on or off, defaulting to off. As @jbartosik summarized, there is potentially also a need to configure this at the VPA level, as special workloads may opt for their own special treatment.

What happens then on a node is kubelet's job: restart a container, don't restart it, defer the update, etc... Does that make sense?
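
A minimal sketch of that patch-or-evict split might look like the following. The in-place flag is hypothetical, and the in-place path uses a plain strategic merge patch against the pod spec as a stand-in, since the subresource question was still open at this point (see the issue list further down in the thread).

```go
package sketch

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// applyRecommendation either evicts the pod (current behaviour, letting the
// admission-controller mutate the recreated pod) or, when in-place updates
// are enabled, patches one container's resource requests directly.
func applyRecommendation(ctx context.Context, client kubernetes.Interface,
	pod *corev1.Pod, container string, newRequests corev1.ResourceList, inPlaceEnabled bool) error {

	if !inPlaceEnabled {
		// The Eviction API respects PodDisruptionBudgets.
		eviction := &policyv1.Eviction{
			ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
		}
		return client.PolicyV1().Evictions(pod.Namespace).Evict(ctx, eviction)
	}

	// In-place path: strategic merge patch of the named container's requests.
	patch := fmt.Sprintf(
		`{"spec":{"containers":[{"name":%q,"resources":{"requests":{"cpu":%q,"memory":%q}}}]}}`,
		container, newRequests.Cpu().String(), newRequests.Memory().String(),
	)
	_, err := client.CoreV1().Pods(pod.Namespace).Patch(
		ctx, pod.Name, types.StrategicMergePatchType, []byte(patch), metav1.PatchOptions{})
	return err
}
```

Whether the resize then happens with or without a container restart is left entirely to the kubelet and the containers' resize policies, as described above.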

voelzmo avatar Mar 03 '23 10:03 voelzmo

Adding this as context so we don't forget about it when implementing the feature: if we don't want to change existing behavior with injected sidecars, we need to find a way to achieve something similar to what the admission-controller currently does to ignore injected sidecars when using in-place updates.

voelzmo avatar May 09 '23 08:05 voelzmo

There are some open issues related to the feature: https://github.com/kubernetes/kubernetes/issues?q=is%3Aissue+is%3Aopen+%5BFG%3AInPlacePodVerticalScaling%5D

The most relevant seem to be:

  • https://github.com/kubernetes/kubernetes/issues/114203 - until this is resolved, VPA needs to decide after some time that an in-place update will not succeed (maybe we need to do that even after this is resolved, since we're not evicting other pods to make space for the one we want to scale up); a rough sketch of such a fallback follows below
  • https://github.com/kubernetes/kubernetes/issues/112264 - we can't give up on in-place updates too quickly; even successful ones take at least a minute or so in my experience
  • https://github.com/kubernetes/kubernetes/issues/109553 - with this we can patch only the subresource
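
To make the first point concrete, a timeout-based fallback could look roughly like this sketch. The deadline is an arbitrary assumption rather than anything from the KEP, and resizeInProgress / evict are hypothetical hooks supplied by the caller, not real VPA functions.

```go
package sketch

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// inPlaceResizeDeadline is an assumed cut-off after which the updater stops
// waiting for the node to apply an in-place resize and falls back to eviction.
const inPlaceResizeDeadline = 5 * time.Minute

// reconcileResize keeps waiting while a resize is pending and the deadline
// has not passed; otherwise it falls back to the existing eviction path.
// resizeInProgress and evict are hypothetical hooks, not real VPA functions.
func reconcileResize(pod *corev1.Pod, startedAt time.Time,
	resizeInProgress func(*corev1.Pod) bool, evict func(*corev1.Pod) error) error {

	if !resizeInProgress(pod) {
		return nil // the resize completed (or was never attempted)
	}
	if time.Since(startedAt) < inPlaceResizeDeadline {
		return nil // keep waiting; even successful resizes are not instant
	}
	// The node could not (or did not) apply the resize in time, for example
	// because it lacks free capacity; fall back to evict-and-recreate.
	return evict(pod)
}
```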

jbartosik avatar Jun 21 '23 11:06 jbartosik

I don't think the VPA should look at the ResizePolicy field in PodSpec.containers at all.

The API currently is limited and does not support the notion of "apply changes if possible without a restart, and do not apply otherwise", which may impact PDBs. I don't know how the autoscaler deals with PDBs today, but if there will be higher-frequency autoscaling with in-place updates hoping for a non-disruptive change, this will not work. In other words, we either need a new API to resize ONLY without the restart, or we treat a resize as a disruption affecting the PDB.

SergeyKanzhelev avatar Jul 25 '23 22:07 SergeyKanzhelev

@SergeyKanzhelev thanks for joining the discussion!

I don't know how [vertical pod] autoscaler deals with PDB today

Today, VPA uses the eviction API, which respects PDB.

we either need a new API to resize ONLY without the restart or treat a resize as a disruption affecting PDB.

I'm not sure which component the "we" in this sentence refers to, but in general I tend to agree with the need for an API that respects PDB. If the kubelet needs to restart the Pod to apply a resource change, this should count towards the PDB. However, I think this shouldn't be a concern that VPA has to deal with. Similar to eviction, VPA should just be using an API that respects PDB, if we consider this relevant for the restart case as well.

Regarding my statement from above

I don't think the VPA should look at the ResizePolicy field in PodSpec.containers at all.

This is no longer correct, as @jbartosik opted for a more informed approach in the enhancement proposal. Currently, VPA implements some constraints to ensure resource updates don't happen too frequently (for example, by requiring a minimum absolute/relative change for Pods which have been running for less than 12 hours). The proposal contains the idea of changing these constraints if a Container has ResizePolicy: NotRequired.
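
Purely as an illustration of that idea: the 10% / 12 hour values below mirror the constraints mentioned above but are hard-coded assumptions here, not actual VPA configuration, and restartFree would come from a container-level check like the one sketched earlier in the thread.

```go
package sketch

import "time"

// updateAllowed shows how the "don't resize young pods for marginal gains"
// guard could be relaxed when the resize is restart-free.
func updateAllowed(podAge time.Duration, relativeChange float64, restartFree bool) bool {
	const (
		significantChange = 0.10           // minimum relative change required for young pods
		youngPodLifetime  = 12 * time.Hour // pods younger than this get the stricter check
	)
	if restartFree {
		// A NotRequired resize is non-disruptive, so smaller and more
		// frequent adjustments may be acceptable.
		return true
	}
	if podAge < youngPodLifetime && relativeChange < significantChange {
		return false
	}
	return true
}
```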

voelzmo avatar Jul 26 '23 11:07 voelzmo

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 25 '24 07:01 k8s-triage-robot

/remove-lifecycle stale
/lifecycle frozen

jbartosik avatar Jan 25 '24 08:01 jbartosik

Hi folks, could someone share a summary of what is blocking this feature please? +1 that this would be really useful to reduce workload evictions. Thank you!

nikimanoledaki avatar Sep 27 '24 12:09 nikimanoledaki