
VPA: use rollout restart instead of eviction

akloss-cibo opened this issue 1 year ago · 3 comments

Which component are you using?:

vertical-pod-autoscaler

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

The VPA cannot scale a Deployment with certain PodDisruptionBudget configurations. A trivial example is a Deployment with a single replica and a PodDisruptionBudget that doesn't allow any disruption. The vpa-updater tries to evict the Deployment's single Pod but fails, because the eviction would violate the PodDisruptionBudget. However, a rollout restart of the Deployment is possible and would allow a resource configured like this to scale automatically.
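
As an illustration, a minimal sketch of the kind of configuration described above (the names the_deployment and the_pdb and the nginx image are placeholders, not taken from the actual workload): a single-replica Deployment plus a PDB whose minAvailable equals the replica count, which leaves zero allowed disruptions and therefore blocks every eviction.

# Hypothetical reproduction: one replica, and a PDB that permits no
# voluntary disruptions, so the vpa-updater's eviction always fails.
kubectl create deployment the_deployment --image=nginx --replicas=1
kubectl apply -f - <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: the_pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: the_deployment
EOF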

Describe the solution you'd like.:

Use rollout restart for deployments when a PDB is preventing an eviction.

Describe any alternative solutions you've considered.:

Additional context.:

akloss-cibo avatar May 29 '24 15:05 akloss-cibo

/area vertical-pod-autoscaler

adrianmoisey avatar May 30 '24 09:05 adrianmoisey

The VPA cannot scale a Deployment with certain PodDisruptionBudget configurations. A trivial example is a Deployment with a single replica and a PodDisruptionBudget that doesn't allow any disruption. The vpa-updater tries to evict the Deployment's single Pod but fails, because the eviction would violate the PodDisruptionBudget.

@akloss-cibo, this is a misconfigured PDB. I assume that you might have hit the following K8s bug: https://github.com/kubernetes/kubernetes/issues/72320. For a long time there was a bug in K8s that did not allow eviction of unhealthy Pods; see https://kubernetes.io/docs/tasks/run-application/configure-pdb/#unhealthy-pod-eviction-policy. So you might be hitting this issue. A fix is to set .spec.unhealthyPodEvictionPolicy=AlwaysAllow if this is appropriate for your workload. See https://github.com/kubernetes/enhancements/blob/master/keps/sig-apps/3017-pod-healthy-policy-for-pdb/README.md to understand what the issue was and how it is fixed by .spec.unhealthyPodEvictionPolicy=AlwaysAllow.
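
As an illustration, assuming a cluster version where the unhealthyPodEvictionPolicy field is available and a PDB named the_pdb (a placeholder name), the fix could be applied with a patch along these lines:

# Allow eviction of unhealthy Pods even when the PDB budget is exhausted.
kubectl patch pdb the_pdb --type merge \
  -p '{"spec":{"unhealthyPodEvictionPolicy":"AlwaysAllow"}}'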


The PDB motivation by itself is not enough for such a feature request. However, the feature request seems interesting, and I assume some of the VPA maintainers can shed more light on it - I believe it was a design decision to use the eviction + mutation approach over modifying the higher-level controller (Deployment, StatefulSet, ...) resources.

ialidzhikov avatar Jun 27 '24 08:06 ialidzhikov

No, not hitting https://github.com/kubernetes/kubernetes/issues/72320

https://github.com/prometheus-operator/prometheus-operator/issues/6291 seems to imply that the VPA does understand the controller relationship.

Another challenge we have with the VPA, even when it is working as designed, is that you lose capacity right before the scale-up happens. The worst case: you have two replicas and the VPA decides to scale up and evicts one; now you are at 50% of your pre-scale-up capacity until the new, larger Pod shows up.

We've been experimenting with the following implementation of this feature, with some success:

last_recommendation="none"
while true; do
  # Current "target" recommendation for the first (and only) container.
  recommendation="$( kubectl get vpa the_vpa -o go-template \
    --template='{{index (index .status.recommendation.containerRecommendations 0) "target"}}' )"
  if test "$recommendation" != "$last_recommendation" ; then
    printf "Changing recommendation from %s to %s\n" "$last_recommendation" "$recommendation"
    kubectl rollout restart deployment the_deployment
    kubectl rollout status deployment the_deployment
    last_recommendation="$recommendation"
  fi
  sleep 60
done

Even this very naive implementation seems to work pretty well. It is naive because it polls, checks only the first container's recommendation, and assumes a Deployment rather than one of the other possible controllers, but it illustrates the idea.
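
A sketch of one way to make it slightly less naive, still polling and still using the same placeholder names the_vpa and the_deployment: compare the target recommendations of all containers rather than only the first one. It assumes the VPA already has a recommendation in .status.

last_recommendation="none"
while true; do
  # One line per container: "<name>=<cpu>/<memory>"; any change triggers a restart.
  recommendation="$( kubectl get vpa the_vpa -o jsonpath='{range .status.recommendation.containerRecommendations[*]}{.containerName}={.target.cpu}/{.target.memory}{"\n"}{end}' )"
  if test "$recommendation" != "$last_recommendation" ; then
    printf "Changing recommendation from %s to %s\n" "$last_recommendation" "$recommendation"
    kubectl rollout restart deployment the_deployment
    kubectl rollout status deployment the_deployment
    last_recommendation="$recommendation"
  fi
  sleep 60
done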

akloss-cibo avatar Jun 27 '24 12:06 akloss-cibo

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Sep 25 '24 13:09 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Oct 25 '24 14:10 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Nov 24 '24 14:11 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Nov 24 '24 14:11 k8s-ci-robot