
Feature Request: Cluster Autoscaler - scale up Deployment if PDB doesn't meet the scale down/drain requirements

Open AshMartian opened this issue 6 years ago • 48 comments

I'm looking over the cluster-autoscaler source to find where this could be added, not sure I have the go chops to submit a PR, but wanted to share this as a feature request.

I have a very RAM-heavy microservice Java app (20 GB at no load); in prod we'll run two of every service for HA. However, in our dev/test/stage environments it would be too expensive to run more than one of each service.

I have each microservice in a deployment, with a Horizontal Pod Autoscaler to handle usage spikes / performance testing.

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: service-autoscaler
  namespace: app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: service
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 700

During the initial rollout in the dev environment, the cluster autoscaler caused occasional significant downtime as it rebalanced the cluster. These Java microservices sometimes take 4 minutes to become ready. To avoid this, I applied a Pod Disruption Budget to each service.

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: service-budget
  namespace: app
spec:
  minAvailable: 1
  selector:
    matchLabels:
      run: service

So far so good. However since the Deployments have only 1 replica, this permanently blocks the cluster autoscaler from scaling down.

There is a Kubernetes issue about kubectl drain in this scenario, but it was closed due to inactivity. It seems this use of a PDB wasn't recommended because a single replica isn't HA. The argument was that Kubernetes should make even a single replica as available as possible, where downtime should be avoided at all costs even without a PDB.

The feature would work as follows:

  • Check for a PDB
  • If a PDB is blocking a drain, scale the deployment/pod up until it meets the PDB criteria, ensuring the new pods are scheduled on ideal nodes.
  • Wait for new Pods to become ready
  • Drain the node

Maybe this could result in a new command-line argument that forces this behavior on all Deployments with 1 replica, without the need for a PDB.
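
To make the steps above a bit more concrete, here is a rough sketch of what that flow could look like against the Deployment scale subresource. This is not existing cluster-autoscaler code; the helper names (scaleUpThenDrain, drainNode) and the timeout are made up:

// Rough sketch only: scale the Deployment up by one via the scale
// subresource, wait for the extra replica to become ready, then run the
// existing drain. Helper names (scaleUpThenDrain, drainNode) are hypothetical.
package sketch

import (
	"context"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

func scaleUpThenDrain(ctx context.Context, cs kubernetes.Interface, dep *appsv1.Deployment, drainNode func() error) error {
	deployments := cs.AppsV1().Deployments(dep.Namespace)

	// Bump the desired replica count by one through the scale subresource.
	scale, err := deployments.GetScale(ctx, dep.Name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	scale.Spec.Replicas++
	if _, err := deployments.UpdateScale(ctx, dep.Name, scale, metav1.UpdateOptions{}); err != nil {
		return err
	}

	// Wait until the Deployment reports the extra replica as ready.
	err = wait.PollUntilContextTimeout(ctx, 5*time.Second, 5*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			d, err := deployments.Get(ctx, dep.Name, metav1.GetOptions{})
			if err != nil {
				return false, err
			}
			return d.Status.ReadyReplicas >= scale.Spec.Replicas, nil
		})
	if err != nil {
		return err
	}

	// Only now evict the old pod / drain the node.
	return drainNode()
}

The real drain logic would be passed in as drainNode, so the sketch only covers the scale-up-and-wait part of the proposal.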

While searching for a place to implement this, I found that cluster-autoscaler/simulator/drain.go could be a starting point. From there I'm not sure how this could be implemented, but if there's any guidance on where I could begin testing an implementation, I'll gladly attempt to figure it out.

One issue I see right off the bat is determining whether the pod is attached to a Deployment, and whether there is a Horizontal Pod Autoscaler to read from to determine if scaling up is acceptable.
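
For what it's worth, that lookup itself seems doable with plain client-go; a purely illustrative sketch (the helper name findDeploymentAndHPA and its error handling are made up):

// Purely illustrative: resolve the Deployment that owns a pod via its
// ReplicaSet owner reference, then look for an HPA whose scaleTargetRef
// points at that Deployment.
package sketch

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	autoscalingv1 "k8s.io/api/autoscaling/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func findDeploymentAndHPA(ctx context.Context, cs kubernetes.Interface, pod *corev1.Pod) (*appsv1.Deployment, *autoscalingv1.HorizontalPodAutoscaler, error) {
	// Pods created by a Deployment are controlled by a ReplicaSet.
	rsRef := metav1.GetControllerOf(pod)
	if rsRef == nil || rsRef.Kind != "ReplicaSet" {
		return nil, nil, fmt.Errorf("pod %s is not controlled by a ReplicaSet", pod.Name)
	}
	rs, err := cs.AppsV1().ReplicaSets(pod.Namespace).Get(ctx, rsRef.Name, metav1.GetOptions{})
	if err != nil {
		return nil, nil, err
	}

	// The ReplicaSet in turn is controlled by the Deployment.
	depRef := metav1.GetControllerOf(rs)
	if depRef == nil || depRef.Kind != "Deployment" {
		return nil, nil, fmt.Errorf("replicaset %s is not controlled by a Deployment", rs.Name)
	}
	dep, err := cs.AppsV1().Deployments(pod.Namespace).Get(ctx, depRef.Name, metav1.GetOptions{})
	if err != nil {
		return nil, nil, err
	}

	// Look for an HPA targeting this Deployment, to know whether a scale-up
	// would stay within its maxReplicas.
	hpas, err := cs.AutoscalingV1().HorizontalPodAutoscalers(pod.Namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return dep, nil, err
	}
	for i := range hpas.Items {
		ref := hpas.Items[i].Spec.ScaleTargetRef
		if ref.Kind == "Deployment" && ref.Name == dep.Name {
			return dep, &hpas.Items[i], nil
		}
	}
	return dep, nil, nil // Deployment found, but no HPA targets it.
}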

AshMartian avatar Oct 04 '18 18:10 AshMartian

If there is an HPA scaling the deployment in question, I think this would mean that the Cluster Autoscaler and the Horizontal Pod Autoscaler would have to communicate, as otherwise the HPA could scale back down the deployment that CA had just scaled up in order to drain the node without violating the PDB. @MaciekPytel @aleksandra-malinowska do you think there is another way to accommodate this use case?

bskiba avatar Oct 08 '18 11:10 bskiba

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Jan 06 '19 12:01 fejta-bot

/remove-lifecycle stale

Can we keep it fresh please?

alonl avatar Jan 07 '19 19:01 alonl

Two thoughts:

  1. If any downtime of the application, even a couple of minutes, is unacceptable, running only a single replica of it is very risky. I'd suggest reconsidering this setup before it fails.

  2. Supposing this behavior could be useful for improving the availability of a best-effort application, I wonder if it could be incorporated into eviction. Once an eviction object is created while the PDB is satisfied, the owner controller could start a new replica while the old one is still in its graceful termination period. That way, the problem would be solved regardless of HPA, and kubectl drain would benefit from it as well.

aleksandra-malinowska avatar Jan 07 '19 19:01 aleksandra-malinowska

I think this is most useful in a non-prod environment. As @blandman noted, his use case for wanting this feature is a dev environment where it might be too costly to run multiple replicas of an app, coupled with the need for autoscaling the cluster.

Our use case is similar. We have a non-prod kubernetes cluster in AWS, complete with the cluster autoscaler. We want the cluster autoscaler to be able to aggressively scale in nodes, but we need it to do so with as little downtime as possible.

It would be nice if the pod disruption budget had a maxSurge property similar to rolling updates; then, on a voluntary eviction, Kubernetes could make use of the max surge to scale up temporarily, relocate the pod, and then continue with the scale-down of the nodes.

Bwvolleyball avatar Mar 20 '19 20:03 Bwvolleyball

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Jun 18 '19 21:06 fejta-bot

/remove-lifecycle stale

StevenACoffman avatar Jun 18 '19 22:06 StevenACoffman

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Sep 16 '19 22:09 fejta-bot

/remove-lifecycle stale

2ZZ avatar Sep 28 '19 13:09 2ZZ

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Dec 27 '19 14:12 fejta-bot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot avatar Jan 26 '20 15:01 fejta-bot

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar May 24 '20 22:05 fejta-bot

/remove-lifecycle stale

frittentheke avatar May 25 '20 06:05 frittentheke

This sort of logic is also useful in real production scenarios. If I have 2 pods close to targetAverageUtilization and I start a drain, cutting one off means I'm now close to 50% underprovisioned and will see severe degradation until the HPA kicks in.

One solution is to run a high enough minReplicas, but I'd love Kubernetes to help me not spend more money at low traffic / idle.

A more comprehensive feature that could help is, in pseudo-code:

drainBlocked = pdb.maxUnavailable == 0 || pdb.minAvailable == hpa.currentReplicas
canScaleUp = hpa.maxReplicas > hpa.currentReplicas

if drainBlocked and canScaleUp
  newPod = scaleUp(deploy)
  waitForReady(newPod, SOME_TIMEOUT)
  terminate(oldPod)
else
  fail "drain blocked"

In plain English: when the PDB is blocking but a scale-up would be allowed by the HPA, scale up and ensure readiness before terminating. Only block a drain when at maxReplicas.

This way we have a way to communicate the desire for no degraded capacity or no downtime for sensitive applications (maxUnavailable: 0 or minAvailable == currentReplicas), yet we keep the ability to completely block drains for applications where we simply can't tolerate voluntary disruptions, or that can't tolerate multiple coexisting instances (by running them at maxReplicas). The desire to cover the former comes from my scenario and blandman's. The desire to cover the latter can be inferred from the fact that the PDB docs cover it, and I've seen applications that can't tolerate multiple coexisting instances (Metabase comes to mind, not sure if that's still true).
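
For illustration, the same check expressed against the policy/v1beta1 and autoscaling/v1 Go types. This is a hypothetical helper, not an existing cluster-autoscaler function, and using the PDB's DisruptionsAllowed status collapses the maxUnavailable and minAvailable cases from the pseudo-code:

// Illustrative only: the decision from the pseudo-code above as a tiny Go
// helper. The surrounding control flow that would call this is hypothetical.
package sketch

import (
	autoscalingv1 "k8s.io/api/autoscaling/v1"
	policyv1beta1 "k8s.io/api/policy/v1beta1"
)

func drainDecision(pdb *policyv1beta1.PodDisruptionBudget, hpa *autoscalingv1.HorizontalPodAutoscaler) (blocked, canScaleUp bool) {
	// The PDB blocks the drain when no voluntary disruptions are allowed.
	blocked = pdb.Status.DisruptionsAllowed == 0
	// An HPA leaves head-room when it has not yet reached maxReplicas.
	canScaleUp = hpa != nil && hpa.Status.CurrentReplicas < hpa.Spec.MaxReplicas
	return blocked, canScaleUp
}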

omnibs avatar Aug 20 '20 21:08 omnibs

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Nov 18 '20 22:11 fejta-bot

/remove-lifecycle stale

The logic I describe above is also useful for binpacking. Binpacking happens a lot during the ramp-down of traffic, and it's more likely to catch workloads at a low Pod count.

Terminating a Pod to reschedule it for binpacking when we have 2 pods means we lose 50% compute capacity for a period of time and will cause degradation for workloads running at anything close to or above 50% CPU utilization.

Having the ability to scale up to respect maxUnavailable or minAvailable means we can run workloads with minAvailable: 2 and still binpack effectively at idle, with no service degradation.

omnibs avatar Nov 19 '20 11:11 omnibs

This would be particularly useful for us. We have a similar issue with a RAM-heavy app. Running multiple replicas in prod is fine, but it's wasteful in dev/test as we don't need the availability.

I wish you could specify with a PDB that, during a drain, a strategy similar to a Deployment's RollingUpdate is used: the old pod isn't removed until a new one is created and ready.

This has been discussed in https://github.com/kubernetes/kubernetes/issues/66811, and it is tricky due to the disconnected structure of the Kubernetes components, i.e. the scheduler has no idea about the PDB or availability config; it can only be reactive and can't anticipate a pod disappearing.

GerryWilko avatar Feb 02 '21 09:02 GerryWilko

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot avatar May 03 '21 10:05 fejta-bot

/remove-lifecycle stale

frittentheke avatar May 03 '21 12:05 frittentheke

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

k8s-triage-robot avatar Aug 01 '21 13:08 k8s-triage-robot

/remove-lifecycle stale

MattJeanes avatar Aug 01 '21 13:08 MattJeanes

We also have a scenario where 100% uptime isn't absolutely required, but we would like to keep disruption as low as possible.

We are serving websites from a Rails application, where each Puma server consumes about 700Mi, usually with zero load since these websites are used infrequently. So it would be nice to have a pod disruption budget that keeps at least 1 pod available, preserving uptime while our nodes can still be scaled down automatically.

It would be great if this could be implemented. Is there any chance that this gets done?

h0jeZvgoxFepBQ2C avatar Sep 06 '21 22:09 h0jeZvgoxFepBQ2C

@h0jeZvgoxFepBQ2C If you are happy to run 2 pod replicas at all times, then this is already possible. Set two or more replicas on your deployment and a pod disruption budget with minAvailable: 1.

GerryWilko avatar Sep 07 '21 06:09 GerryWilko

@GerryWilko This is what we do right now, but since most of our servers have zero load most of the time (we work in the event business, where our platform is only used actively for 5-7 days and the servers then usually keep running for another 6-8 months for data queries and statistics, but with nearly zero load and active users), we would like to cut our RAM usage in half, which would massively reduce the budget needed to run our cluster.

We tried to accomplish what the issue author also tried, setting minAvailable: 1, but then our nodes stopped scaling down due to the strict PDB.

h0jeZvgoxFepBQ2C avatar Sep 07 '21 07:09 h0jeZvgoxFepBQ2C

@h0jeZvgoxFepBQ2C Yep so you are running into the same limitations listed here. If you don't need 100% uptime why don't you just remove the disruption budget entirely? Does your service take a long time to start?

Mostly, the shuffling around of pods that causes a drop in your service should only happen during deployments or node pool upgrades, which should be infrequent.

GerryWilko avatar Sep 07 '21 08:09 GerryWilko

@GerryWilko Yes you are correct, the reason why we still would like to have a PDB is to increase availability during deployments (our app takes around 90 seconds to start, so we would be offline during the rollout or node downscaling). Also our nodes are scaling up/down quite often, so we would at least try to keep the downtime to a minimum (but our budget concerns are more important :D)

h0jeZvgoxFepBQ2C avatar Sep 07 '21 08:09 h0jeZvgoxFepBQ2C

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 06 '21 09:12 k8s-triage-robot

/remove-lifecycle stale

still an important feature request

h0jeZvgoxFepBQ2C avatar Dec 06 '21 09:12 h0jeZvgoxFepBQ2C

Looking at the code of the cluster autoscaler, it doesn't seem that hard to implement scaling up either the ReplicaSet, Deployment or HPA to ensure fulfilment of a PDB.

A problem occurs later when you need to scale the resource down again as you delete the pods on the node to be removed: either you scale down first, with the risk that the newly created pods are the ones the RS deletes, or you delete the pods first, causing the RS to spin up new pods before the scale-down.

You ought to be able to fix that using the new Pod deletion cost feature:

  • https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/#pod-deletion-cost
  • https://github.com/kubernetes/enhancements/blob/master/keps/sig-apps/2255-pod-cost/kep.yaml

With that you should be able to set the controller.kubernetes.io/pod-deletion-cost of the pods you want to delete to the lowest possible number first, and then the scale-down of the RS should make those pods terminate. The problem is that in the latest version of Kubernetes (1.23) this feature is still in beta.
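
A hedged client-go sketch of that first step (the annotation key is the real one from the KEP; the helper name and the hard-coded cost value are illustrative only):

// Hypothetical sketch: mark the pods on the node being drained as the
// cheapest to delete, so that a subsequent scale-down of the Deployment
// makes the ReplicaSet controller remove exactly those pods.
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

const podDeletionCost = "controller.kubernetes.io/pod-deletion-cost"

func preferPodsForDeletion(ctx context.Context, cs kubernetes.Interface, pods []*corev1.Pod) error {
	// Annotation values are strings; a very low cost makes these pods the
	// ReplicaSet controller's first choice when the replica count drops.
	patch := []byte(`{"metadata":{"annotations":{"` + podDeletionCost + `":"-1000"}}}`)
	for _, p := range pods {
		if _, err := cs.CoreV1().Pods(p.Namespace).Patch(ctx, p.Name, types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
			return err
		}
	}
	return nil
}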

If the feature in this issue is implemented, I guess the reasonable route would be to enable it through a command-line flag and print a warning if the Kubernetes version is lower than 1.24 (the projected version for this feature to become stable). Also, the documentation (--help output and so on) should be clear about the dependency on the pod deletion cost feature.

msvticket avatar Jan 17 '22 18:01 msvticket

For the common edge case of all pods of a deployment residing on the node to be removed (as in the use cases above where you only want one replica), there would be a much simpler solution: essentially do a kubectl rollout restart deployment. That is implemented by adding an annotation like kubectl.kubernetes.io/restartedAt: "2021-02-15T11:12:54-05:00" to the pod template, which triggers the creation of a new ReplicaSet.
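
In code that could be as small as patching the template annotation, which is what kubectl rollout restart does under the hood (sketch only; the rolloutRestart helper is hypothetical):

// Hypothetical sketch of triggering the equivalent of
// `kubectl rollout restart deployment/<name>` from code: patch the pod
// template with the restartedAt annotation so the Deployment controller
// creates a new ReplicaSet and, under the default RollingUpdate strategy,
// surges a new pod before removing the old one.
package sketch

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

func rolloutRestart(ctx context.Context, cs kubernetes.Interface, namespace, name string) error {
	patch := fmt.Sprintf(
		`{"spec":{"template":{"metadata":{"annotations":{"kubectl.kubernetes.io/restartedAt":%q}}}}}`,
		time.Now().Format(time.RFC3339))
	_, err := cs.AppsV1().Deployments(namespace).Patch(
		ctx, name, types.StrategicMergePatchType, []byte(patch), metav1.PatchOptions{})
	return err
}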

msvticket avatar Jan 17 '22 18:01 msvticket