
Consolidation ttl: `spec.disruption.consolidateAfter`

Open runningman84 opened this issue 2 years ago • 56 comments

Tell us about your request

We have a cluster where there are a lot of cron jobs which run every 5 minutes...

This means we have 5 nodes for our base workloads and every 5 minutes we get additional nodes for 2-3 minutes which are scaled down or consolidated with existing nodes.

This leads to a constant flow of nodes joining and leaving the cluster. It looks like the Docker image pulls and node initialization generate more network traffic fees than we save by not running the instances all the time.

It would be great if we could configure some consolidation time period, maybe together with ttlSecondsAfterEmpty, which would only clean up or consolidate nodes if the capacity had been idling for x amount of time.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Creating a special provisioner is quite time-consuming because all app deployments have to be changed to leverage it...

Are you currently working around this issue?

We are thinking about putting cron jobs into a special provisioner that would not use consolidation but would rely on the ttlSecondsAfterEmpty feature instead.
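
For illustration, here is a minimal sketch of what such a workaround provisioner could look like on the pre-v1beta1 (v1alpha5) API; the name, label, and taint are purely illustrative, and with consolidation left disabled the empty-node TTL does the cleanup:

    # Hypothetical dedicated provisioner for cron-job workloads (karpenter.sh/v1alpha5).
    apiVersion: karpenter.sh/v1alpha5
    kind: Provisioner
    metadata:
      name: cronjobs                 # illustrative name
    spec:
      ttlSecondsAfterEmpty: 300      # reclaim nodes only after 5 minutes of sitting empty
      labels:
        workload-type: cronjobs      # illustrative label for cron jobs to nodeSelect on
      taints:
        - key: workload-type
          value: cronjobs
          effect: NoSchedule         # keep other workloads off these nodes

Cron jobs would then set a matching nodeSelector and toleration so only they land on these nodes.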

Additional Context

No response

Attachments

No response

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

runningman84 avatar Dec 19 '22 08:12 runningman84

We've talked about this a fair bit -- I think it should be combined/collapsed w/ ttlSecondsAfterEmpty.

ellistarn avatar Dec 19 '22 17:12 ellistarn

The challenge with this issue is more technical than anything. Computing ttlSecondsAfterEmpty is cheap, since we can cheaply compute empty nodes. Computing a consolidatable node requires a scheduling simulation across the rest of the cluster. Computing this for all nodes is really computationally expensive. We could potentially compute this once on the initial scan, and again once the TTL is about to expire. However, this can lead to weird scenarios like:

  • t0, node detected underutilized, enqueued to 30s TTL
  • t0+25s, pod added somewhere else in the cluster, making the node no longer consolidatable
  • t0+29s, pod removed somewhere else in the cluster, making the node consolidatable again, should restart TTL
  • t0+30s, node is consolidated, even though it's only been consolidatable for 1 second.

The only way to get the semantic to be technically correct is to recompute the consolidatability for the entire cluster on every single pod creation/deletion. The algorithm described above is a computationally feasible way (equivalent to current calculations), but has weird edge cases. Would you be willing to accept those tradeoffs?

ellistarn avatar Dec 19 '22 17:12 ellistarn

The only way to get the semantic to be technically correct is to recompute the consolidatability for the entire cluster on every single pod creation/deletion. The algorithm described above is a computationally feasible way (equivalent to current calculations), but has weird edge cases. Would you be willing to accept those tradeoffs?

I'm a little unclear on this, and I think it's in how I'm reading it, not in what you've said. What I think I'm reading is that running the consolidatability computation on every single pod creation/deletion is too expensive. As an alternative, the algorithm above is acceptable but in some cases could result in node consolidation in less than TTLSecondsAfterConsolidatable due to fluctuation in cluster capacity between the initial check (t0) and the confirmation check (t0+30s in the example).

Have I understood correctly?

kylebisley avatar Dec 19 '22 21:12 kylebisley

Yeah exactly. Essentially, the TTL wouldn't flip flop perfectly. We'd be taking a rough sample (rather than a perfect sample) of the data.

ellistarn avatar Dec 19 '22 22:12 ellistarn

Thanks for the clarity. For my usage I'd not be concerned about the roughness of the sample. As long as there was a configurable time frame and the confirmation check needed to pass both times I'd be satisfied.

What I thought I wanted before being directed to this issue was to be able to specify how the consolidator was configured a bit like the descheduler project because I'm not really sure if the 'if it fits it sits' approach to scheduling is what I need in all cases.

kylebisley avatar Dec 20 '22 02:12 kylebisley

Specifically, what behavior of descheduler did you want?

ellistarn avatar Dec 20 '22 18:12 ellistarn

Generally I was looking for something like the deschedulerPolicy.strategies config block, which I generally interact with through the Helm values file. More specifically I was looking for deschedulerPolicy.strategies.LowNodeUtilization.params.nodeResourceUtilizationThresholds, i.e. targetThresholds and thresholds.
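
For context, that block in the descheduler Helm chart's values looks roughly like the following; the threshold numbers are purely illustrative:

    # Sketch of the descheduler Helm values being referenced (numbers illustrative).
    deschedulerPolicy:
      strategies:
        LowNodeUtilization:
          enabled: true
          params:
            nodeResourceUtilizationThresholds:
              thresholds:            # nodes below all of these count as underutilized
                cpu: 20
                memory: 20
                pods: 20
              targetThresholds:      # pods are evicted from nodes above any of these
                cpu: 50
                memory: 50
                pods: 50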

kylebisley avatar Dec 21 '22 21:12 kylebisley

Related: https://kubernetes.slack.com/archives/C02SFFZSA2K/p1675103715103399

ellistarn avatar Jan 30 '23 18:01 ellistarn

To give another example of this need, I have a cluster that runs around 1500 pods - there are lots of pods coming in and out at any given moment. It would be great to be able to specify a consolidation cooldown period so that we are not constantly adding/removing nodes. Cluster Autoscaler has the flag --scale-down-unneeded-time that helps with this scenario.

c3mb0 avatar Feb 09 '23 13:02 c3mb0

Is this feature available yet?

sichiba avatar Feb 24 '23 14:02 sichiba

We are facing the same issue with high node rotation due to too-aggressive consolidation. It would be nice to tune and control the behaviour, e.g. a minimum node liveness TTL, a threshold TTL since a node became empty or underutilised, and merging nodes.

agustin-dona-peya avatar Mar 23 '23 13:03 agustin-dona-peya

cluster-autoscaler has other options too like:

--scale-down-delay-after-add, --scale-down-delay-after-delete, and --scale-down-delay-after-failure flags. E.g. --scale-down-delay-after-add=5m to decrease the scale-down delay to 5 minutes after a node has been added.

I'm looking forward to something like scale-down-delay-after-add to pair with consolidation. Our hourly cronjobs are also causing node thrashing.
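
For comparison, those delays are passed as container arguments on the cluster-autoscaler Deployment, roughly like this (values illustrative, not Karpenter configuration):

    # Pod-spec fragment of a cluster-autoscaler Deployment.
    containers:
      - name: cluster-autoscaler
        args:
          - --scale-down-delay-after-add=5m       # wait 5m after a scale-up before considering scale-down
          - --scale-down-delay-after-delete=0s
          - --scale-down-delay-after-failure=3m
          - --scale-down-unneeded-time=10m        # a node must be unneeded this long before removal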

calvinbui avatar Apr 13 '23 01:04 calvinbui

Another couple of situations that currently lead to high node churn are:

  • A "high impact" rollout across various namespaces or workloads results in a large amount of resources being allocated. This spike in allocation is temporary, but Karpenter will provision new nodes as a result. After the rollout is complete, capacity will normally return back to normal, resulting in a consolidation attempt. This means node TTLs can be ~10-15 minutes depending on how long the rollout takes.
  • A batch of cron jobs that are scheduled at the same time and have a specific set of requests. This will also likely result in creating new node(s). Once the jobs are complete, there will likely be free capacity that will prompt Karpenter to consolidate.

In both situations above, some workloads end up being restarted multiple times within a short time frame due to node churn, and if not enough replicas are configured with sufficient anti-affinity/skew, there is a chance of downtime while pods become ready once again on new nodes.

It would be nice to be able to control the consolidation period, say every 24 hours or every week as described by the OP so it's less disruptive. Karpenter is doing the right thing though!

I suspect some workarounds could be:

  • simply provisioning additional capacity to accommodate rollouts
  • possible use of nodeSelectors for the scheduled jobs to run on without impacting other longer running workloads

Any other ideas or suggestions appreciated.

tareks avatar Apr 13 '23 18:04 tareks

Adding here as another use case where we need better controls over consolidation, esp. around utilization. For us, there's a trade-off between utilization efficiency and disruptions caused by pod evictions. For instance, let's say I have 3 nodes, each utilized at 60%, so current behavior is Karpenter will consolidate down to 2 nodes at 90% capacity. But, in some cases, evicting the pods on the node to be removed is more harmful than achieving optimal utilization. It's not that these pods can't be evicted (for that we have the do-not-drain annotation) it's just that it's not ideal ... good example would be Spark executor pods that while they can recover from a restart, it's better if they are allowed to finish their work at the expense of some temporary inefficiency in node utilization.

CAS has the --scale-down-utilization-threshold (along with the other flags mentioned) and seems like Karpenter needs a similar tunable. Unfortunately, we're seeing so much disruption in running pods b/c of consolidation that we can't use Karpenter in any of our active clusters.

thelabdude avatar Apr 28 '23 19:04 thelabdude

@thelabdude can't your pods set terminationGracePeriodSeconds https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination ?
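
For reference, that is a plain pod-spec field; a minimal fragment (value illustrative):

    # Give containers up to 5 minutes to finish in-flight work after SIGTERM during a drain.
    spec:
      terminationGracePeriodSeconds: 300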

FernandoMiguel avatar May 03 '23 15:05 FernandoMiguel

I'll have to think about whether the termination grace period could help us, but I wouldn't know what value to set and it would probably vary by workload ...

My point was more, I'd like better control over the consolidate decision with Karpenter. If I have a node hosting expensive pods (in terms of restart cost), then a node running at 55% utilization (either memory / cpu) may be acceptable in the short term even if the ideal case is to drain off the pods on that node to reschedule on other nodes. Cluster Auto-scaler provides this threshold setting and doesn't require special termination settings on the pods.

I'm not saying a utilization threshold is the right answer for Karpenter but the current situation makes it hard to use in practice because we get too much pod churn due to consolidation and our nodes are never empty, so turning consolidation off isn't a solution either.

thelabdude avatar May 03 '23 16:05 thelabdude

Hey @thelabdude, this is a good callout of core differences between CA's deprovisioning and Karpenter's deprovisioning. Karpenter has intentionally chosen not to use a threshold, because for any threshold you create, the heterogeneous nature of pod resource requests can produce unwanted edge cases that constantly need to be fine-tuned.

For more info, ConsolidationTTL here would simply act as a waiting mechanism between consolidation actions, which you can read more about here. Since this would essentially just be a wait, it would simply slow down the time Karpenter takes to get to the end state you've described. One idea that may help is if Karpenter allowed some configuration of the cost-benefit analysis that consolidation does. This would need to be framed as either cost or utilization, both tough to get right.

If you're able to in the meantime, you can set do-not-evict on the pods you don't want consolidated, and you can also use the do-not-consolidate node annotation. More here.
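
As a sketch, those two escape hatches are annotations under the pre-v1beta1 API (later versions rename them, so check the docs for your release):

    # Pod (or pod template) metadata: Karpenter will not voluntarily evict this pod,
    # which also prevents consolidating the node it runs on.
    metadata:
      annotations:
        karpenter.sh/do-not-evict: "true"
    ---
    # Node metadata: Karpenter skips this node entirely when considering consolidation.
    metadata:
      annotations:
        karpenter.sh/do-not-consolidate: "true"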

njtran avatar May 08 '23 17:05 njtran

Are there any plans to implement or accept a feature that adds some sort of time delay between node provisioning and consolidation? Perhaps based on the age of a node? The main advantage would be increased stability during surges in workload (scaling, scheduling, or rollouts).

tareks avatar May 14 '23 12:05 tareks

Hey, could you just add a delay before starting consolidation after pod changes?

You can add several delays:

  1. After last deployed pod
  2. After last consolidation round

This would help run consolidation during periods of low activity on the cluster.

Hronom avatar May 18 '23 22:05 Hronom

Also see issue https://github.com/aws/karpenter-core/issues/696: Exponential decay for cluster desired size

sftim avatar Jun 05 '23 16:06 sftim

This comment suggests another approach we might consider.

My point was more, I'd like better control over the consolidate decision with Karpenter. If I have a node hosting expensive pods (in terms of restart cost), then a node running at 55% utilization (either memory / cpu) may be acceptable in the short term even if the ideal case is to drain off the pods on that node to reschedule on other nodes. Cluster Auto-scaler provides this threshold setting and doesn't require special termination settings on the pods.

(from https://github.com/aws/karpenter-core/issues/735)

Elsewhere in Kubernetes, ReplicaSets can pay attention to a Pod deletion cost.

For Karpenter, we could have a Machine or Node level deletion cost, and possibly a contrib controller that raises that cost based on what is running there.

Imagine that you have a controller that detects when Pods are bound to a Node, and updates the node deletion cost based on some quality of the Pod. For example: if you have a Pod annotated as starts-up-slowly, you set the node deletion cost for that node to 7 instead of the base value of 0. You'd also reset the value once the node didn't have any slow starting Pods.
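
A rough sketch of that idea: the pod-level annotation already exists and is honored by ReplicaSets, while the node-level key below is purely hypothetical and only illustrates the proposal:

    # Existing mechanism: ReplicaSets prefer to delete pods with a lower deletion cost.
    metadata:
      annotations:
        controller.kubernetes.io/pod-deletion-cost: "100"
    ---
    # Hypothetical node-level analogue (not an existing Karpenter API): a contrib
    # controller could raise this while slow-starting pods are bound to the node
    # and reset it to 0 once they are gone.
    metadata:
      annotations:
        karpenter.sh/node-deletion-cost: "7"    # hypothetical annotation key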

sftim avatar Jun 05 '23 16:06 sftim

We are in need of something like this, as well. Consolidation is too aggressive for our application rollouts, and is causing more issues and failures than make it worth the cost of running extra nodes.

Ideally, we'd like Karpenter to have the capability to recognize that it just added nodes and shouldn't immediately throw more churn into the mix by deprovisioning nodes, especially before all the pods that triggered the initial node creation are ready and available. Some options that would help:

  • a configurable knob where the current hardcoded 10s poll interval is, so that we can make the consolidation check more like once an hour - we're not in so much of a crunch that an hour of extra compute would bankrupt us.
  • a setting like scale-down-delay-after-add as discussed above, used in cluster-autoscaler, to force Karpenter to allow some time for everything to become healthy before removing nodes
  • a ttlSecondsAfterUnderutilized setting within the consolidation configuration block (sketched just after this list), which would require Karpenter to make a first assessment that a node could be consolidated, and only if it still finds the same recommendation after the TTL would it work to consolidate that node. This means that if other activity occurs during that wait time (e.g. pods added or removed, instance prices change, etc.) the evaluation may come to a different conclusion and the timer restarts. Yes, this means that a really high TTL or a high-churn cluster would struggle to actually have a consolidation take place, but if a user wants to configure this, then that is what they want -- they want consolidation to be less aggressive and to occur less often -- so let them.
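
As a sketch of those options on the provisioner, every field below is hypothetical and none of it exists in Karpenter today:

    # Hypothetical consolidation block illustrating the options above.
    spec:
      consolidation:
        enabled: true
        pollInterval: 1h                     # hypothetical: replaces the hardcoded 10s check
        delayAfterNodeAdd: 15m               # hypothetical: like CAS --scale-down-delay-after-add
        ttlSecondsAfterUnderutilized: 600    # hypothetical: node must stay consolidatable for 10m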

tstraley avatar Aug 22 '23 21:08 tstraley

As we're thinking about how to introduce better controls for consolidation, one of the questions we've come up against is whether or not Karpenter users care about having different TTLs or "knobs" for terminating under-utilized nodes compared to empty nodes.

React to this message with a 👍 if you'd prefer multiple, separate "knobs" for emptiness and under-utilization or use a 👎 if you'd like a single control for both. If you could also share a bit of detail about your use case below, that'd be even better!

akestner avatar Aug 22 '23 21:08 akestner

Not a use case exactly but if Karpenter doesn't duplicate kube-scheduler, and that's by design, then I think I also wouldn't duplicate descheduler and alternatives. That was why I picked :-1:.

If we one day want to enable complex behavior for selecting when and how to delete only some Pods from a partially empty node, I'd ideally want to coordinate with the Kubernetes project. That coordination is to find and agree on a way to mark (annotate or label):

  1. Pods that some tool thinks should be evicted, eg due to descheduling
  2. Nodes that some tool, such as Karpenter, recommends bringing into use (countermanding the previous proposed-eviction mark)

BTW Karpenter doesn't need a co-ordination point for it, but tainting a node that is due for removal means Pods shouldn't schedule there (if a whole load of unschedulable Pods turn up, Karpenter can always remove the taint and cancel an intended consolidation).

In that case of cancelled consolidation, I'm imagining that Karpenter also identifies the Pods labelled as pending eviction (for low node utilization) - and annotates the node to tell descheduler “wait, no, not the Pods on this node”.

Setting up those expectations will let cluster operators implement consolidation that fits their use case, by combining custom scheduling, custom descheduling, and custom [Karpenter] autoscaling. There are other designs such as a node deletion cost. Overall, I hope that we - Karpenter - find a way to play well with others for complex needs, and still meet the simple needs for cluster operators who are happy with the basic implementation.

Within Karpenter's domain - nodes and machines - it's fine to have customization because managing Nodes is what Karpenter is for. So, for example, Karpenter could wait some defined number of seconds after one consolidation operation before planning another. No objection to that delay, and it'd help manage hysteresis. Similarly, an are-you-sure period: “require Karpenter to make the first assessment that a node could be consolidated, and if after this ttl it still finds the same recommendation, then and only then would it work to consolidate that node“ sounds fine.

It's only when any of these knobs cross into the domain of scheduling and descheduling that I have concerns.

sftim avatar Aug 23 '23 04:08 sftim

We have been really satisfied so far with how Karpenter behaves in our smaller clusters and staging. A couple of days ago I updated the first production cluster - one that really has customer traffic and scales. Karpenter works well enough, but I found we have a number of deployments that tend to scale aggressively up and down, even within one hour. I am attaching a graph with the number of replicas. This is not the only deployment that behaves like this, and they cause Karpenter to add ~5 nodes several times per hour, just for those nodes to be removed a couple of minutes later and wait for another round. I asked the team if this scaling really works for them and the answer was yes, this is fine for them. They could smooth it with some HPA/v2 features, but it's not really needed - it would be for Karpenter's sake.

I think in this case, if the nodes waited for a while (a configurable number of seconds or minutes) in the cluster, it would lower the total % of allocatable CPU utilised, but it would also lower the total churn. Because we run an overprovisioning deployment to give us some free capacity buffer, this is not disrupting cluster workloads that much, but we want to get rid of overprovisioning to lower cluster costs - one of the reasons being that with Karpenter the scaling is even faster, so we don't need the extra buffer.

We are, however, pausing the Karpenter rollout to the remaining production clusters to see how this big node churn affects the cost of the cluster (is it even better than with CAS?) and the already mentioned traffic cost.

I know there was a similar use case in this issue already, but I thought I could support it with ours. The picture shows the number of replicas over time. These are the requests/limits for this deployment:

    Limits:
      memory:  12Gi
    Requests:
      cpu:     600m
      memory:  2Gi
[attached image: number of replicas over time]

jan-ludvik avatar Sep 08 '23 12:09 jan-ludvik

As we're thinking about how to introduce better controls for consolidation, one of the questions we've come up against is whether or not Karpenter users care about having different TTLs or "knobs" for terminating under-utilized nodes compared to empty nodes.

Our cluster that sees the most variation in size is primarily used for CI jobs. With this type of workload:

  • We burst up from effectively zero to $LARGE_NUMBER_FOR_US nodes, depending on the work days
  • Requests vary widely: we might have a lot of activity during an hour while engineers are iterating, one big build, or everyone goes out to lunch that day.
  • Humans are waiting for many of the results, so we are sensitive to cold start time and churn.

(We also have a bunch of ML batch-y workloads that we want to use Karpenter more for that follow similar patterns.)

I think what would be most useful is something like aws/karpenter-core#696. That balances retaining capacity with having bursts of activity look like peaks instead of (expensive) plateaus. I don't particularly care whether capacity is reduced from empty nodes or ones with low utilization (Karpenter optimizing it based on disruption budgets and cost sounds fine). So my answer to the "number of knobs" question is "as many as are needed to have something like exponentially decaying capacity", but not more.

cburroughs avatar Sep 08 '23 13:09 cburroughs

Just to add another issue that is caused by the fast scaledown.

In our particular use case there are a lot of organization-wide AWS Config rules that get evaluated every time a node comes up.

So on those days where there are a lot of bursts of CI jobs, we end up paying as much for AWS Config as we do for EC2. We've reached a point where we're considering whether keeping Karpenter is still viable :disappointed:

mamoit avatar Sep 08 '23 14:09 mamoit

We're primarily seeing this when rolling out a replacement of a large deployment, 200+ pods. Karpenter goes absolutely crazy during this scale out/in, to the point where the AWS load balancer controller we use starts to run into reconciliation throttling due to the massive amount of movement during the deployment. New containers will end up on a new node that only lives for 5 minutes. We see many nodes come up for 5 or 10 minutes during this one rolling deploy before things settle down. Sometimes it gets into a cycle where the rollout takes 30 minutes, whereas without consolidation it would take 3.

We're considering doing a patch of the provisioner to temporarily disable consolidation right before the rollout starts, waiting for the deployment to normalize, and then turning consolidation back on. This feels really ugly, but I think it would work in our specific case.
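
For what it's worth, the patch itself would be small under the v1alpha5 API; something like the following merge patches applied with kubectl patch before and after the rollout (field layout assumes a v1alpha5 Provisioner):

    # Merge patch applied before the rollout: turn consolidation off.
    spec:
      consolidation:
        enabled: false
    ---
    # Merge patch applied once the deployment has settled: turn it back on.
    spec:
      consolidation:
        enabled: true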

I wonder if some kind of pattern for this would be useful. I can't think of a great interface off the top of my head, but some way to signal a "pause" of consolidation for a period of time that doesn't mess with the provisioner, so that our Helm charts containing the provisioner template could always cleanly apply. Maybe:

  • Being able to trigger a consolidation on demand, leaving it up to the customer to kind of force a consolidation on their own schedule
  • A way to have a cron like schedule for consolidation

If we were able to make a consolidation happen when we decided, that would likely help us. We could run it outside working hours when we're shipping a lot.

bshelton229 avatar Oct 18 '23 20:10 bshelton229

The documentation in 0.32.1 for consolidateAfter is very confusing. Specifically, it states:

ConsolidateAfter is the duration the controller will wait before attempting to terminate nodes that are underutilized.

But it's not compatible with WhenUnderutilized! From the description, it sounds like it applies to "nodes that are underutilized", but that is not the case.
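
To make the current behaviour concrete, the combination that is accepted around v0.32 looks roughly like this NodePool fragment; pairing consolidateAfter with WhenUnderutilized is not accepted:

    # NodePool disruption fragment (karpenter.sh/v1beta1): consolidateAfter is only
    # honoured together with consolidationPolicy: WhenEmpty.
    spec:
      disruption:
        consolidationPolicy: WhenEmpty
        consolidateAfter: 5m       # wait this long after a node becomes empty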

thelabdude avatar Nov 06 '23 17:11 thelabdude

Hello everyone! I do agree with @thelabdude. I was expecting to use consolidateAfter along with WhenUnderutilized.

Our use case is similar to the ones mentioned above. I would really like Karpenter to optimize underutilized workloads.

The problem is that this can't happen during our application deployment, otherwise Spinnaker gets lost during enable/disable traffic for the blue/green deployment. Spinnaker will try to add a label to a pod that doesn't exist anymore due to Karpenter deleting the nodes.

consolidateAfter works perfectly for this, but it should work along with WhenUnderutilized.

I hope that makes sense. Are there any plans to support this?

mullermateus avatar Dec 20 '23 16:12 mullermateus