
feat: implement ConsolidateAfter

Open njtran opened this pull request 9 months ago • 5 comments

Fixes #735

Description: This implements support for setting ConsolidateAfter with ConsolidationPolicy: WhenUnderutilized.

This makes core changes across the codebase:

  1. The disruption controller no longer has three types of consolidation; it only looks for the v1beta1.Underutilized condition on NodeClaims, optimistically relying on pre-defined batches.
     a. This in turn removes ConsolidationType() from the disruption methods, removing those labels from the metrics as well.
     b. This also removes the consolidation batching timeouts, since the disruption controller no longer spends a long time making consolidation decisions.
  2. Cluster state now watches and maintains NodePool mappings rather than just being used for marking the cluster as unconsolidated.
  3. Cluster state now has an IsDisruptable() function, which inspects cluster state to determine whether a NodeClaim should be considered disruptable.
  4. Adds an Underutilized NodeClaim status condition. When a NodeClaim is marked Underutilized, the condition's Message includes a UUID that can be used to group NodeClaims belonging to the same consolidation command (see the sketch below this list).
  5. Moves some of the common disruption functions into the utils/disruption package.
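
A minimal sketch of the UUID-grouping idea from point 4, using stand-in types rather than Karpenter's actual NodeClaim or status-condition APIs:

```go
// Illustrative only: these types are simplified stand-ins, not Karpenter's APIs.
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// condition is a simplified stand-in for a NodeClaim status condition.
type condition struct {
	Type    string
	Message string // carries the shared command ID for Underutilized
}

// nodeClaim is a simplified stand-in for a NodeClaim.
type nodeClaim struct {
	Name       string
	Conditions []condition
}

// newCommandID returns a random ID used to group NodeClaims that were
// selected together into one consolidation command.
func newCommandID() string {
	b := make([]byte, 16)
	_, _ = rand.Read(b)
	return hex.EncodeToString(b)
}

// markUnderutilized stamps the same command ID onto every NodeClaim in the
// batch, mirroring the "UUID in the Message" grouping described above.
func markUnderutilized(batch []*nodeClaim) string {
	id := newCommandID()
	for _, nc := range batch {
		nc.Conditions = append(nc.Conditions, condition{Type: "Underutilized", Message: id})
	}
	return id
}

// groupByCommand recovers the batches later by reading the condition message
// instead of re-running the consolidation computation.
func groupByCommand(claims []*nodeClaim) map[string][]*nodeClaim {
	groups := map[string][]*nodeClaim{}
	for _, nc := range claims {
		for _, c := range nc.Conditions {
			if c.Type == "Underutilized" {
				groups[c.Message] = append(groups[c.Message], nc)
			}
		}
	}
	return groups
}

func main() {
	a, b := &nodeClaim{Name: "claim-a"}, &nodeClaim{Name: "claim-b"}
	id := markUnderutilized([]*nodeClaim{a, b})
	fmt.Println("command", id, "->", len(groupByCommand([]*nodeClaim{a, b})[id]), "claims")
}
```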

TODO:

  1. Test profiling
  2. Add in suite tests for nodeclaim.underutilization controller
  3. Do scale tests
  4. Split this into smaller, more consumable PRs

How was this change tested?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

njtran avatar May 01 '24 18:05 njtran

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar May 01 '24 18:05 k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: njtran

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar May 01 '24 18:05 k8s-ci-robot

The PR description includes a TODO. You might like to mark this as a draft PR @njtran

sftim avatar May 03 '24 15:05 sftim

Good call @sftim, thanks

njtran avatar May 07 '24 18:05 njtran

Whew, I did it 😆 Nice work! I'm excited to get this rolling!

jonathan-innis avatar May 09 '24 03:05 jonathan-innis

From #1222, I'm no longer 100% clear what this means in practice:

how long a node has to not have churn against it for us to consider it for consolidation

I would rather this consider whether any nodes in the pool have been created recently, not just whether a single node should be churned (unless the node to be removed is always the newest, which probably isn't the default?).

willthames avatar Jun 19 '24 00:06 willthames

This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

github-actions[bot] avatar Jul 03 '24 12:07 github-actions[bot]

/remove-lifecycle stale

Nuru avatar Jul 03 '24 18:07 Nuru

Hey all, after some more design discussion, we've decided not to go with this approach, as it was computationally expensive, hard to reason about, and had to take too many guesses as to how the kube-scheduler would schedule pods in relation to our simulations.

Alternatively, the approach we're now running with for v1 is using consolidateAfter to filter out nodes for consolidation. When you set consolidationPolicy: WhenUnderutilized and (for example) set consolidateAfter: 1m, Karpenter will not consolidate nodes that have had a pod scheduled or removed in the last minute. To achieve the same Consolidation behavior prior to v1, you can set consolidateAfter: 0s.
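
To make the filtering concrete, here is a minimal sketch of that eligibility check, using hypothetical names rather than Karpenter's real code paths:

```go
// Illustrative sketch only: the real filtering is keyed off cluster state,
// not this standalone struct.
package main

import (
	"fmt"
	"time"
)

// nodeActivity tracks the last time a pod was scheduled to or removed from a node.
type nodeActivity struct {
	Name         string
	LastPodEvent time.Time // most recent pod schedule or removal on this node
}

// consolidatable returns true only when the node has had no pod churn for at
// least consolidateAfter; consolidateAfter: 0s keeps the pre-v1 behavior of
// the node always being eligible.
func consolidatable(n nodeActivity, consolidateAfter time.Duration, now time.Time) bool {
	return now.Sub(n.LastPodEvent) >= consolidateAfter
}

func main() {
	now := time.Now()
	n := nodeActivity{Name: "node-a", LastPodEvent: now.Add(-30 * time.Second)}
	fmt.Println(consolidatable(n, time.Minute, now)) // false: pod churn 30s ago is within 1m
	fmt.Println(consolidatable(n, 0, now))           // true: consolidateAfter of 0s
}
```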

Let us know what you think of this approach.

njtran avatar Jul 08 '24 17:07 njtran

@njtran wrote:

Alternatively, the approach we're now running with for v1 is using consolidateAfter to filter out nodes for consolidation. When you set consolidationPolicy: WhenUnderutilized and (for example) set consolidateAfter: 1m, Karpenter will not consolidate nodes that have had a pod scheduled or removed in the last minute. To achieve the same Consolidation behavior prior to v1, you can set consolidateAfter: 0s.

I don't know why you would want pod removals to reset the timer.

I'm thinking more along the lines of:

  • When a Node appears underutilized, label it with the current timestamp and taint it with a PreferNoSchedule taint
  • Continue to check the node at whatever interval you are using for consolidation checking.
    • If the Node is no longer underutilized at the time you check it, remove the taint and label
    • If the labeled timestamp + consolidateAfter is in the past, and the Node is still underutilized (and no Pods are annotated "do-not-disable"), shut it down

I don't think the case of a pod being scheduled and then deleted between checks is worth worrying about. If it is too expensive to calculate whether or not the Node is underutilized at each check, find some way to cache or estimate it, or just give up and only check at timestamp + consolidateAfter.
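
A rough sketch of that label/taint flow, with hypothetical label and taint keys (this is not Karpenter's implementation):

```go
// Sketch of the proposed flow: record detection time, add a PreferNoSchedule
// taint, and only shut the node down once the timestamp plus consolidateAfter
// has passed and the node is still underutilized.
package main

import (
	"fmt"
	"strconv"
	"time"

	corev1 "k8s.io/api/core/v1"
)

const (
	underutilizedSinceLabel = "example.dev/underutilized-since" // hypothetical
	underutilizedTaintKey   = "example.dev/underutilized"       // hypothetical
)

// markUnderutilized records the detection time in a label (Unix seconds, since
// label values cannot hold RFC 3339 timestamps) and adds a PreferNoSchedule
// taint so the scheduler avoids the node without evicting anything.
func markUnderutilized(node *corev1.Node, now time.Time) {
	if node.Labels == nil {
		node.Labels = map[string]string{}
	}
	node.Labels[underutilizedSinceLabel] = strconv.FormatInt(now.Unix(), 10)
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    underutilizedTaintKey,
		Effect: corev1.TaintEffectPreferNoSchedule,
	})
}

// clearUnderutilized removes the label and taint when the node is busy again.
func clearUnderutilized(node *corev1.Node) {
	delete(node.Labels, underutilizedSinceLabel)
	taints := node.Spec.Taints[:0]
	for _, t := range node.Spec.Taints {
		if t.Key != underutilizedTaintKey {
			taints = append(taints, t)
		}
	}
	node.Spec.Taints = taints
}

// shouldShutDown is the periodic check: still underutilized, no blocking pods,
// and the recorded timestamp plus consolidateAfter is in the past.
func shouldShutDown(node *corev1.Node, stillUnderutilized, hasBlockingPods bool, consolidateAfter time.Duration, now time.Time) bool {
	raw, ok := node.Labels[underutilizedSinceLabel]
	if !ok {
		return false
	}
	since, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return false
	}
	deadline := time.Unix(since, 0).Add(consolidateAfter)
	return stillUnderutilized && !hasBlockingPods && now.After(deadline)
}

func main() {
	node := &corev1.Node{}
	markUnderutilized(node, time.Now().Add(-2*time.Minute))
	fmt.Println(shouldShutDown(node, true, false, time.Minute, time.Now())) // true
	clearUnderutilized(node)
}
```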

Nuru avatar Jul 08 '24 22:07 Nuru

I'd rather consider node changes than pod changes.

I don't want too much churn in my cluster too quickly - if a node has been added in the last ten minutes, don't remove any node

The reason for this is we have some applications with pods so large that they regularly trigger a new node - if the consolidation is too keen then we get multiple node creations and deletions in a ten minute period.

If we can say not to remove a node within ten minutes of node creation then that problem mostly goes away
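
A minimal sketch of that pool-level cooldown, assuming a hypothetical helper that can see every node's creation time (not an existing Karpenter setting):

```go
// Skip consolidation entirely while any node in the pool is younger than the cooldown.
package main

import (
	"fmt"
	"time"
)

// consolidationAllowed returns false if any node in the pool was created
// within the cooldown window, so recent scale-ups pause node removal.
func consolidationAllowed(nodeCreationTimes []time.Time, cooldown time.Duration, now time.Time) bool {
	for _, created := range nodeCreationTimes {
		if now.Sub(created) < cooldown {
			return false
		}
	}
	return true
}

func main() {
	now := time.Now()
	pool := []time.Time{now.Add(-30 * time.Minute), now.Add(-5 * time.Minute)}
	fmt.Println(consolidationAllowed(pool, 10*time.Minute, now)) // false: one node is only 5m old
}
```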

willthames avatar Jul 08 '24 23:07 willthames

When a Node appears underutilized

Hey @Nuru, this is one of the toughest things to compute with the approach in this PR. Finding if a node is "underutilized" in Karpenter semantics really means "was this one of the first nodes that we could reschedule all the pods on the node elsewhere in the cluster and save on cost." For small clusters, this can be computed in the order of milliseconds, but in large clusters this can take on the order of minutes, maybe even longer.

@willthames @Nuru A thought I had was that reasoning about whether a node is underutilized comes down to when pods schedule/deschedule. If a pod just recently scheduled to my node, I may want to consider the node utilized, rather than continually churn my pods. If a pod was recently evicted from my node, I may want to have some leeway before Karpenter considers the node underutilized.

njtran avatar Jul 09 '24 14:07 njtran

@njtran wrote:

When a Node appears underutilized

Hey @Nuru, this is one of the toughest things to compute with the approach in this PR. Finding if a node is "underutilized" in Karpenter semantics really means "was this one of the first nodes that we could reschedule all the pods on the node elsewhere in the cluster and save on cost." For small clusters, this can be computed in the order of milliseconds, but in large clusters this can take on the order of minutes, maybe even longer.

OK, I covered this above.

You are going to have to compute "underutilized" at some point, or else you will never get anywhere. (If you are planning to get rid of "WhenUnderutilized" altogether, that is a different discussion.) Bin packing is NP-hard, so you are probably going with an estimate or heuristic to begin with (and if not, you should be), and you will want to come up with a similarly efficient shortcut to computing "WhenUnderutilized", which you can then verify in polynomial time so you don't accidentally scale down a Node that would have to be replaced.

But if you don't want to compute it every time, just cordon (with PreferNoSchedule) the Node the first time you find it to be underutilized. Then you can just leave it like that until "ConsolidateAfter" or a scale-up event, whichever comes first. If a scale-up event happens, then by definition none of the existing nodes are underutilized, so clear the cordons and timers. Otherwise, recheck the cordoned node at "ConsolidateAfter" time. If it is still underutilized at that time, delete the Node.

In this algorithm, you only have to determine if a Node is underutilized twice. If that is still too expensive, you can just record how utilized the Node is (how much CPU and Memory have been allocated) when you first cordon it, then at "ConsolidateAfter" time, verify that no more CPU or Memory has been allocated (indicating that PreferNoSchedule has been honored) and no other Nodes have been removed from the cluster.
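
A small sketch of that cheaper recheck, with hypothetical helpers for summing a node's allocated requests and comparing against the snapshot taken at cordon time:

```go
// Record the node's allocated requests when it is first cordoned, then at
// ConsolidateAfter time confirm nothing more has been allocated.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// allocated sums the resource requests of the pods currently on the node.
func allocated(pods []corev1.Pod) corev1.ResourceList {
	total := corev1.ResourceList{}
	for _, pod := range pods {
		for _, c := range pod.Spec.Containers {
			for name, req := range c.Resources.Requests {
				sum := total[name]
				sum.Add(req)
				total[name] = sum
			}
		}
	}
	return total
}

// grewSince reports whether any resource is now requested in a larger amount
// than in the snapshot taken at cordon time.
func grewSince(snapshot, current corev1.ResourceList) bool {
	for name, cur := range current {
		prev := snapshot[name]
		if cur.Cmp(prev) > 0 {
			return true
		}
	}
	return false
}

func main() {
	pod := corev1.Pod{Spec: corev1.PodSpec{Containers: []corev1.Container{{
		Resources: corev1.ResourceRequirements{Requests: corev1.ResourceList{
			corev1.ResourceCPU: resource.MustParse("500m"),
		}},
	}}}}
	snapshot := allocated([]corev1.Pod{pod})
	fmt.Println(grewSince(snapshot, allocated([]corev1.Pod{pod, pod}))) // true: more CPU requested since cordon
}
```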

Nuru avatar Jul 10 '24 11:07 Nuru