karpenter
feat: implement ConsolidateAfter
Fixes #735
Description
This allows users to set ConsolidateAfter with ConsolidationPolicy: WhenUnderutilized.
This makes core changes across the codebase:
- The disruption controller no longer has three types of consolidation; it only looks for the v1beta1.Underutilized condition on NodeClaims, optimistically relying on pre-defined batches.
  - This in turn removes ConsolidationType() from the disruption methods, removing those labels from the metrics as well.
  - This also removes the consolidation batching timeouts, as we no longer spend a long time making consolidation decisions in the disruption controller.
- Cluster state now watches and maintains NodePool mappings rather than just being used for marking the cluster as unconsolidated.
- Cluster state now has an IsDisruptable() function, which checks details within cluster state to determine when a NodeClaim should be considered disruptable.
- Adds an Underutilized NodeClaim status condition. When marking a NodeClaim as Underutilized, the Message will have a UUID that can be used to group NodeClaims in the same consolidation command (see the sketch after this list).
- Moves some of the common disruption functions into the utils/disruption package.
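To make the UUID grouping concrete, here is a minimal Go sketch of tagging a batch of NodeClaims with a shared command ID via an Underutilized condition. The Condition and NodeClaim types and the markUnderutilized helper are hypothetical stand-ins, not the actual Karpenter API.

```go
package main

import (
	"fmt"

	"github.com/google/uuid"
)

// Hypothetical, simplified stand-ins for the NodeClaim status types; the real
// Karpenter API types differ.
type Condition struct {
	Type    string
	Status  string
	Message string
}

type NodeClaim struct {
	Name       string
	Conditions []Condition
}

// markUnderutilized tags each NodeClaim with an Underutilized condition whose
// Message carries a shared command ID, so the disruption controller can later
// group these claims into one consolidation command.
func markUnderutilized(claims []*NodeClaim) string {
	commandID := uuid.NewString()
	for _, nc := range claims {
		nc.Conditions = append(nc.Conditions, Condition{
			Type:    "Underutilized",
			Status:  "True",
			Message: commandID,
		})
	}
	return commandID
}

func main() {
	claims := []*NodeClaim{{Name: "nodeclaim-a"}, {Name: "nodeclaim-b"}}
	id := markUnderutilized(claims)
	fmt.Printf("grouped %d NodeClaims under consolidation command %s\n", len(claims), id)
}
```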
TODO:
- Test profiling
- Add in suite tests for nodeclaim.underutilization controller
- Do scale tests
- Split this into smaller, more consumable PRs
How was this change tested?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: njtran
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [njtran]
Approvers can indicate their approval by writing /approve
in a comment
Approvers can cancel approval by writing /approve cancel
in a comment
The PR description includes a TODO. You might like to mark this as a draft PR, @njtran.
Good call @sftim, thanks
Whew, I did it 😆 Nice work! I'm excited to get this rolling!
From #1222, I'm no longer 100% clear what this means in practice
how long a node has to not have churn against it for us to consider it for consolidation
I would rather this consider whether any nodes in the pool have been created, not just whether a single node should be churned (unless the node to be removed is always the newest, which probably isn't the default?).
This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.
/remove-lifecycle stale
Hey all, after some more design discussion, we've decided not to go with this approach: it was computationally expensive, hard to reason about, and required too much guessing about how the kube-scheduler would schedule pods relative to our simulations.
Alternatively, the approach we're now running with for v1 is using consolidateAfter to filter out nodes for consolidation. When you set consolidationPolicy: WhenUnderutilized and (for example) set consolidateAfter: 1m, Karpenter will not consolidate nodes that have had a pod scheduled or removed in the last minute. To achieve the same Consolidation behavior prior to v1, you can set consolidateAfter: 0s.
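As a rough illustration of that filter, here is a minimal Go sketch, assuming a hypothetical candidate type that tracks the last time a pod was scheduled to or removed from a node; the names are illustrative, not the actual implementation.

```go
package main

import (
	"fmt"
	"time"
)

// candidate is a hypothetical stand-in for a consolidation candidate node.
type candidate struct {
	name         string
	lastPodEvent time.Time // last time a pod was scheduled to or removed from the node
}

// filterByConsolidateAfter keeps only nodes whose last pod schedule/removal
// happened at least consolidateAfter ago. With consolidateAfter set to 0s,
// every node passes, matching the pre-v1 consolidation behavior.
func filterByConsolidateAfter(candidates []candidate, consolidateAfter time.Duration, now time.Time) []candidate {
	var out []candidate
	for _, c := range candidates {
		if now.Sub(c.lastPodEvent) >= consolidateAfter {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	now := time.Now()
	nodes := []candidate{
		{name: "node-a", lastPodEvent: now.Add(-30 * time.Second)},
		{name: "node-b", lastPodEvent: now.Add(-5 * time.Minute)},
	}
	// With consolidateAfter: 1m, only node-b is considered for consolidation.
	for _, c := range filterByConsolidateAfter(nodes, time.Minute, now) {
		fmt.Println("consolidation candidate:", c.name)
	}
}
```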
Let us know what you think of this approach.
@njtran wrote:
Alternatively, the approach we're now running with for v1 is using consolidateAfter to filter out nodes for consolidation. When you set consolidationPolicy: WhenUnderutilized and (for example) set consolidateAfter: 1m, Karpenter will not consolidate nodes that have had a pod scheduled or removed in the last minute. To achieve the same Consolidation behavior prior to v1, you can set consolidateAfter: 0s.
I don't know why you would want pod removals to reset the timer.
I'm thinking more along the lines of:
- When a Node appears underutilized, label it with the current timestamp and taint it with a PreferNoSchedule taint.
- Continue to check the Node at whatever interval you are using for consolidation checking.
- If the Node is no longer underutilized at the time you check it, remove the taint and label.
- If the labeled timestamp + consolidateAfter is in the past, and the Node is still underutilized (and no Pods are annotated "do-not-disable"), shut it down.

I don't think the case of a pod being scheduled and then deleted between checks is worth worrying about. If it is too expensive to calculate whether the Node is underutilized at each check, find some way to cache or estimate it, or just give up and only check at timestamp + consolidateAfter. A rough sketch of this flow follows.
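Here is a minimal Go sketch of that proposed flow, assuming a single-node reconcile pass. The label key, taint key, and the isUnderutilized callback are hypothetical placeholders, not Karpenter APIs, and persisting the Node update is omitted.

```go
package main

import (
	"fmt"
	"strconv"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Placeholder keys; not real Karpenter labels or taints.
const (
	underutilizedSinceLabel = "example.com/underutilized-since"
	underutilizedTaintKey   = "example.com/underutilized"
)

// reconcile applies the proposed flow to a single Node and reports whether the
// Node should be shut down on this pass.
func reconcile(node *corev1.Node, consolidateAfter time.Duration, now time.Time, isUnderutilized func(*corev1.Node) bool) bool {
	since, marked := node.Labels[underutilizedSinceLabel]
	switch {
	case !isUnderutilized(node):
		// No longer underutilized: remove the taint and label.
		delete(node.Labels, underutilizedSinceLabel)
		node.Spec.Taints = removeTaint(node.Spec.Taints, underutilizedTaintKey)
	case !marked:
		// First observation: record the timestamp and taint with PreferNoSchedule.
		if node.Labels == nil {
			node.Labels = map[string]string{}
		}
		node.Labels[underutilizedSinceLabel] = strconv.FormatInt(now.Unix(), 10)
		node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
			Key:    underutilizedTaintKey,
			Effect: corev1.TaintEffectPreferNoSchedule,
		})
	default:
		// Still underutilized: shut it down once labeled timestamp + consolidateAfter has passed.
		if sec, err := strconv.ParseInt(since, 10, 64); err == nil && now.After(time.Unix(sec, 0).Add(consolidateAfter)) {
			return true
		}
	}
	return false
}

func removeTaint(taints []corev1.Taint, key string) []corev1.Taint {
	var out []corev1.Taint
	for _, t := range taints {
		if t.Key != key {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	node := &corev1.Node{ObjectMeta: metav1.ObjectMeta{Name: "node-a"}}
	now := time.Now()
	alwaysUnderutilized := func(*corev1.Node) bool { return true }
	reconcile(node, 10*time.Minute, now, alwaysUnderutilized)                                  // first pass: label + taint
	fmt.Println(reconcile(node, 10*time.Minute, now.Add(11*time.Minute), alwaysUnderutilized)) // true: past consolidateAfter
}
```

Note that in this sketch the timestamp is stored as Unix seconds so it remains a valid label value.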
I'd rather consider node changes than pod changes.
I don't want too much churn in my cluster too quickly: if a node has been added in the last ten minutes, don't remove any node.
The reason for this is that we have some applications with pods so large that they regularly trigger a new node; if the consolidation is too keen, we get multiple node creations and deletions in a ten-minute period.
If we can say not to remove a node within ten minutes of node creation, that problem mostly goes away.
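A tiny Go sketch of that rule; there is no such Karpenter setting, so the helper name and cooldown window here are purely illustrative.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// anyNodeYoungerThan reports whether any node in the pool was created within
// the given window; a caller could skip consolidation entirely while this is true.
func anyNodeYoungerThan(nodes []corev1.Node, window time.Duration, now time.Time) bool {
	for _, n := range nodes {
		if now.Sub(n.CreationTimestamp.Time) < window {
			return true
		}
	}
	return false
}

func main() {
	now := time.Now()
	nodes := []corev1.Node{
		{ObjectMeta: metav1.ObjectMeta{CreationTimestamp: metav1.NewTime(now.Add(-3 * time.Minute))}},
	}
	// A node was added 3 minutes ago, so with a 10-minute window nothing would be removed.
	fmt.Println(anyNodeYoungerThan(nodes, 10*time.Minute, now)) // true
}
```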
When a Node appears underutilized
Hey @Nuru, this is one of the toughest things to compute with the approach in this PR. Finding if a node is "underutilized" in Karpenter semantics really means "was this one of the first nodes that we could reschedule all the pods on the node elsewhere in the cluster and save on cost." For small clusters, this can be computed in the order of milliseconds, but in large clusters this can take on the order of minutes, maybe even longer.
@willthames @Nuru A thought I had was that reasoning about whether a node is underutilized comes down to when pods schedule/deschedule. If I had a pod just recently scheduled to my node, I may want to consider it utilized, rather than continually churn my pod. If I had a pod recently evicted from my node, I may want to have some leeway before Karpenter considers it underutilized.
@njtran wrote:
When a Node appears underutilized
Hey @Nuru, this is one of the toughest things to compute with the approach in this PR. Finding if a node is "underutilized" in Karpenter semantics really means "was this one of the first nodes that we could reschedule all the pods on the node elsewhere in the cluster and save on cost." For small clusters, this can be computed in the order of milliseconds, but in large clusters this can take on the order of minutes, maybe even longer.
OK, I covered this above.
You are going to have to compute "underutilized" at some point, or else you will never get anywhere. (If you are planning to get rid of "WhenUnderutilized" altogether, that is a different discussion.) Bin packing is NP-hard, so you probably are going with an estimate or heuristic to begin with (and if not, you should be), and you will want to come up with a similarly efficient short-cut to computing "WhenUnderutilized", which you can then verify in polynomial time so you don't accidentally scale down a Node that would have to be replaced.
But if you don't want to compute it every time, just cordon (with PreferNoSchedule) the Node the first time you find it to be underutilized. Then you can just leave it like that until "ConsolidateAfter" or a scale-up event, whichever comes first. If a scale-up event happens, then by definition none of the existing nodes are underutilized, so clear the cordons and timers. Otherwise, recheck the cordoned node at "ConsolidateAfter" time. If it is still underutilized at that time, delete the Node.
In this algorithm, you only have to determine if a Node is underutilized twice. If that is still too expensive, you can just record how utilized the Node is (how much CPU and Memory have been allocated) when you first cordon it, then at "ConsolidateAfter" time, verify that no more CPU or Memory has been allocated (indicating that PreferNoSchedule has been honored) and no other Nodes have been removed by the cluster.
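A minimal Go sketch of that cheap recheck, assuming allocation is approximated by summing container resource requests; the snapshot type and both functions are illustrative, not Karpenter code.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// allocationSnapshot records how much CPU and memory is requested on a node.
type allocationSnapshot struct {
	cpu    resource.Quantity
	memory resource.Quantity
}

// snapshot sums the container resource requests of the pods on the node.
func snapshot(pods []corev1.Pod) allocationSnapshot {
	var s allocationSnapshot
	for _, p := range pods {
		for _, c := range p.Spec.Containers {
			s.cpu.Add(*c.Resources.Requests.Cpu())
			s.memory.Add(*c.Resources.Requests.Memory())
		}
	}
	return s
}

// stillConsolidatable reports whether nothing more has been allocated since the
// node was cordoned (i.e. PreferNoSchedule was honored) and no other node has
// been removed from the cluster in the meantime.
func stillConsolidatable(atCordon, now allocationSnapshot, nodesAtCordon, nodesNow int) bool {
	return now.cpu.Cmp(atCordon.cpu) <= 0 &&
		now.memory.Cmp(atCordon.memory) <= 0 &&
		nodesNow >= nodesAtCordon
}

func main() {
	atCordon := allocationSnapshot{cpu: resource.MustParse("500m"), memory: resource.MustParse("1Gi")}
	atCheck := allocationSnapshot{cpu: resource.MustParse("500m"), memory: resource.MustParse("1Gi")}
	fmt.Println(stillConsolidatable(atCordon, atCheck, 5, 5)) // true: safe to delete the Node
}
```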