Exponential / logarithmic decay for cluster desired size
Tell us about your request
When Karpenter is running more node capacity than the cluster requires, use an exponential decay (i.e., something with a half-life) rather than dropping desired capacity instantly.
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
As a cluster operator, when my workloads scale in on my cluster, I want to preserve capacity so that short-term drops in workload scale don't disrupt service.
I'm suggesting exponential decay because it's easy to implement with two fields (e.g., in the .status of each Provisioner):
- the most recent, post-decay, value
- a timestamp for that value, either with subsecond precision or with the value scaled to match a timestamp at the beginning of a second
With some fairly simple math, you can then evaluate the decayed value for any subsequent instant. You can write it back into the status (e.g., using a JSON Patch), and you can act on it as well.
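To make the two-field idea concrete, here's a minimal sketch (in Go, since that's Karpenter's language) of evaluating the decayed value at an arbitrary instant. The function and parameter names, and the configurable half-life, are illustrative assumptions rather than an existing API:

```go
package example

import (
	"math"
	"time"
)

// decayedValue returns the exponentially decayed desired size at `now`,
// given the most recent post-decay value and the timestamp it was recorded at:
//
//	value(t) = lastValue * 0.5^(elapsed / halfLife)
func decayedValue(lastValue float64, lastUpdated time.Time, halfLife time.Duration, now time.Time) float64 {
	elapsed := now.Sub(lastUpdated)
	if elapsed <= 0 || halfLife <= 0 {
		return lastValue
	}
	return lastValue * math.Pow(0.5, elapsed.Seconds()/halfLife.Seconds())
}

// Example: a value of 576 vCPUs recorded 10 minutes ago, with a 30-minute
// half-life, evaluates to roughly 457 vCPUs now.
```

Expressing the decay rate as a half-life keeps the setting intuitive: after one half-life the surplus halves, after two it quarters, and so on.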
This might better support:
- gradual scale-down onto different nodes. For example: a cluster is running on 3 × 192-CPU metal instances, with a 560-CPU desired size and memory not a constraint. A bunch of those Pods stop running and time passes. With a small decay constant set, the desired size decays slowly to 507 CPUs; Karpenter predicts / observes the slow decay and replaces one of the 192-CPU instances with two smaller 64-CPU instances.
- holding capacity over brief breaks, e.g. lunch. A workload's utilization in some region drops to nearly 0 over lunch, but its Pods take 2 to 3 minutes to initialize. Pod-level predictive autoscaling already accounts for this, but there is a 10-to-20-minute period where Pods do scale in due to low utilization. After lunch, the load on the cluster is usually lower than in the morning, but is still quite high. The cluster operator would like to run preemptible batch work using the spare capacity, ready to be replaced by workload Pods when the lunch break is nearly over, and sets a decay constant to avoid too much node-level scale-in.
- live event capacity. A workload supports a live event. Queue processing runs in Kubernetes and is scaled to a high level for the event itself. After the event the site remains popular, but with bursty load. A cluster operator would like to save money through consolidation and also wants to turn off unused nodes promptly to save on costs. However, turning nodes off too quickly turns out to have its own cost implications: additional nodes get launched to replace kubelets on instances that are only just starting their shutdown process.
Alternative
Rather than exponential decay, use another function such as logarithmic decay. That would hold the instance count for a duration and then let it drop off. That might better fit cases where cluster operators want to minimize instance terminations.
Are you currently working around this issue?
(e.g.)
scaleDown policies on HorizontalPodAutoscaler. However, these affect single workloads; a correlated scale-in could still take away node capacity that I, as a cluster operator, know will take time to re-provision if needed.
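For comparison, this is roughly what that per-workload workaround looks like, expressed with the autoscaling/v2 Go types; the stabilization window and percentage are illustrative values, not a recommendation:

```go
package example

import (
	autoscalingv2 "k8s.io/api/autoscaling/v2"
	"k8s.io/utils/ptr"
)

// scaleDownBehavior builds an HPA behavior block that slows scale-in for one
// workload: wait before acting on lower recommendations, then remove at most
// a small fraction of replicas per minute.
func scaleDownBehavior() *autoscalingv2.HorizontalPodAutoscalerBehavior {
	return &autoscalingv2.HorizontalPodAutoscalerBehavior{
		ScaleDown: &autoscalingv2.HPAScalingRules{
			// Wait 10 minutes before acting on a lower recommendation.
			StabilizationWindowSeconds: ptr.To[int32](600),
			Policies: []autoscalingv2.HPAScalingPolicy{
				// Remove at most 10% of the replicas per minute.
				{Type: autoscalingv2.PercentScalingPolicy, Value: 10, PeriodSeconds: 60},
			},
		},
	}
}
```

This goes into a single HorizontalPodAutoscaler's spec.behavior, so as noted above it cannot express a cluster-wide, node-level preference.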
Additional Context
Also see https://kubernetes.slack.com/archives/C02SFFZSA2K/p1685980025031979?thread_ts=1685960637.488689&cid=C02SFFZSA2K
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
https://github.com/aws/karpenter-core/issues/735 adds a user story relevant to this: minimizing the AWS Config costs from frequent provisioning / termination cycles for EC2 instances.
Thinking about this from the perspective of disruption budgets: could this be implemented by a budget with a percentage?
Let's say I had 1000 nodes in my cluster, and let's say they're all empty, meaning that the desired state would be to scale to 0. With a disruption budget of 10%, you could achieve the same exponential decay by effectively scaling down the cluster in progressively smaller batches, eventually scaling down to 0.
1000 (-100) -> 900 (-90) -> 810 (-81) -> 729 (-73) -> 656 (-66) -> 590 -> ... -> 0
This effectively solves the problem of exponential decay, in my eyes. @sftim thoughts?
One consideration is that this drifts from perfectly exponential the more heterogeneous the instance sizes are. Yet the really nice part is that this effectively gets solved for free by a design/implementation that is already in progress.
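Here's a quick sketch of why a fixed percentage budget per consolidation round gives this geometric (exponential-style) decay; the 10% figure and the rounding behaviour are illustrative assumptions:

```go
package example

import "math"

// scaleDownSteps returns the node count after each consolidation round when
// at most budgetFraction of the *current* nodes may be disrupted per round.
func scaleDownSteps(nodes int, budgetFraction float64) []int {
	steps := []int{nodes}
	for nodes > 0 {
		remove := int(math.Round(float64(nodes) * budgetFraction))
		if remove == 0 {
			remove = 1 // keep making progress once the batch rounds down to zero
		}
		nodes -= remove
		steps = append(steps, nodes)
	}
	return steps
}

// scaleDownSteps(1000, 0.10) → 1000, 900, 810, 729, 656, 590, ..., 0
// i.e. roughly N·0.9^k after k rounds, matching the sequence above.
```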
There are two shapes for decay. For scale-in, these are:
- big steps first, then smaller and smaller steps (exponential)
- small reductions at first, then bigger and bigger steps (logarithmic)
I actually think the second case is more relevant. People want to keep nodes around in case the load comes back, but eventually they still want their monthly bill to go down.
On the node size thing, we could implement this where you specify the dimension you care about. For example, decay the total vCPU count for a NodePool. Or the node count, or the memory total. Maybe even the Pod capacity?
/retitle Exponential / logarithmic decay for cluster desired size
If we plan to implement just one of these, that could turn into a separate, more specific issue.
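For illustration only, the "pick the dimension you care about" idea could look something like this as Go API types; none of these fields exist in Karpenter today, and every name here (DecaySpec, Dimension, HalfLife) is hypothetical:

```go
package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// DecayDimension selects which aggregate quantity the decay applies to.
type DecayDimension string

const (
	DecayDimensionCPU    DecayDimension = "cpu"    // total vCPUs in the NodePool
	DecayDimensionMemory DecayDimension = "memory" // total memory
	DecayDimensionNodes  DecayDimension = "nodes"  // node count
	DecayDimensionPods   DecayDimension = "pods"   // total Pod capacity
)

// DecaySpec configures exponential decay of a NodePool's desired size.
type DecaySpec struct {
	// Dimension is the quantity that decays, e.g. "cpu" or "nodes".
	Dimension DecayDimension `json:"dimension"`
	// HalfLife is the time for the surplus capacity to halve.
	HalfLife metav1.Duration `json:"halfLife"`
}
```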
> On the node size thing, we could implement this where you specify the dimension you care about. For example, decay the total vCPU count for a NodePool. Or the node count, or the memory total. Maybe even the Pod capacity?
This totally makes sense. There was some feedback that DisruptionBudgets should refer to more than just nodes, which seems super similar to this request.
> - big steps first, then smaller and smaller steps (exponential)
> - small reductions at first, then bigger and bigger steps (logarithmic)
I understand the use case for doing big steps first with progressively smaller steps, and that's naturally implemented with budgets.
What's the use case for doing smaller steps with progressively larger steps? That sounds like it would be something like doing 1000 -> 999 -> 997 -> 993 -> 985 -> 969 -> 937 -> 873 -> 745 -> 489 -> 0. While not impossible, I think this would be harder to model, since you have to be aware of previous steps to know the next step.
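To illustrate the statefulness point, here's a sketch of the increasing-step shape; the doubling step size is an assumption chosen only to reproduce the sequence above, not anything specified for Karpenter:

```go
package example

// logarithmicSteps doubles the amount removed each round (small reductions
// first, then bigger and bigger). Unlike a fixed-percentage budget, the next
// step depends on how many rounds have already happened, which is the
// statefulness mentioned above.
func logarithmicSteps(nodes int) []int {
	out := []int{nodes}
	step := 1
	for nodes > 0 {
		if step > nodes {
			step = nodes // don't overshoot zero on the final round
		}
		nodes -= step
		out = append(out, nodes)
		step *= 2
	}
	return out
}

// logarithmicSteps(1000) → 1000, 999, 997, 993, 985, 969, 937, 873, 745, 489, 0
```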
Let's do the simpler thing then, with exponential decay.
> scaling down the cluster in progressively smaller batches, eventually scaling down to 0
I do think it's nicer to scale in without the jaggedness this implies. Each time the desired size drops below the actual (integer) count of nodes, I think a cluster operator would hope to see a drain happening, and eventually an instance termination.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to this:
/close not-planned
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.