Vertical resizer for Karpenter
Tell us about your request
My team just started using Karpenter in our clusters. One thing missing from the Karpenter deployment is the nanny (addon-resizer) that we used to have with Cluster Autoscaler. This is a bit inconvenient because our clusters grow over time as more services are deployed, so we have to monitor Karpenter's memory usage and manually bump up its resource requests. Do we already have something we can use for Karpenter, or is it on the roadmap?
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
Stated in the request.
Are you currently working around this issue?
Manually bumping up the resource requests for now.
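For illustration, the manual bump typically looks like overriding the controller's resources in the Helm chart values; this is only a sketch, and the controller.resources key is assumed to match the chart version in use:

```yaml
# Sketch of a Helm values override for the Karpenter chart.
# The controller.resources key is assumed from the chart; numbers are
# placeholders, not recommendations.
controller:
  resources:
    requests:
      cpu: "1"
      memory: 2Gi
    limits:
      memory: 2Gi
```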
Additional Context
No response
Attachments
No response
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
How does this work for CAS?
This is what we use: https://github.com/kubernetes/autoscaler/blob/master/addon-resizer/README.md
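For anyone unfamiliar: addon-resizer runs as a "nanny" sidecar that watches the cluster size and adjusts its target container's resources proportionally. A rough, untested sketch of what adding it to the Karpenter deployment's containers list might look like (image tag, flag values, and container/deployment names are placeholders, not a verified setup):

```yaml
# Hypothetical nanny sidecar entry under the Karpenter deployment's
# spec.template.spec.containers. Values are placeholders and this has
# not been validated with Karpenter.
- name: karpenter-nanny
  image: registry.k8s.io/autoscaling/addon-resizer:1.8.14
  command:
    - /pod_nanny
    - --container=controller   # container whose resources the nanny adjusts
    - --deployment=karpenter   # deployment to patch
    - --cpu=200m               # base CPU request
    - --extra-cpu=2m           # additional CPU per node
    - --memory=500Mi           # base memory request
    - --extra-memory=2Mi       # additional memory per node
  env:
    - name: MY_POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: MY_POD_NAMESPACE
      valueFrom:
        fieldRef:
          fieldPath: metadata.namespace
```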
Have you tried using this with Karpenter? I'm not aware of anyone trying this yet.
Yeah I can give it a try. Was just wondering if there's a custom one made for Karpenter.
Not yet, but looking forward to what you learn.
Ellis, while I have you on the thread, could you give a little insight into what causes Karpenter to use more CPU and memory? Do they grow proportionally with the number of nodes and pods in the cluster?
#pods and #nodes definitely contribute, and consolidation adds to it. We haven't profiled a ton, but you can enable a flag to turn on profiling: https://karpenter.sh/v0.22.0/concepts/settings/
To add to this, we have been testing Karpenter 0.16.0 for a little over a month, and in some of our clusters we are observing what looks like a memory leak in Karpenter's controller container.
For example, in one of our clusters the Karpenter controller container begins its life with a memory footprint (cAdvisor metric container_memory_usage_bytes) of about 600 MB.
After 30 days of activity, it's at 1.67 GB.
The number of nodes in the cluster remains relatively stable over time, but the memory footprint of Karpenter controller shows steady growth throughout the month.
From a functional perspective, Karpenter seems to be working as expected - nodes are provisioned and deprovisioned often as a result of active consolidation.
For what it's worth, the memory footprint of the webhook container remains steady at around 25 MB.
Wow thanks for the report -- we'll dig into this.
Do you mind cutting a new issue detailing your observations?
Done: aws/karpenter#3209
There is also https://github.com/kubernetes-sigs/cluster-proportional-autoscaler which, I think, might help.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
We are also looking for a way to scale Karpenter under heavy usage.
We have a cluster that scales from ~10 nodes to ~500 nodes, and Karpenter's memory usage grows to ~5 GB.
There's also VPA but I've never used it https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler
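For reference, a minimal VPA sketch pointed at the Karpenter controller deployment might look like this (assuming the VPA CRDs and components are installed; names, namespace, and bounds are illustrative only):

```yaml
# Hypothetical VerticalPodAutoscaler for the Karpenter controller deployment.
# Assumes the VPA components are installed; names and bounds are placeholders.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: karpenter
  namespace: karpenter
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: karpenter
  updatePolicy:
    updateMode: "Auto"   # VPA evicts pods to apply the new requests
  resourcePolicy:
    containerPolicies:
      - containerName: controller
        minAllowed:
          memory: 512Mi
        maxAllowed:
          memory: 8Gi
```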
Another approach that could be implemented in Karpenter is horizontal scaling with sharding (each instance of Karpenter would be responsible for a subset of nodes/pods). I've found a controller using a similar approach: https://kubevela.io/docs/v1.7/platform-engineers/system-operation/controller-sharding/ However, that seems like a huge amount of work.
There is also https://github.com/kubernetes-sigs/cluster-proportional-autoscaler which, I think, might help.
https://github.com/kubernetes-sigs/karpenter/issues/733#issuecomment-1790110138
Hi @sftim,
It seems that cluster-proportional-autoscaler scales the number of replicas.
If I'm not wrong, only increasing the number of replicas would be useless for Karpenter, as we need to scale vertically.
vertical-pod-autoscaler and addon-resizer do scale vertically (CPU & memory).
Sorry, I was thinking of https://github.com/kubernetes-sigs/cluster-proportional-vertical-autoscaler
As there's already a vertical autoscaler that seems to fit this use case: /priority awaiting-more-evidence
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.