
Vertical resizer for Karpenter

Open easildur24 opened this issue 2 years ago • 21 comments

Tell us about your request

My team just started using Karpenter in our clusters. One thing missing from the Karpenter deployment is the nanny we used to have with Cluster Autoscaler. This is a bit inconvenient: our clusters grow over time as more services are deployed, so we have to monitor Karpenter's memory usage and manually bump up its resource requests. Do we already have something we can use for Karpenter, or is it on the roadmap?

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Stated in the request.

Are you currently working around this issue?

Manually bumping up the resource requests for now.

Additional Context

No response

Attachments

No response

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

easildur24 avatar Jan 10 '23 22:01 easildur24

How does this work for CAS?

ellistarn avatar Jan 10 '23 23:01 ellistarn

How does this work for CAS?

This is what we use: https://github.com/kubernetes/autoscaler/blob/master/addon-resizer/README.md
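For anyone unfamiliar with the nanny pattern: addon-resizer runs as a sidecar that watches the node count and resizes the target container's requests proportionally. A minimal sketch (untested) of attaching it to the Karpenter controller Deployment with the Python kubernetes client; the namespace, image tag, and flag values are assumptions, and the flag names should be checked against the README linked above (RBAC and downward-API env vars from the README are omitted here):

```python
# Sketch: add the addon-resizer "nanny" sidecar to the Karpenter controller
# Deployment, mirroring the pattern used with Cluster Autoscaler.
# Assumptions: Karpenter is deployed as Deployment "karpenter" in namespace
# "karpenter" with a container named "controller"; image tag and flag values
# are illustrative only.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

nanny = {
    "name": "karpenter-nanny",
    "image": "registry.k8s.io/autoscaling/addon-resizer:1.8.20",  # assumed tag
    "command": [
        "/pod_nanny",
        "--deployment=karpenter",
        "--container=controller",   # container whose requests get resized
        "--cpu=1000m",               # base CPU request
        "--extra-cpu=2m",            # extra CPU per node
        "--memory=600Mi",            # base memory request
        "--extra-memory=1Mi",        # extra memory per node
    ],
    "resources": {"requests": {"cpu": "50m", "memory": "50Mi"}},
}

# Strategic-merge patch: containers merge by name, so only the sidecar is added.
patch = {"spec": {"template": {"spec": {"containers": [nanny]}}}}
apps.patch_namespaced_deployment(name="karpenter", namespace="karpenter", body=patch)
```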

easildur24 avatar Jan 10 '23 23:01 easildur24

Have you tried using this with Karpenter? I'm not aware of anyone trying this yet.

ellistarn avatar Jan 10 '23 23:01 ellistarn

Yeah I can give it a try. Was just wondering if there's a custom one made for Karpenter.

easildur24 avatar Jan 10 '23 23:01 easildur24

Not yet, but looking forward to what you learn.

ellistarn avatar Jan 11 '23 01:01 ellistarn

Ellis, while I have you on the thread, could you give a little insight into what causes Karpenter to use more CPU and memory? Do they grow proportionally with the number of nodes and pods in the cluster?

easildur24 avatar Jan 11 '23 01:01 easildur24

The number of pods and nodes definitely contributes, and consolidation adds to it. We haven't profiled a ton, but you can enable a flag to turn on profiling: https://karpenter.sh/v0.22.0/concepts/settings/
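For reference, once profiling is enabled, Karpenter (a Go controller) serves the standard net/http/pprof endpoints, so heap profiles can be captured and diffed over time. A minimal sketch, assuming the pprof handlers are exposed on the metrics port (8000 here; adjust to your configuration) and a port-forward to the controller is running:

```python
# Sketch: grab a heap profile from a port-forwarded Karpenter controller.
# Assumption: `kubectl port-forward deploy/karpenter 8000:8000 -n karpenter`
# is running and profiling has been enabled in Karpenter's settings.
# /debug/pprof/heap is the standard Go pprof path.
import datetime
import requests

resp = requests.get("http://localhost:8000/debug/pprof/heap", timeout=30)
resp.raise_for_status()

# Save a timestamped snapshot; compare snapshots later with
# `go tool pprof -diff_base <old> <new>` to see where memory is growing.
filename = f"karpenter-heap-{datetime.datetime.utcnow():%Y%m%dT%H%M%SZ}.pb.gz"
with open(filename, "wb") as f:
    f.write(resp.content)
print(f"wrote {filename}")
```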

ellistarn avatar Jan 11 '23 06:01 ellistarn

To add to this, we have been testing Karpenter 0.16.0 for a little over a month, and in some of our clusters we are observing what looks like a memory leak in Karpenter's controller container.

For example, in one of our clusters, the Karpenter controller container begins its life with a memory footprint (cAdvisor metric container_memory_usage_bytes) of about 600 MB. After 30 days of activity, it's at 1.67 GB.

The number of nodes in the cluster remains relatively stable over time, but the memory footprint of Karpenter controller shows steady growth throughout the month.

From a functional perspective, Karpenter seems to be working as expected: nodes are provisioned and deprovisioned frequently as a result of active consolidation.

For what it's worth, the memory footprint of the webhook container remains steady at around 25 MB.
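For anyone who wants to check for the same trend, a minimal sketch of pulling the same container_memory_usage_bytes series out of Prometheus for the last 30 days; the Prometheus address and label selectors are assumptions, so adjust them to your setup:

```python
# Sketch: query Prometheus for the Karpenter controller's memory usage over
# the last 30 days and print the first and last samples per pod.
# Assumptions: a Prometheus server scraping cAdvisor metrics is reachable at
# PROM_URL, and the pod/container labels below match your deployment.
import time
import requests

PROM_URL = "http://prometheus.monitoring.svc:9090"  # assumed address
QUERY = 'container_memory_usage_bytes{container="controller", pod=~"karpenter-.*"}'

end = time.time()
start = end - 30 * 24 * 3600  # last 30 days

resp = requests.get(
    f"{PROM_URL}/api/v1/query_range",
    params={"query": QUERY, "start": start, "end": end, "step": "1h"},
    timeout=30,
)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    values = [float(v) for _, v in series["values"]]
    print(series["metric"].get("pod"),
          f"first={values[0] / 2**20:.0f} MiB, last={values[-1] / 2**20:.0f} MiB")
```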

vassilvk avatar Jan 11 '23 14:01 vassilvk

Wow thanks for the report -- we'll dig into this.

ellistarn avatar Jan 11 '23 18:01 ellistarn

Do you mind cutting a new issue detailing your observations?

ellistarn avatar Jan 11 '23 18:01 ellistarn

Done: aws/karpenter#3209

vassilvk avatar Jan 12 '23 20:01 vassilvk

There is also https://github.com/kubernetes-sigs/cluster-proportional-autoscaler which, I think, might help.

sftim avatar Jun 05 '23 17:06 sftim

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 31 '24 18:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Mar 01 '24 19:03 k8s-triage-robot

/remove-lifecycle rotten

We are also looking for a way to scale Karpenter under heavy usage.

We have a cluster that scales from ~10 nodes to ~500 nodes, and Karpenter's memory usage grows to ~5 GB.

There's also the VPA, but I've never used it: https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler
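For illustration, a minimal sketch (untested) of what a VerticalPodAutoscaler targeting the Karpenter controller could look like, created with the Python kubernetes client. The namespace, resource bounds, and update mode are assumptions, and the VPA components (recommender, updater, admission controller) must already be installed in the cluster:

```python
# Sketch: create a VPA object that resizes only the "controller" container of
# the Karpenter Deployment. Namespace, bounds, and update mode are assumptions.
from kubernetes import client, config

config.load_kube_config()
custom = client.CustomObjectsApi()

vpa = {
    "apiVersion": "autoscaling.k8s.io/v1",
    "kind": "VerticalPodAutoscaler",
    "metadata": {"name": "karpenter", "namespace": "karpenter"},
    "spec": {
        "targetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": "karpenter"},
        "updatePolicy": {"updateMode": "Auto"},  # use "Off" to only get recommendations
        "resourcePolicy": {
            "containerPolicies": [
                {
                    "containerName": "controller",
                    "minAllowed": {"cpu": "500m", "memory": "512Mi"},
                    "maxAllowed": {"cpu": "4", "memory": "8Gi"},
                }
            ]
        },
    },
}

custom.create_namespaced_custom_object(
    group="autoscaling.k8s.io",
    version="v1",
    namespace="karpenter",
    plural="verticalpodautoscalers",
    body=vpa,
)
```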

Another approach that could be implemented in Karpenter is horizontal scaling with sharding (each instance of Karpenter would be responsible for a subset of nodes/pods). I've found a controller using a similar approach: https://kubevela.io/docs/v1.7/platform-engineers/system-operation/controller-sharding/ However, that seems like a huge amount of work.

gnuletik avatar Mar 27 '24 10:03 gnuletik

There is also https://github.com/kubernetes-sigs/cluster-proportional-autoscaler which, I think, might help.

https://github.com/kubernetes-sigs/karpenter/issues/733#issuecomment-1790110138

sftim avatar Mar 27 '24 10:03 sftim

Hi @sftim, it seems that cluster-proportional-autoscaler scales the number of replicas. If I'm not mistaken, increasing only the number of replicas wouldn't help Karpenter, since we need to scale it vertically.

vertical-pod-autoscaler and addon-resizer do scale vertically (CPU and memory).

gnuletik avatar Mar 27 '24 10:03 gnuletik

Sorry, I was thinking of https://github.com/kubernetes-sigs/cluster-proportional-vertical-autoscaler

sftim avatar Mar 27 '24 12:03 sftim

As there's already a vertical autoscaler that seems to fit this use case: /priority awaiting-more-evidence

sftim avatar Mar 27 '24 12:03 sftim

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 25 '24 13:06 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jul 25 '24 13:07 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Aug 24 '24 14:08 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Aug 24 '24 14:08 k8s-ci-robot