
Vertical resizer for Karpenter

Open easildur24 opened this issue 2 years ago • 21 comments

Tell us about your request

My team just started using Karpenter in our clusters. One thing missing from the Karpenter deployment is the nanny we used to have with Cluster Autoscaler. This is a bit inconvenient: our clusters grow over time as more services are deployed, so we have to monitor Karpenter's memory usage and manually bump up its resource requests. Do we already have something we can use for Karpenter, or is it on the roadmap?

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Stated in the request.

Are you currently working around this issue?

Manually bumping up the resource requests for now.

Additional Context

No response

Attachments

No response

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

easildur24 avatar Jan 10 '23 22:01 easildur24

How does this work for CAS?

ellistarn avatar Jan 10 '23 23:01 ellistarn

How does this work for CAS?

This is what we use: https://github.com/kubernetes/autoscaler/blob/master/addon-resizer/README.md
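For anyone unfamiliar with the nanny pattern: addon-resizer runs as a sidecar that watches the node count and resizes the target container's requests proportionally. A minimal sketch (untested) of attaching it to the Karpenter controller Deployment with the Python kubernetes client; the namespace, image tag, and flag values are assumptions, and the flag names should be checked against the README linked above (RBAC and downward-API env vars from the README are omitted here):

```python
# Sketch: add the addon-resizer "nanny" sidecar to the Karpenter controller
# Deployment, mirroring the pattern used with Cluster Autoscaler.
# Assumptions: Karpenter is deployed as Deployment "karpenter" in namespace
# "karpenter" with a container named "controller"; image tag and flag values
# are illustrative only.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

nanny = {
    "name": "karpenter-nanny",
    "image": "registry.k8s.io/autoscaling/addon-resizer:1.8.20",  # assumed tag
    "command": [
        "/pod_nanny",
        "--deployment=karpenter",
        "--container=controller",   # container whose requests get resized
        "--cpu=1000m",               # base CPU request
        "--extra-cpu=2m",            # extra CPU per node
        "--memory=600Mi",            # base memory request
        "--extra-memory=1Mi",        # extra memory per node
    ],
    "resources": {"requests": {"cpu": "50m", "memory": "50Mi"}},
}

# Strategic-merge patch: containers merge by name, so only the sidecar is added.
patch = {"spec": {"template": {"spec": {"containers": [nanny]}}}}
apps.patch_namespaced_deployment(name="karpenter", namespace="karpenter", body=patch)
```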

easildur24 avatar Jan 10 '23 23:01 easildur24

Have you tried using this with Karpenter? I'm not aware of anyone trying this yet.

ellistarn avatar Jan 10 '23 23:01 ellistarn

Yeah I can give it a try. Was just wondering if there's a custom one made for Karpenter.

easildur24 avatar Jan 10 '23 23:01 easildur24

Not yet, but looking forward to what you learn.

ellistarn avatar Jan 11 '23 01:01 ellistarn

Ellis, while I have you on the thread, could you give a little insight into what causes Karpenter to use more CPU and memory? Do they grow proportionally with the number of nodes and pods in the cluster?

easildur24 avatar Jan 11 '23 01:01 easildur24

The number of pods and nodes definitely contributes, and consolidation adds to it. We haven't profiled a ton, but you can enable a flag to turn on profiling: https://karpenter.sh/v0.22.0/concepts/settings/
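For reference, once profiling is enabled, Karpenter (a Go controller) serves the standard net/http/pprof endpoints, so heap profiles can be captured and diffed over time. A minimal sketch, assuming the pprof handlers are exposed on the metrics port (8000 here; adjust to your configuration) and a port-forward to the controller is running:

```python
# Sketch: grab a heap profile from a port-forwarded Karpenter controller.
# Assumption: `kubectl port-forward deploy/karpenter 8000:8000 -n karpenter`
# is running and profiling has been enabled in Karpenter's settings.
# /debug/pprof/heap is the standard Go pprof path.
import datetime
import requests

resp = requests.get("http://localhost:8000/debug/pprof/heap", timeout=30)
resp.raise_for_status()

# Save a timestamped snapshot; compare snapshots later with
# `go tool pprof -diff_base <old> <new>` to see where memory is growing.
filename = f"karpenter-heap-{datetime.datetime.utcnow():%Y%m%dT%H%M%SZ}.pb.gz"
with open(filename, "wb") as f:
    f.write(resp.content)
print(f"wrote {filename}")
```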

ellistarn avatar Jan 11 '23 06:01 ellistarn

To add to this, we have been testing Karpenter 0.16.0 for a little over a month, and in some of our clusters we are observing what looks like a memory leak in Karpenter's controller container.

For example, in one of our clusters, the Karpenter controller container begins its life with a memory footprint (cAdvisor metric container_memory_usage_bytes) of about 600 MB. After 30 days of activity, it's at 1.67 GB.

The number of nodes in the cluster remains relatively stable over time, but the memory footprint of Karpenter controller shows steady growth throughout the month.

From a functional perspective, Karpenter seems to be working as expected: nodes are provisioned and deprovisioned frequently as a result of active consolidation.

For what it's worth, the memory footprint of the webhook container remains steady at around 25 MB.
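For anyone who wants to check for the same trend, a minimal sketch of pulling the same container_memory_usage_bytes series out of Prometheus for the last 30 days; the Prometheus address and label selectors are assumptions, so adjust them to your setup:

```python
# Sketch: query Prometheus for the Karpenter controller's memory usage over
# the last 30 days and print the first and last samples per pod.
# Assumptions: a Prometheus server scraping cAdvisor metrics is reachable at
# PROM_URL, and the pod/container labels below match your deployment.
import time
import requests

PROM_URL = "http://prometheus.monitoring.svc:9090"  # assumed address
QUERY = 'container_memory_usage_bytes{container="controller", pod=~"karpenter-.*"}'

end = time.time()
start = end - 30 * 24 * 3600  # last 30 days

resp = requests.get(
    f"{PROM_URL}/api/v1/query_range",
    params={"query": QUERY, "start": start, "end": end, "step": "1h"},
    timeout=30,
)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    values = [float(v) for _, v in series["values"]]
    print(series["metric"].get("pod"),
          f"first={values[0] / 2**20:.0f} MiB, last={values[-1] / 2**20:.0f} MiB")
```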

vassilvk avatar Jan 11 '23 14:01 vassilvk

Wow thanks for the report -- we'll dig into this.

ellistarn avatar Jan 11 '23 18:01 ellistarn

Do you mind cutting a new issue detailing your observations?

ellistarn avatar Jan 11 '23 18:01 ellistarn

Done: aws/karpenter#3209

vassilvk avatar Jan 12 '23 20:01 vassilvk

There is also https://github.com/kubernetes-sigs/cluster-proportional-autoscaler which, I think, might help.

sftim avatar Jun 05 '23 17:06 sftim

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 31 '24 18:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Mar 01 '24 19:03 k8s-triage-robot

/remove-lifecycle rotten

We are also looking for a way to scale Karpenter under heavy usage.

We have a cluster that scales from ~10 nodes to ~500 nodes, and Karpenter's memory usage grows to ~5 GB.

There's also the VPA, but I've never used it: https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler
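For illustration, a minimal sketch (untested) of what a VerticalPodAutoscaler targeting the Karpenter controller could look like, created with the Python kubernetes client. The namespace, resource bounds, and update mode are assumptions, and the VPA components (recommender, updater, admission controller) must already be installed in the cluster:

```python
# Sketch: create a VPA object that resizes only the "controller" container of
# the Karpenter Deployment. Namespace, bounds, and update mode are assumptions.
from kubernetes import client, config

config.load_kube_config()
custom = client.CustomObjectsApi()

vpa = {
    "apiVersion": "autoscaling.k8s.io/v1",
    "kind": "VerticalPodAutoscaler",
    "metadata": {"name": "karpenter", "namespace": "karpenter"},
    "spec": {
        "targetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": "karpenter"},
        "updatePolicy": {"updateMode": "Auto"},  # use "Off" to only get recommendations
        "resourcePolicy": {
            "containerPolicies": [
                {
                    "containerName": "controller",
                    "minAllowed": {"cpu": "500m", "memory": "512Mi"},
                    "maxAllowed": {"cpu": "4", "memory": "8Gi"},
                }
            ]
        },
    },
}

custom.create_namespaced_custom_object(
    group="autoscaling.k8s.io",
    version="v1",
    namespace="karpenter",
    plural="verticalpodautoscalers",
    body=vpa,
)
```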

Another approach that could be implemented in Karpenter is horizontal scaling with sharding (each instance of Karpenter would be responsible for a subset of nodes/pods). I've found a controller using a similar approach: https://kubevela.io/docs/v1.7/platform-engineers/system-operation/controller-sharding/ However, that seems like a huge amount of work.

gnuletik avatar Mar 27 '24 10:03 gnuletik

There is also https://github.com/kubernetes-sigs/cluster-proportional-autoscaler which, I think, might help.

https://github.com/kubernetes-sigs/karpenter/issues/733#issuecomment-1790110138

sftim avatar Mar 27 '24 10:03 sftim

Hi @sftim, it seems that cluster-proportional-autoscaler scales the number of replicas. If I'm not mistaken, increasing only the number of replicas wouldn't help Karpenter, since we need to scale it vertically.

vertical-pod-autoscaler and addon-resizer do scale vertically (CPU and memory).

gnuletik avatar Mar 27 '24 10:03 gnuletik

Sorry, I was thinking of https://github.com/kubernetes-sigs/cluster-proportional-vertical-autoscaler

sftim avatar Mar 27 '24 12:03 sftim

As there's already a vertical autoscaler that seems to fit this use case: /priority awaiting-more-evidence

sftim avatar Mar 27 '24 12:03 sftim

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 25 '24 13:06 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jul 25 '24 13:07 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Aug 24 '24 14:08 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Aug 24 '24 14:08 k8s-ci-robot