
Steady memory leak in VPA recommender

DLakin01 opened this issue on Dec 11, 2023 · 6 comments

Which component are you using?:

vertical-pod-autoscaler, recommender only

What version of the component are you using?:

0.14.0

What k8s version are you using (kubectl version)?:

1.26

What environment is this in?:

AWS EKS, multiple clusters and accounts, multiple types of applications running on the cluster

What did you expect to happen?:

VPA recommender should run at more or less the same memory level throughout the lifetime of a particular pod.

What happened instead?:

There is a steady memory leak that is especially visible over a period of days, as seen in this screen capture from our DataDog dashboard: [screenshot: DataDog memory graph]

The upper lines with the steeper slope are from our large multi-tenant clusters, but the smaller clusters also experience the leak, albeit more slowly. If left alone, memory reaches 200% of requests before the pod gets evicted. The recommender in the largest cluster is tracking 3161 PodStates at the time of writing this issue.

How to reproduce it (as minimally and precisely as possible):

Not sure how reproducible the issue is outside of running VPA in a large cluster with > 3000 pods and waiting several days to see if the memory creeps up.

Anything else we need to know?:

We haven't yet created any VPA objects to generate recommendations; we're waiting until a future sprint to begin rolling those out.

DLakin01 avatar Dec 11 '23 21:12 DLakin01

We also face the same issue. We're on version 0.11 with k8s version 1.24. Below is a Grafana snippet from the last restart: [screenshot: Grafana memory graph]

vkhacharia avatar Feb 26 '24 09:02 vkhacharia

Hey @vkhacharia @DLakin01 thanks for bringing this up!

To some extent, this behavior is expected, and given only these graphs it is hard to tell whether it is normal or not. The recommender keeps metrics for each container, regardless of whether that container is under VPA control. I guess the reasoning is that you get accurate recommendations immediately if you decide to enable VPA for a container at a later point in time. You can switch off this default behavior by enabling memory saver mode, as sketched below.
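
For reference, a minimal sketch of enabling it via kubectl. The deployment name, namespace, and container layout here follow the upstream manifests and may differ in your setup; the patch also assumes the container already has an `args` list (if not, edit the Deployment and add one):

```sh
# Check the current args on the recommender (names assumed from upstream manifests):
kubectl -n kube-system get deploy vpa-recommender \
  -o jsonpath='{.spec.template.spec.containers[0].args}'

# Append --memory-saver=true to the container args (assumes the args array exists):
kubectl -n kube-system patch deploy vpa-recommender --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--memory-saver=true"}]'
```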

Even with memory saver mode enabled, some memory growth is expected, because the recommender only garbage-collects old per-container state after some time. So if you're rolling Pods approximately the same number of times per week, your memory is expected to grow for ~2 weeks before it levels off. If you're adding Containers and don't have memory saver mode enabled, memory will grow with every Container.

If all of those factors are under control and you still see memory growth, then this really is a memory leak that shouldn't happen.

voelzmo avatar Feb 26 '24 10:02 voelzmo

@voelzmo Thanks for the quick response. I wanted to try it right away, but noticed that I am on k8s version 1.24, which is compatible with VPA 0.11, and I don't see the memory-saver parameter in the code on the branch for version 0.11.

vkhacharia avatar Mar 05 '24 07:03 vkhacharia

Hey @vkhacharia, thanks for your efforts! VPA 0.11.0 also has memory saver mode, but the parameter was defined in a different place and was only moved to the section linked above in a later refactoring.

So you can still turn on --memory-saver=true and see what this does for you. Hope that helps!
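
A quick sanity check after the rollout could look like this (again, the label and namespace are assumptions based on the upstream manifests, so adjust them for your install):

```sh
# Confirm the restarted recommender pod is running with the flag:
kubectl -n kube-system get pods -l app=vpa-recommender \
  -o jsonpath='{.items[0].spec.containers[0].args}{"\n"}'

# Keep an eye on memory over the following days (requires metrics-server):
kubectl -n kube-system top pod -l app=vpa-recommender
```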

voelzmo avatar Mar 05 '24 09:03 voelzmo

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 03 '24 09:06 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jul 03 '24 10:07 k8s-triage-robot

/area vertical-pod-autoscaler

adrianmoisey avatar Jul 08 '24 18:07 adrianmoisey

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Aug 07 '24 19:08 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Aug 07 '24 19:08 k8s-ci-robot