cluster-api-provider-azure

Memory leak in the controller

Open · dkoshkin opened this issue 2 years ago · 21 comments

/kind bug


What steps did you take and what happened: Deploy CAPZ v1.3.2 and let it run for some time. The controller leaks memory even when there are no Azure/AKS clusters.

Here is a graph with just a single non-Azure/AKS cluster: [memory usage graph]

A steeper graph when scaling to ~100 clusters (still 0 Azure/AKS clusters): [memory usage graph]

What did you expect to happen: The controller's memory usage should remain stable when there are no Azure/AKS clusters.

Anything else you would like to add: I enabled pprof to capture some info. Here is what the profile looked like when first starting up: [pprof profile screenshot]

And then again after a couple of hours: [pprof profile screenshot]

The profiles pointed us to a potential culprit, a growing slice: https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/392137a94135a6995867d8c7fdfd56fabad719fc/util/tele/span_logger.go#L92

I added some debug logging on the length and capacity of that slice, and after a few hours it held more than 100,000 KV pairs.


A different view in pprof provided some more info on where to look: [pprof profile screenshot]

It doesn't give the exact function, but it hints towards https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/392137a94135a6995867d8c7fdfd56fabad719fc/controllers/azurecluster_controller.go#L88

It looks like the logger is instantiated only once, when SetupWithManager is called from main.go: https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/392137a94135a6995867d8c7fdfd56fabad719fc/controllers/azurecluster_controller.go#L73-L78

This is different from how other providers set up their loggers: they create a new logger instance on every reconcile, so there is no slice of KV pairs that keeps growing across reconciles. A sketch of that per-reconcile pattern follows.

Environment:

  • cluster-api-provider-azure version: v1.3.2
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):

dkoshkin · Sep 07 '22

I thought that maybe the way the KVs are appended to the slice might be causing an issue in Go, and I tried a similar approach to how it's done in https://github.com/kubernetes/klog/blob/02fe3234c86b222bc85406122bd0d11acbe5363a/internal/serialize/keyvalues.go#L39-L42

But that just leads to a spiky memory leak. Notice how the spikes happen every 10 minutes, on the periodic reconcile: [memory usage graph]

All signs point to a bug in how the logger gets reused and KVs are appended to it across multiple reconciles.

dkoshkin · Sep 07 '22

@dkoshkin this is excellent (not that there's a leak, but that you did this profiling work)!

I assume the problem exists in 1.4.2 and main as well, but when time permits I'll try pprof on that.

mboersma · Sep 08 '22

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Dec 07 '22

/remove-lifecycle stale

jackfrancis · Dec 07 '22

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Mar 07 '23

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · Apr 06 '23

/remove-lifecycle rotten

nawazkh · Apr 06 '23

The problem still exists in v1.7.1

[screenshot: memory usage graph]

jonathanbeber · Apr 19 '23

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Jul 18 '23

/remove-lifecycle stale

mboersma · Jul 20 '23

Matt couldn't reproduce this behavior, and we are going to be doing extensive performance testing with https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/3547. It has also been a long time since this issue was opened, with numerous releases since then. For these reasons, we're marking this as needs-more-evidence.

dtzar · Aug 03 '23

I'm still seeing this in 1.8.5; is it already too old to be considered?

Otherwise I'm happy to provide more logs/details.

[screenshot: memory usage graph]

jonathanbeber · Aug 03 '23

I'd like to see if it still exists in the latest 1.10.x.
I also need to ask: is the load on CAPZ increasing over time, by chance? E.g., are you adding more clusters to manage over the course of a day?

dtzar · Aug 09 '23

@dtzar this deployment is not managing any CAPZ resources. I will test 1.10.x in a few days/weeks and report back.

jonathanbeber · Aug 10 '23

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Jan 26 '24

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · Feb 25 '24

/remove-lifecycle rotten

nawazkh · Feb 26 '24

> @dtzar this deployment is not managing any CAPZ resources. I will test 1.10.x in a few days/weeks and report back.

Hey @jonathanbeber, did you get a chance to probe further? Or do you have any recommended steps for probing this issue so that we can take it over and check it ourselves?

nawazkh · Feb 26 '24

I'm sorry, @nawazkh, I don't have any contact with CAPZ controllers anymore. You should be able to see the issue by simply running the controller and having some metrics collection in place (e.g. Prometheus + Grafana).

jonathanbeber · Feb 26 '24

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · May 26 '24

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · Jun 25 '24