cluster-api-provider-azure
Memory leak in the controller
/kind bug
What steps did you take and what happened:
Deploy CAPZ v1.3.2 and let it run for some time. The controller's memory usage grows steadily (a memory leak) even when there are no Azure/AKS clusters.
Here is a graph with just a single non-Azure/AKS cluster:
The graph is steeper when scaling to ~100 clusters (still 0 Azure/AKS clusters):
What did you expect to happen: The controller's memory usage should be stable when there are no Azure/AKS clusters.
Anything else you would like to add:
I enabled pprof to capture some info. Here is what it looked like when first starting up:
And then again after a couple of hours:
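For anyone trying to reproduce this profiling, here is a minimal sketch of one way to expose pprof in a Go binary (the port and wiring are my assumptions, not how CAPZ actually enables it):

```go
// Minimal sketch (assumption; the issue doesn't show how pprof was enabled):
// expose the standard net/http/pprof endpoints on a side port so heap
// profiles can be pulled from the running controller, e.g. with
//   go tool pprof http://localhost:6060/debug/pprof/heap
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// In the real manager binary this would run in a goroutine alongside
	// the controller manager rather than being the whole program.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```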
It pointed us to a potential culprit, a growing slice: https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/392137a94135a6995867d8c7fdfd56fabad719fc/util/tele/span_logger.go#L92
I added some debug logging on the length and capacity of that slice, and after a few hours it holds >100,000 KV pairs.
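To make the suspected pattern concrete, here is a simplified stand-in (my own illustration, not the actual span_logger code): a logger created once keeps appending KV pairs to the same slice, so its length and capacity only ever grow.

```go
// Simplified stand-in for the suspected pattern (not the actual span_logger
// implementation): a logger instantiated once keeps appending KV pairs to
// the same slice, so every reconcile makes it longer.
package main

import "fmt"

type leakyLogger struct {
	vals []interface{}
}

// WithValues grows the receiver's slice instead of returning an independent copy.
func (l *leakyLogger) WithValues(kvs ...interface{}) *leakyLogger {
	l.vals = append(l.vals, kvs...)
	return l
}

func main() {
	// Instantiated once, as if in SetupWithManager.
	logger := &leakyLogger{}

	// Each simulated reconcile attaches its KVs to the same logger.
	for i := 0; i < 3; i++ {
		logger.WithValues("namespace", "default", "name", fmt.Sprintf("cluster-%d", i))
		// The debug logging described above: len/cap only ever grow.
		fmt.Printf("len=%d cap=%d\n", len(logger.vals), cap(logger.vals))
	}
}
```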
A different view in pprof provided some more info on where to look:
It doesn't give the exact function, but it hints towards https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/392137a94135a6995867d8c7fdfd56fabad719fc/controllers/azurecluster_controller.go#L88
It looks like the `log` is instantiated once, when `SetupWithManager` is called from `main.go`:
https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/392137a94135a6995867d8c7fdfd56fabad719fc/controllers/azurecluster_controller.go#L73-L78
This is different from how other providers set up their loggers: they get a new logger instance on every reconcile, so there is no growing slice of KV pairs.
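For contrast, a sketch of the per-reconcile pattern (the type name below is a placeholder, and using controller-runtime's log.FromContext is my assumption about what other providers do, not quoted from their code):

```go
// Sketch of the per-reconcile logger pattern (ExampleReconciler is a
// placeholder type; log.FromContext is controller-runtime's helper).
// A fresh logger is derived inside Reconcile, so the KV pairs attached to it
// are dropped when the reconcile returns instead of accumulating.
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log"
)

type ExampleReconciler struct{}

func (r *ExampleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// New logger per reconcile; nothing carries over across reconciles.
	logger := log.FromContext(ctx).WithValues("namespace", req.Namespace, "name", req.Name)
	logger.Info("reconciling")
	return ctrl.Result{}, nil
}
```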
Environment:
- cluster-api-provider-azure version: v1.3.2
- Kubernetes version (use `kubectl version`):
- OS (e.g. from `/etc/os-release`):
I thought that the way the KVs are appended to the slice might be causing an issue in Go, and tried a similar approach to how it's done in https://github.com/kubernetes/klog/blob/02fe3234c86b222bc85406122bd0d11acbe5363a/internal/serialize/keyvalues.go#L39-L42
But that just leads to a spiky memory leak. Notice how the spikes happen every 10 minutes, on a periodic reconcile.
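For context on the approach that was tried, here is a rough sketch of copy-on-append, loosely modeled on the linked klog code (the helper name is illustrative, not anything in CAPZ):

```go
// Sketch of the copy-on-append idea (loosely modeled on the linked klog
// code; mergeKVs is an illustrative name). Each call allocates a fresh slice
// and copies both inputs into it, so no caller's backing array is mutated.
// But if the logger keeps the merged result around, the accumulated pairs
// still grow, just in allocation bursts instead of steadily.
package main

import "fmt"

func mergeKVs(existing, extra []interface{}) []interface{} {
	merged := make([]interface{}, 0, len(existing)+len(extra))
	merged = append(merged, existing...)
	merged = append(merged, extra...)
	return merged
}

func main() {
	base := []interface{}{"controller", "azurecluster"}
	perCall := mergeKVs(base, []interface{}{"name", "cluster-0"})
	fmt.Println(len(base), len(perCall)) // base stays at 2; the merged slice is 4
}
```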
All signs point to a bug in how the `log` gets reused and KVs are appended to it across multiple reconciles.
@dkoshkin this is excellent (not that there's a leak, but that you did this profiling work)!
I assume the problem exists in 1.4.2 and main as well, but when time permits I'll try pprof on that.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/lifecycle stale
/lifecycle rotten
/remove-lifecycle rotten
The problem still exists in v1.7.1
/lifecycle stale
/remove-lifecycle stale
Matt couldn't reproduce this behavior, and we are going to be doing extensive performance testing as part of https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/3547. It has also been a long time since this issue was opened (with numerous releases since then). For these reasons, we're marking this as `needs-more-evidence`.
I'm still seeing this in 1.8.5. Is that already too old to be considered? Otherwise I'm happy to provide more logs/details.
I'd like to see if it still exists in the latest 1.10.x.
I also need to ask: is the load on CAPZ increasing over time by chance? For example, are you adding more clusters to manage over the course of a day?
@dtzar this deployment is not managing any CAPZ resources. I will test 1.10.x in a few days/weeks and report back.
/lifecycle stale
/lifecycle rotten
/remove-lifecycle rotten
> @dtzar this deployment is not managing any CAPZ resources. I will test 1.10.x in a few days/weeks and report back.
Hey @jonathanbeber, did you get a chance to probe further? Or do you have any recommended steps for probing this issue so that we can take it over and check it ourselves?
I'm sorry, @nawazkh, I don't have access to any CAPZ controllers anymore. You will be able to see the issue by simply running the controller and having some metrics collector in place (e.g. Prometheus + Grafana).
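If a full metrics stack isn't handy, a rough alternative (my own sketch, not something anyone in this thread used) is to log the process's heap stats periodically and watch for unbounded growth:

```go
// Rough sketch (suggestion, not from this thread): periodically log the Go
// heap statistics so unbounded growth is visible even without a
// Prometheus + Grafana setup.
package main

import (
	"log"
	"runtime"
	"time"
)

func main() {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		log.Printf("heapAlloc=%d MiB heapObjects=%d", m.HeapAlloc/1024/1024, m.HeapObjects)
	}
}
```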
/lifecycle stale
/lifecycle rotten