cluster-api-provider-azure
Memory leak in the controller
/kind bug
What steps did you take and what happened:
Deploy CAPZ v1.3.2 and let it run for some time. The controller's memory usage grows steadily (a memory leak) even when there are no Azure/AKS clusters.
Here is a graph with just a single non-Azure/AKS cluster:
The graph is steeper when scaling to ~100 clusters (still 0 Azure/AKS clusters):
What did you expect to happen: The controller's memory usage should be stable when there are no Azure/AKS clusters.
Anything else you would like to add:
I enabled pprof to capture some info. Here is what it looked like when first starting up:
And then again after a couple of hours:
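For anyone trying to reproduce this profiling, here is a minimal sketch of one way to expose pprof in a Go binary (the port and wiring are my assumptions, not how CAPZ actually enables it):

```go
// Minimal sketch (assumption; the issue doesn't show how pprof was enabled):
// expose the standard net/http/pprof endpoints on a side port so heap
// profiles can be pulled from the running controller, e.g. with
//   go tool pprof http://localhost:6060/debug/pprof/heap
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// In the real manager binary this would run in a goroutine alongside
	// the controller manager rather than being the whole program.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```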
It pointed us to a potential culprit, a growing slice: https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/392137a94135a6995867d8c7fdfd56fabad719fc/util/tele/span_logger.go#L92
I added some debug logging on the length and capacity of that slice, and after a few hours it holds >100,000 KV pairs.
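To make the suspected pattern concrete, here is a simplified stand-in (my own illustration, not the actual span_logger code): a logger created once keeps appending KV pairs to the same slice, so its length and capacity only ever grow.

```go
// Simplified stand-in for the suspected pattern (not the actual span_logger
// implementation): a logger instantiated once keeps appending KV pairs to
// the same slice, so every reconcile makes it longer.
package main

import "fmt"

type leakyLogger struct {
	vals []interface{}
}

// WithValues grows the receiver's slice instead of returning an independent copy.
func (l *leakyLogger) WithValues(kvs ...interface{}) *leakyLogger {
	l.vals = append(l.vals, kvs...)
	return l
}

func main() {
	// Instantiated once, as if in SetupWithManager.
	logger := &leakyLogger{}

	// Each simulated reconcile attaches its KVs to the same logger.
	for i := 0; i < 3; i++ {
		logger.WithValues("namespace", "default", "name", fmt.Sprintf("cluster-%d", i))
		// The debug logging described above: len/cap only ever grow.
		fmt.Printf("len=%d cap=%d\n", len(logger.vals), cap(logger.vals))
	}
}
```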
A different view in pprof provided some more info on where to look:
It doesn't give the exact function, but it hints towards https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/392137a94135a6995867d8c7fdfd56fabad719fc/controllers/azurecluster_controller.go#L88
It looks like the `log` is instantiated once, when `SetupWithManager` is called from `main.go`:
https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/392137a94135a6995867d8c7fdfd56fabad719fc/controllers/azurecluster_controller.go#L73-L78
This is different from how other providers set up their loggers: they get a new logger instance on every reconcile, so there is no growing slice of KV pairs.
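For contrast, a sketch of the per-reconcile pattern (the type name below is a placeholder, and using controller-runtime's log.FromContext is my assumption about what other providers do, not quoted from their code):

```go
// Sketch of the per-reconcile logger pattern (ExampleReconciler is a
// placeholder type; log.FromContext is controller-runtime's helper).
// A fresh logger is derived inside Reconcile, so the KV pairs attached to it
// are dropped when the reconcile returns instead of accumulating.
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log"
)

type ExampleReconciler struct{}

func (r *ExampleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// New logger per reconcile; nothing carries over across reconciles.
	logger := log.FromContext(ctx).WithValues("namespace", req.Namespace, "name", req.Name)
	logger.Info("reconciling")
	return ctrl.Result{}, nil
}
```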
Environment:
- cluster-api-provider-azure version: v1.3.2
- Kubernetes version (use `kubectl version`):
- OS (e.g. from `/etc/os-release`):
I thought that the way the KVs are appended to the slice might be causing an issue in Go, and tried a similar approach to how it's done in https://github.com/kubernetes/klog/blob/02fe3234c86b222bc85406122bd0d11acbe5363a/internal/serialize/keyvalues.go#L39-L42
But that just leads to a spiky memory leak. Notice how the spikes happen every 10 minutes, on a periodic reconcile.
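For context on the approach that was tried, here is a rough sketch of copy-on-append, loosely modeled on the linked klog code (the helper name is illustrative, not anything in CAPZ):

```go
// Sketch of the copy-on-append idea (loosely modeled on the linked klog
// code; mergeKVs is an illustrative name). Each call allocates a fresh slice
// and copies both inputs into it, so no caller's backing array is mutated.
// But if the logger keeps the merged result around, the accumulated pairs
// still grow, just in allocation bursts instead of steadily.
package main

import "fmt"

func mergeKVs(existing, extra []interface{}) []interface{} {
	merged := make([]interface{}, 0, len(existing)+len(extra))
	merged = append(merged, existing...)
	merged = append(merged, extra...)
	return merged
}

func main() {
	base := []interface{}{"controller", "azurecluster"}
	perCall := mergeKVs(base, []interface{}{"name", "cluster-0"})
	fmt.Println(len(base), len(perCall)) // base stays at 2; the merged slice is 4
}
```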
All signs point to a bug in how the `log` gets reused and KVs are appended to it across multiple reconciles.
@dkoshkin this is excellent (not that there's a leak, but that you did this profiling work)!
I assume the problem exists in 1.4.2 and main as well, but when time permits I'll try pprof on that.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/lifecycle stale
/lifecycle rotten
/remove-lifecycle rotten
The problem still exists in v1.7.1
/lifecycle stale
/remove-lifecycle stale
Matt couldn't reproduce this behavior, and we are going to be doing extensive performance testing as part of https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/3547. It has also been a long time since this issue was opened (with numerous releases since then). For these reasons, we're marking this as `needs-more-evidence`.
I'm still seeing this in 1.8.5. Is that already too old to be considered? Otherwise I'm happy to provide more logs/details.
I'd like to see if it still exists in the latest 1.10.x.
I also need to ask: is the load on CAPZ increasing over time by chance? For example, are you adding more clusters to manage over the course of a day?
@dtzar this deployment is not managing any CAPZ resources. I will test 1.10.x in a few days/weeks and report back.
/lifecycle stale
/lifecycle rotten
/remove-lifecycle rotten
> @dtzar this deployment is not managing any CAPZ resources. I will test 1.10.x in a few days/weeks and report back.
Hey @jonathanbeber, did you get a chance to probe further? Or do you have any recommended steps for probing this issue so that we can take it over and check it ourselves?
I'm sorry, @nawazkh, I don't have access to any CAPZ controllers anymore. You will be able to see the issue by simply running the controller and having some metrics collector in place (e.g. Prometheus + Grafana).
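If a full metrics stack isn't handy, a rough alternative (my own sketch, not something anyone in this thread used) is to log the process's heap stats periodically and watch for unbounded growth:

```go
// Rough sketch (suggestion, not from this thread): periodically log the Go
// heap statistics so unbounded growth is visible even without a
// Prometheus + Grafana setup.
package main

import (
	"log"
	"runtime"
	"time"
)

func main() {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		log.Printf("heapAlloc=%d MiB heapObjects=%d", m.HeapAlloc/1024/1024, m.HeapObjects)
	}
}
```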
/lifecycle stale
/lifecycle rotten