Memory leak in source-controller v1.5.0
Observed behavior
In one of our clusters, source-controller memory usage grows linearly over time until it reaches the memory limit, at which point Kubernetes restarts the pod.
- Growth rate is slow but steady (e.g., 250MiB → 1.3GiB over ~4 weeks).
- No significant drops in memory usage during runtime (even after reconciliations complete).
- Restarts reset memory to baseline, but the pattern repeats.
On the other cluster, source-controller works fine. The Flux setup is the same on both clusters. Could you please help me find the root cause?
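For reference, this is roughly how I have been watching the usage (a minimal sketch; the label selector assumes the default Flux install labels, and the commented metric assumes a cAdvisor/Prometheus scrape such as kube-prometheus-stack):

```sh
# Point-in-time working-set memory of the source-controller pod
kubectl -n flux-system top pod -l app=source-controller

# Growth over time, if a Prometheus/cAdvisor scrape is available:
#   container_memory_working_set_bytes{namespace="flux-system", pod=~"source-controller-.*"}
```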
Flux version
```
flux --version: Flux v2.5.1
source-controller image: ghcr.io/fluxcd/source-controller:v1.5.0
```
flux stats -A on the problematic cluster
```
RECONCILERS            RUNNING  FAILING  SUSPENDED  STORAGE
GitRepository          28       1        1          5.0 MiB
OCIRepository          95       7        0          1.5 MiB
HelmRepository         10       0        0          12.0 MiB
HelmChart              11       0        0          1.2 MiB
Bucket                 0        0        0          -
Kustomization          343      9        43         -
HelmRelease            11       0        0          -
Alert                  538      0        0          -
Provider               81       0        0          -
Receiver               18       0        0          -
ImageUpdateAutomation  0        0        0          -
ImagePolicy            0        0        0          -
ImageRepository        0        0        0          -
```
flux stats -A on the OK cluster
```
RECONCILERS            RUNNING  FAILING  SUSPENDED  STORAGE
GitRepository          28       1        1          5.0 MiB
OCIRepository          100      8        0          1.5 MiB
HelmRepository         8        0        0          11.2 MiB
HelmChart              10       0        0          1.1 MiB
Bucket                 0        0        0          -
Kustomization          350      12       40         -
HelmRelease            10       1        0          -
Alert                  548      0        0          -
Provider               90       0        0          -
Receiver               13       0        0          -
ImageUpdateAutomation  0        0        0          -
ImagePolicy            0        0        0          -
ImageRepository        0        0        0          -
```
Can you please collect a heap profile for source-controller after it has been running for 2 weeks and share it with us? See here for how to collect the profile: https://fluxcd.io/flux/gitops-toolkit/debugging/#collecting-a-profile
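Once you have saved the heap snapshot locally (the linked guide shows the exact port-forward and download steps), the summaries can be produced with the standard Go tooling; `heap.out` below is just a placeholder filename:

```sh
# Open the saved heap profile with the standard Go pprof tool
go tool pprof heap.out

# At the interactive (pprof) prompt:
#   top        # top nodes by flat (self) allocation
#   top -cum   # top nodes by cumulative allocation
```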
Unfortunately I am not allowed to share the whole file. However, here are the top 10 nodes for the pod, which has been running for over 30 days.
```
Showing top 10 nodes out of 272
      flat  flat%   sum%        cum   cum%
 4608.27kB 16.15% 16.15%  4608.27kB 16.15%  encoding/json.(*decodeState).literalStore
 4096.73kB 14.36% 30.52%  4096.73kB 14.36%  reflect.New
 1767.16kB  6.19% 36.71%  1767.16kB  6.19%  bytes.growSlice
 1097.69kB  3.85% 40.56%  1097.69kB  3.85%  k8s.io/apimachinery/pkg/runtime.(*Scheme).AddKnownTypeWithName
 1032.22kB  3.62% 44.18%  1032.22kB  3.62%  k8s.io/apimachinery/pkg/api/meta.(*DefaultRESTMapper).AddSpecific
    1028kB  3.60% 47.78%     1028kB  3.60%  bufio.NewWriterSize
 1024.31kB  3.59% 51.37%  1024.31kB  3.59%  strings.(*Builder).grow
 1024.09kB  3.59% 54.96%  1024.09kB  3.59%  fmt.Sprintf
  528.17kB  1.85% 56.81%   528.17kB  1.85%  k8s.io/apimachinery/pkg/watch.(*Broadcaster).Watch.func1
  521.05kB  1.83% 58.64%   521.05kB  1.83%  k8s.io/utils/buffer.NewRingGrowing
top -cum
Showing nodes accounting for 512.01kB, 1.79% of 28526.07kB total
Showing top 10 nodes out of 272
      flat  flat%   sum%         cum   cum%
         0     0%     0%  14348.02kB 50.30%  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[go.shape.struct { k8s.io/apimachinery/pkg/types.NamespacedName }]).Reconcile
         0     0%     0%  14348.02kB 50.30%  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[go.shape.struct { k8s.io/apimachinery/pkg/types.NamespacedName }]).Start.func2.2
         0     0%     0%  14348.02kB 50.30%  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[go.shape.struct { k8s.io/apimachinery/pkg/types.NamespacedName }]).processNextWorkItem
         0     0%     0%  14348.02kB 50.30%  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[go.shape.struct { k8s.io/apimachinery/pkg/types.NamespacedName }]).reconcileHandler
         0     0%     0%   9729.15kB 34.11%  encoding/json.(*decodeState).array
  512.01kB  1.79%  1.79%   9729.15kB 34.11%  encoding/json.(*decodeState).object
         0     0%  1.79%   9729.15kB 34.11%  encoding/json.(*decodeState).unmarshal
         0     0%  1.79%   9729.15kB 34.11%  encoding/json.(*decodeState).value
         0     0%  1.79%   9729.15kB 34.11%  github.com/fluxcd/source-controller/internal/helm/repository.(*ChartRepository).LoadFromPath
         0     0%  1.79%   9729.15kB 34.11%  github.com/fluxcd/source-controller/internal/helm/repository.IndexFromBytes
(pprof)
```
For comparison, here is the same profile for the OK pod in the other cluster.
```
Showing top 10 nodes out of 258
      flat  flat%   sum%        cum   cum%
 2571.05kB 10.03% 10.03%  2571.05kB 10.03%  bytes.growSlice
 2560.67kB  9.99% 20.02%  2560.67kB  9.99%  strings.(*Builder).grow
 1542.01kB  6.02% 26.03%  1542.01kB  6.02%  bufio.NewWriterSize
 1039.10kB  4.05% 30.09%  1039.10kB  4.05%  regexp/syntax.(*compiler).inst
 1036.12kB  4.04% 34.13%  1036.12kB  4.04%  k8s.io/apimachinery/pkg/runtime.(*Scheme).AddKnownTypeWithName
 1024.44kB  4.00% 38.12%  1024.44kB  4.00%  runtime.malg
  768.26kB  3.00% 41.12%   768.26kB  3.00%  go.uber.org/zap/zapcore.newCounters
  632.14kB  2.47% 43.59%   632.14kB  2.47%  k8s.io/utils/internal/third_party/forked/golang/golang-lru.(*Cache).Add
  553.04kB  2.16% 45.74%   553.04kB  2.16%  google.golang.org/protobuf/internal/strs.(*Builder).grow
  553.04kB  2.16% 47.90%   553.04kB  2.16%  google.golang.org/protobuf/reflect/protoregistry.(*Files).RegisterFile
top -cum
Showing top 10 nodes out of 258
      flat  flat%   sum%     cum   cum%
         0     0%     0%  8.36MB 33.40%  runtime.main
         0     0%     0%  6.11MB 24.41%  runtime.doInit (inline)
         0     0%     0%  6.11MB 24.41%  runtime.doInit1
         0     0%     0%  3.62MB 14.45%  k8s.io/client-go/tools/record.(*eventBroadcasterImpl).StartEventWatcher.func1
         0     0%     0%  3.62MB 14.45%  k8s.io/client-go/tools/record.(*eventBroadcasterImpl).StartRecordingToSink.func1
         0     0%     0%  3.62MB 14.45%  k8s.io/client-go/tools/record.(*eventBroadcasterImpl).recordToSink
         0     0%     0%  3.12MB 12.45%  k8s.io/client-go/tools/record.(*EventCorrelator).EventCorrelate
         0     0%     0%     3MB 12.00%  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[go.shape.struct { k8s.io/apimachinery/pkg/types.NamespacedName }]).Start.func2.2
         0     0%     0%     3MB 12.00%  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[go.shape.struct { k8s.io/apimachinery/pkg/types.NamespacedName }]).processNextWorkItem
         0     0%     0%     3MB 12.00%  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[go.shape.struct { k8s.io/apimachinery/pkg/types.NamespacedName }]).reconcileHandler
(pprof)
```
I don't see anything in the profile that would suggest the high memory usage you're observing. Is it possible that the node where this runs has OS issues and memory is never freed?
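One way to cross-check that is to compare the pod's working set with what the Go runtime itself reports. A rough sketch, assuming the default Flux labels and the default metrics port (8080) from the Flux manifests; the exact port may differ in your setup:

```sh
# Working set as reported by the kubelet/cAdvisor
kubectl -n flux-system top pod -l app=source-controller

# Go runtime view: heap in use vs. memory obtained from the OS
kubectl -n flux-system port-forward deploy/source-controller 8080 &
curl -s http://localhost:8080/metrics | grep -E 'go_memstats_(heap_inuse|heap_idle|heap_released|sys)_bytes'
```

If the working set is far above go_memstats_sys_bytes, the extra memory is being held outside the Go heap.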
I suggest enabling the cache for Helm index files and rerunning the profile after some time. To enable the cache, please see: https://fluxcd.io/flux/installation/configuration/vertical-scaling/#enable-helm-repositories-caching
In your case it would be:
```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - patch: |
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-max-size=20
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-ttl=168h
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-purge-interval=1h
    target:
      kind: Deployment
      name: source-controller
```
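After the patch is applied, a quick way to confirm the flags landed on the Deployment (assuming the default flux-system namespace):

```sh
# List the container args of source-controller after the patch
kubectl -n flux-system get deploy source-controller \
  -o jsonpath='{.spec.template.spec.containers[0].args}'
```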
I can try adjusting those settings, but we already have a similar patch in place:
```
--concurrent=20
--requeue-dependency=5s
--helm-cache-max-size=200
--helm-cache-ttl=60m
--helm-cache-purge-interval=5m
```