
Memory leak in source-controller v1.5.0

Open dan0sh opened this issue 4 months ago • 4 comments

Observed behavior

In one of our clusters, source-controller memory usage grows linearly over time until it reaches the memory limit, at which point Kubernetes restarts the pod.

  • Growth rate is slow but steady (e.g., 250MiB → 1.3GiB over ~4 weeks).
  • No significant drops in memory usage during runtime (even after reconciliations complete).
  • Restarts reset memory to baseline, but the pattern repeats.
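For a quick spot check of the pod's current memory usage (a sketch, assuming metrics-server is installed and the default flux-system namespace):

# Show current memory usage of the source-controller pod (requires metrics-server;
# the app=source-controller label is the default set by the Flux manifests).
kubectl -n flux-system top pod -l app=source-controller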

[Image: memory usage graph of the affected source-controller pod]

On another cluster, source-controller works fine.

[Image: memory usage graph of the healthy source-controller pod]

The Flux setup is the same on both clusters. Could you please help me find the root cause?

Flux version

flux --version: Flux v2.5.1
source-controller image: ghcr.io/fluxcd/source-controller:v1.5.0

flux stats -A on problematic cluster

RECONCILERS             RUNNING FAILING SUSPENDED       STORAGE  
GitRepository           28      1       1               5.0 MiB         
OCIRepository           95      7       0               1.5 MiB         
HelmRepository          10      0       0               12.0 MiB        
HelmChart               11      0       0               1.2 MiB         
Bucket                  0       0       0               -               
Kustomization           343     9       43              -               
HelmRelease             11      0       0               -               
Alert                   538     0       0               -               
Provider                81      0       0               -               
Receiver                18      0       0               -               
ImageUpdateAutomation   0       0       0               -               
ImagePolicy             0       0       0               -               
ImageRepository         0       0       0               -     

flux stats -A on OK cluster

RECONCILERS             RUNNING FAILING SUSPENDED       STORAGE  
GitRepository           28      1       1               5.0 MiB         
OCIRepository           100     8       0               1.5 MiB         
HelmRepository          8       0       0               11.2 MiB        
HelmChart               10      0       0               1.1 MiB         
Bucket                  0       0       0               -               
Kustomization           350     12      40              -               
HelmRelease             10      1       0               -               
Alert                   548     0       0               -               
Provider                90      0       0               -               
Receiver                13      0       0               -               
ImageUpdateAutomation   0       0       0               -               
ImagePolicy             0       0       0               -               
ImageRepository         0       0       0               -  

dan0sh avatar Aug 13 '25 08:08 dan0sh

Can you please collect a heap profile for source-controller after it has been running for 2 weeks and share it with us? See here for how to collect the profile: https://fluxcd.io/flux/gitops-toolkit/debugging/#collecting-a-profile
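For reference, it roughly comes down to port-forwarding to the controller and fetching the standard Go pprof heap endpoint, e.g. (the port here is an assumption; use the one documented in the guide above):

# Port-forward to source-controller (8080 is an assumption; use the port
# from the debugging guide linked above).
kubectl -n flux-system port-forward deploy/source-controller 8080:8080

# In a second terminal, grab the heap profile from the standard Go pprof
# endpoint and open it locally.
curl -sk http://localhost:8080/debug/pprof/heap -o heap.out
go tool pprof heap.out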

stefanprodan avatar Aug 13 '25 08:08 stefanprodan

Unfortunately, I am not allowed to share the whole file. However, here are the top 10 nodes for the pod, which has been running for over 30 days.

Showing top 10 nodes out of 272
      flat  flat%   sum%        cum   cum%
 4608.27kB 16.15% 16.15%  4608.27kB 16.15%  encoding/json.(*decodeState).literalStore
 4096.73kB 14.36% 30.52%  4096.73kB 14.36%  reflect.New
 1767.16kB  6.19% 36.71%  1767.16kB  6.19%  bytes.growSlice
 1097.69kB  3.85% 40.56%  1097.69kB  3.85%  k8s.io/apimachinery/pkg/runtime.(*Scheme).AddKnownTypeWithName
 1032.22kB  3.62% 44.18%  1032.22kB  3.62%  k8s.io/apimachinery/pkg/api/meta.(*DefaultRESTMapper).AddSpecific
    1028kB  3.60% 47.78%     1028kB  3.60%  bufio.NewWriterSize
 1024.31kB  3.59% 51.37%  1024.31kB  3.59%  strings.(*Builder).grow
 1024.09kB  3.59% 54.96%  1024.09kB  3.59%  fmt.Sprintf
  528.17kB  1.85% 56.81%   528.17kB  1.85%  k8s.io/apimachinery/pkg/watch.(*Broadcaster).Watch.func1
  521.05kB  1.83% 58.64%   521.05kB  1.83%  k8s.io/utils/buffer.NewRingGrowing

top -cum

Showing nodes accounting for 512.01kB, 1.79% of 28526.07kB total
Showing top 10 nodes out of 272
      flat  flat%   sum%        cum   cum%
         0     0%     0% 14348.02kB 50.30%  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[go.shape.struct { k8s.io/apimachinery/pkg/types.NamespacedName }]).Reconcile
         0     0%     0% 14348.02kB 50.30%  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[go.shape.struct { k8s.io/apimachinery/pkg/types.NamespacedName }]).Start.func2.2
         0     0%     0% 14348.02kB 50.30%  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[go.shape.struct { k8s.io/apimachinery/pkg/types.NamespacedName }]).processNextWorkItem
         0     0%     0% 14348.02kB 50.30%  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[go.shape.struct { k8s.io/apimachinery/pkg/types.NamespacedName }]).reconcileHandler
         0     0%     0%  9729.15kB 34.11%  encoding/json.(*decodeState).array
  512.01kB  1.79%  1.79%  9729.15kB 34.11%  encoding/json.(*decodeState).object
         0     0%  1.79%  9729.15kB 34.11%  encoding/json.(*decodeState).unmarshal
         0     0%  1.79%  9729.15kB 34.11%  encoding/json.(*decodeState).value
         0     0%  1.79%  9729.15kB 34.11%  github.com/fluxcd/source-controller/internal/helm/repository.(*ChartRepository).LoadFromPath
         0     0%  1.79%  9729.15kB 34.11%  github.com/fluxcd/source-controller/internal/helm/repository.IndexFromBytes
(pprof) 
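If it would help, I can also drill into the callers of IndexFromBytes and LoadFromPath from the same profile with pprof's peek command, e.g.:

go tool pprof heap.out    # heap.out being the collected profile
(pprof) peek IndexFromBytes
(pprof) peek LoadFromPath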

For comparison, here is the same profile for the healthy pod in the other cluster.

Showing top 10 nodes out of 258
      flat  flat%   sum%        cum   cum%
 2571.05kB 10.03% 10.03%  2571.05kB 10.03%  bytes.growSlice
 2560.67kB  9.99% 20.02%  2560.67kB  9.99%  strings.(*Builder).grow
 1542.01kB  6.02% 26.03%  1542.01kB  6.02%  bufio.NewWriterSize
 1039.10kB  4.05% 30.09%  1039.10kB  4.05%  regexp/syntax.(*compiler).inst
 1036.12kB  4.04% 34.13%  1036.12kB  4.04%  k8s.io/apimachinery/pkg/runtime.(*Scheme).AddKnownTypeWithName
 1024.44kB  4.00% 38.12%  1024.44kB  4.00%  runtime.malg
  768.26kB  3.00% 41.12%   768.26kB  3.00%  go.uber.org/zap/zapcore.newCounters
  632.14kB  2.47% 43.59%   632.14kB  2.47%  k8s.io/utils/internal/third_party/forked/golang/golang-lru.(*Cache).Add
  553.04kB  2.16% 45.74%   553.04kB  2.16%  google.golang.org/protobuf/internal/strs.(*Builder).grow
  553.04kB  2.16% 47.90%   553.04kB  2.16%  google.golang.org/protobuf/reflect/protoregistry.(*Files).RegisterFile

top -cum

Showing top 10 nodes out of 258
      flat  flat%   sum%        cum   cum%
         0     0%     0%     8.36MB 33.40%  runtime.main
         0     0%     0%     6.11MB 24.41%  runtime.doInit (inline)
         0     0%     0%     6.11MB 24.41%  runtime.doInit1
         0     0%     0%     3.62MB 14.45%  k8s.io/client-go/tools/record.(*eventBroadcasterImpl).StartEventWatcher.func1
         0     0%     0%     3.62MB 14.45%  k8s.io/client-go/tools/record.(*eventBroadcasterImpl).StartRecordingToSink.func1
         0     0%     0%     3.62MB 14.45%  k8s.io/client-go/tools/record.(*eventBroadcasterImpl).recordToSink
         0     0%     0%     3.12MB 12.45%  k8s.io/client-go/tools/record.(*EventCorrelator).EventCorrelate
         0     0%     0%        3MB 12.00%  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[go.shape.struct { k8s.io/apimachinery/pkg/types.NamespacedName }]).Start.func2.2
         0     0%     0%        3MB 12.00%  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[go.shape.struct { k8s.io/apimachinery/pkg/types.NamespacedName }]).processNextWorkItem
         0     0%     0%        3MB 12.00%  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[go.shape.struct { k8s.io/apimachinery/pkg/types.NamespacedName }]).reconcileHandler
(pprof) 

dan0sh avatar Aug 13 '25 11:08 dan0sh

I don't see anything in the profile that would suggest the high memory usage you're observing. Is it possible that the node where this runs has OS issues and memory is never freed?

I suggest enabling the cache of Helm index files and rerunning the profile after some time. To enable the cache, please see: https://fluxcd.io/flux/installation/configuration/vertical-scaling/#enable-helm-repositories-caching

In your case it would be:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - patch: |
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-max-size=20
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-ttl=168h
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-purge-interval=1h      
    target:
      kind: Deployment
      name: source-controller
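
Once the patch is applied, a quick way to confirm the flags landed on the Deployment (assuming the default flux-system namespace):

# Print the container args of source-controller to confirm the cache flags are set.
kubectl -n flux-system get deploy source-controller \
  -o jsonpath='{.spec.template.spec.containers[0].args}'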

stefanprodan avatar Aug 13 '25 13:08 stefanprodan

I can try adjusting those settings, but we already have a similar patch in place:

--concurrent=20
--requeue-dependency=5s
--helm-cache-max-size=200
--helm-cache-ttl=60m
--helm-cache-purge-interval=5m

dan0sh avatar Aug 13 '25 13:08 dan0sh