
[BUG] reloader consumes a lot of memory

Open Evalle opened this issue 4 months ago • 13 comments

Describe the bug
Reloader is using a lot of memory.

To Reproduce
Deploy the latest version of Reloader to a Kubernetes cluster and check the memory metrics.

Expected behavior
Reloader runs with normal memory usage.

Screenshots

Cluster | Secrets | ConfigMaps | Pods | Deployments | Namespaces
A       | 1736    | 1050       | 8944 | 520         | 299
B       | 1805    | 796        | 1937 | 550         | 216

On cluster A the memory consumption of reloader is around 80 MB:

[screenshot] On cluster B it is around 16 GB: [screenshot]

The cluster configuration is the same, so ideally I would like to see in the debug messages what is going on there and why there is such a big difference. Thanks.

Environment

  • reloader version: 1.4.5
  • Kubernetes Version: 1.31.6

Evalle avatar Jul 23 '25 14:07 Evalle

Can you share Reloader's deployment YAML? Are there any noticeable differences between these two clusters, e.g. in their cluster configuration, or any other significant workload difference like ArgoCD or anything similar?

msafwankarim avatar Jul 24 '25 10:07 msafwankarim

hey @msafwankarim

Can you share Reloader's deployment YAML?

Sure, here it is:

apiVersion: v1
kind: Namespace
metadata:
  name: reloader
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: reloader-reloader
  namespace: reloader
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: reloader-reloader-role
rules:
- apiGroups:
  - ""
  resources:
  - secrets
  - configmaps
  verbs:
  - list
  - get
  - watch
- apiGroups:
  - apps
  resources:
  - deployments
  - daemonsets
  - statefulsets
  verbs:
  - list
  - get
  - update
  - patch
- apiGroups:
  - extensions
  resources:
  - deployments
  - daemonsets
  verbs:
  - list
  - get
  - update
  - patch
- apiGroups:
  - batch
  resources:
  - cronjobs
  verbs:
  - list
  - get
- apiGroups:
  - batch
  resources:
  - jobs
  verbs:
  - create
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
  - patch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: reloader-reloader-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: reloader-reloader-role
subjects:
- kind: ServiceAccount
  name: reloader-reloader
  namespace: reloader
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: reloader-reloader
  namespace: reloader
spec:
  replicas: 1
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      app: reloader-reloader
  template:
    metadata:
      labels:
        app: reloader-reloader
    spec:
      containers:
      - args:
        - --reload-strategy=annotations
        env:
        - name: GOMAXPROCS
          valueFrom:
            resourceFieldRef:
              resource: limits.cpu
        - name: GOMEMLIMIT
          valueFrom:
            resourceFieldRef:
              resource: limits.memory
        image: ghcr.io/stakater/reloader:v1.4.5
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /live
            port: http
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: reloader-reloader
        ports:
        - containerPort: 9090
          name: http
        readinessProbe:
          failureThreshold: 5
          httpGet:
            path: /metrics
            port: http
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            cpu: 200m
            memory: 16000Mi
          requests:
            cpu: 50m
            memory: 100Mi
      serviceAccountName: reloader-reloader

Are there any noticeable differences between these two clusters, e.g. in their cluster configuration, or any other significant workload difference like ArgoCD or anything similar?

I couldn't find any. I was hoping the debug or trace log-level messages would help me understand what is going on there.

Evalle avatar Jul 24 '25 11:07 Evalle

Thank you for sharing this info. Can you tell me which annotations you are using on workloads with Reloader? I'm trying to narrow down the possibilities, as we've tried reproducing this on our cluster but failed to do so. Are you using Reloader with the reloader.stakater.com/auto annotation or any other ones?

msafwankarim avatar Jul 24 '25 14:07 msafwankarim

Are you using Reloader with the reloader.stakater.com/auto annotation or any other ones?

We only use the reloader.stakater.com/auto annotation. The interesting thing is that only a couple of deployments in each cluster are actually using Reloader.
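
For reference, those deployments opt in roughly like this (a minimal sketch with placeholder names, not one of our real manifests):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app        # placeholder name
  namespace: example       # placeholder namespace
  annotations:
    reloader.stakater.com/auto: "true"   # Reloader rolls the deployment when a referenced ConfigMap/Secret changes
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
      - name: app
        image: example/app:1.0.0         # placeholder image
        envFrom:
        - configMapRef:
            name: example-config         # a referenced ConfigMap; Secrets work the same way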

Evalle avatar Jul 24 '25 14:07 Evalle

A few updates:

  1. The CPU throttling graph looks like this: [screenshot]

  2. I'm also trying to exclude some of the namespaces with the largest number of Secrets and ConfigMaps via:

     - --namespaces-to-ignore=foo,bar

The log is still showing the following, so I'm not sure whether the namespace exclusion is effective at all (a quick way to double-check the running args is shown after this list):

time="2025-07-25T07:15:12Z" level=info msg="Environment: Kubernetes"
time="2025-07-25T07:15:12Z" level=info msg="Starting Reloader"
time="2025-07-25T07:15:12Z" level=warning msg="KUBERNETES_NAMESPACE is unset, will detect changes in all namespaces."
time="2025-07-25T07:15:12Z" level=info msg="created controller for: secrets"
time="2025-07-25T07:15:12Z" level=info msg="Starting Controller to watch resource type: secrets"
time="2025-07-25T07:15:12Z" level=info msg="created controller for: configMaps"
time="2025-07-25T07:15:12Z" level=info msg="Starting Controller to watch resource type: configMaps"
  3. Also, I changed Reloader's permissions for CronJobs and Jobs in the cluster; I thought that maybe some of them were causing Reloader to leak memory, but it doesn't look like that is the case.
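
The sanity check mentioned above, just to confirm the flag actually reached the running container (deployment and namespace names match the manifest I shared earlier):

kubectl -n reloader get deploy reloader-reloader \
  -o jsonpath='{.spec.template.spec.containers[0].args}{"\n"}'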

Evalle avatar Jul 25 '25 08:07 Evalle

@msafwankarim I can see that you were working on pprof support for Reloader recently: https://github.com/stakater/Reloader/pull/961. This is something that could help us here as well.

Evalle avatar Jul 25 '25 08:07 Evalle

I checked some memory stats via ephemeral sidecar containers. Here is the data:

reloader-reloader-598bff4866-fxf7g:/root$ ps aux
PID   USER     TIME  COMMAND
    1 nobody    6:42 /manager --reload-strategy=annotations
   10 nobody    0:00 /bin/bash
   16 nobody    0:00 ps aux
reloader-reloader-598bff4866-fxf7g:/root$ cat /proc/1/status
Name:	manager
Umask:	0022
State:	S (sleeping)
Tgid:	1
Ngid:	0
Pid:	1
PPid:	0
TracerPid:	0
Uid:	65534	65534	65534	65534
Gid:	65534	65534	65534	65534
FDSize:	64
Groups:	65534
NStgid:	1
NSpid:	1
NSpgid:	1
NSsid:	1
Kthread:	0
VmPeak:	 2488392 kB
VmSize:	 2488392 kB
VmLck:	       0 kB
VmPin:	       0 kB
VmHWM:	 1312260 kB
VmRSS:	 1312260 kB
RssAnon:	 1285636 kB
RssFile:	   26624 kB
RssShmem:	       0 kB
VmData:	 1325504 kB
VmStk:	     132 kB
VmExe:	   19444 kB
VmLib:	       8 kB
VmPTE:	    2680 kB
VmSwap:	       0 kB
HugetlbPages:	       0 kB
CoreDumping:	0
THP_enabled:	1
untag_mask:	0xffffffffffffffff
Threads:	4
SigQ:	1/515252
SigPnd:	0000000000000000
ShdPnd:	0000000000000000
SigBlk:	0000000000000000
SigIgn:	0000000000000000
SigCgt:	fffffffd7fc1feff
CapInh:	0000000000000000
CapPrm:	0000000000000000
CapEff:	0000000000000000
CapBnd:	00000000a80425fb
CapAmb:	0000000000000000
NoNewPrivs:	0
Seccomp:	2
Seccomp_filters:	1
Speculation_Store_Bypass:	thread vulnerable
SpeculationIndirectBranch:	conditional enabled
Cpus_allowed:	ffff
Cpus_allowed_list:	0-15
Mems_allowed:	00000000,00000001
Mems_allowed_list:	0
voluntary_ctxt_switches:	110043
nonvoluntary_ctxt_switches:	1481
reloader-reloader-598bff4866-fxf7g:/root$ cat /proc/1/smaps_rollup
00400000-7fffaa7f6000 ---p 00000000 00:00 0                              [rollup]
Rss:             1392120 kB
Pss:             1392116 kB
Pss_Dirty:       1364936 kB
Pss_Anon:        1364936 kB
Pss_File:          27180 kB
Pss_Shmem:             0 kB
Shared_Clean:          4 kB
Shared_Dirty:          0 kB
Private_Clean:     27180 kB
Private_Dirty:   1364936 kB
Referenced:      1392120 kB
Anonymous:       1364936 kB
KSM:                   0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB

It looks like almost all of the memory is privately held and dirty, which points to a leak or a runaway data structure in the Go heap. /metrics also shows that go_memstats_heap_objects is around 6M objects now, while reloader_reload_executed_total = 0.
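
For completeness, this is roughly how I'm pulling those numbers (port 9090 and the /metrics path come from the deployment manifest above):

kubectl -n reloader port-forward deploy/reloader-reloader 9090:9090 &
curl -s http://localhost:9090/metrics \
  | grep -E 'go_memstats_heap_objects|go_memstats_heap_alloc_bytes|reloader_reload_executed_total'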

Evalle avatar Jul 25 '25 10:07 Evalle

Hi! I've pushed a version with pprof enabled. You can use the following helm install command to install it. If the issue reproduces, we can see which types of objects are being created:

helm install reloader oci://ghcr.io/stakater/charts/reloader --version 2.1.5 --set image.tag=SNAPSHOT-PR-978-1e4125f0

Reference PR: #978
Image path: ghcr.io/stakater/reloader:snapshot-pr-978-1e4125f0
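
Once that image is running, a heap profile can be captured with something along these lines (a sketch assuming the standard net/http/pprof endpoints; the port may differ depending on how the PR exposes them):

kubectl -n reloader port-forward deploy/reloader-reloader 9090:9090 &
# inspect the top allocators directly
go tool pprof -top http://localhost:9090/debug/pprof/heap
# or save the profile so it can be attached to this issue
curl -s -o heap.pprof http://localhost:9090/debug/pprof/heap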

msafwankarim avatar Jul 25 '25 11:07 msafwankarim

Thanks, @msafwankarim. I'm going on leave until September; a colleague of mine will continue the investigation from our side. We will use pprof and share the results here.

Evalle avatar Jul 25 '25 12:07 Evalle

Hi @Evalle, did you guys manage to reproduce the issue?

rasheedamir avatar Nov 29 '25 22:11 rasheedamir

Maybe the same issue as #474?

josegonzalez avatar Dec 04 '25 10:12 josegonzalez

Hey @rasheedamir, we were able to find the root cause using the latest version of Reloader (--log-level=debug actually works in the latest version, yay! :) ). Reloader was overloaded because it watches all ConfigMaps and Secrets in our cluster, and a few namespaces were generating a huge number of Secret updates. The problem came from two different controllers both trying to manage the same Secret: one was creating and updating certificate data, and the other was maintaining its own keystore file inside that Secret. Each controller kept "fixing" the Secret after the other changed it, creating a rapid update loop. Reloader had to process every one of these updates, which caused it to use excessive memory and eventually crash (OOMKilled).
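
In case it helps anyone else hitting this: one rough way to spot such high-churn Secrets is to watch resourceVersion updates across the cluster and see which names keep repeating (just let it stream for a minute or two; the output is noisy):

kubectl get secrets -A --watch-only \
  -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,RV:.metadata.resourceVersion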

Evalle avatar Dec 08 '25 11:12 Evalle

Hi! Glad to hear that your issue has been resolved. Should we close this?

msafwankarim avatar Dec 10 '25 10:12 msafwankarim