helm-controller
`helm-controller` Pod gets OOM-killed even with 1GB of RAM
Describe the bug
The title says it all. Here is the Pod definition:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: helm-controller-f56848c5-gsd44
  generateName: helm-controller-f56848c5-
  namespace: flux-system
  uid: 5959073e-cf82-4d65-8925-9ece92fb366c
  resourceVersion: '408070363'
  creationTimestamp: '2021-11-02T11:08:39Z'
  labels:
    app: helm-controller
    pod-template-hash: f56848c5
  annotations:
    prometheus.io/port: '8080'
    prometheus.io/scrape: 'true'
  ownerReferences:
    - apiVersion: apps/v1
      kind: ReplicaSet
      name: helm-controller-f56848c5
      uid: e748e195-06e5-411d-acbf-005c180a47ed
      controller: true
      blockOwnerDeletion: true
  managedFields:
    - manager: kube-controller-manager
      operation: Update
      apiVersion: v1
      time: '2021-11-02T11:08:39Z'
      fieldsType: FieldsV1
      fieldsV1:
        'f:metadata':
          'f:annotations':
            .: {}
            'f:prometheus.io/port': {}
            'f:prometheus.io/scrape': {}
          'f:generateName': {}
          'f:labels':
            .: {}
            'f:app': {}
            'f:pod-template-hash': {}
          'f:ownerReferences':
            .: {}
            'k:{"uid":"e748e195-06e5-411d-acbf-005c180a47ed"}':
              .: {}
              'f:apiVersion': {}
              'f:blockOwnerDeletion': {}
              'f:controller': {}
              'f:kind': {}
              'f:name': {}
              'f:uid': {}
        'f:spec':
          'f:containers':
            'k:{"name":"manager"}':
              .: {}
              'f:args': {}
              'f:env':
                .: {}
                'k:{"name":"RUNTIME_NAMESPACE"}':
                  .: {}
                  'f:name': {}
                  'f:valueFrom':
                    .: {}
                    'f:fieldRef':
                      .: {}
                      'f:apiVersion': {}
                      'f:fieldPath': {}
              'f:image': {}
              'f:imagePullPolicy': {}
              'f:livenessProbe':
                .: {}
                'f:failureThreshold': {}
                'f:httpGet':
                  .: {}
                  'f:path': {}
                  'f:port': {}
                  'f:scheme': {}
                'f:periodSeconds': {}
                'f:successThreshold': {}
                'f:timeoutSeconds': {}
              'f:name': {}
              'f:ports':
                .: {}
                'k:{"containerPort":8080,"protocol":"TCP"}':
                  .: {}
                  'f:containerPort': {}
                  'f:name': {}
                  'f:protocol': {}
                'k:{"containerPort":9440,"protocol":"TCP"}':
                  .: {}
                  'f:containerPort': {}
                  'f:name': {}
                  'f:protocol': {}
              'f:readinessProbe':
                .: {}
                'f:failureThreshold': {}
                'f:httpGet':
                  .: {}
                  'f:path': {}
                  'f:port': {}
                  'f:scheme': {}
                'f:periodSeconds': {}
                'f:successThreshold': {}
                'f:timeoutSeconds': {}
              'f:resources':
                .: {}
                'f:limits':
                  .: {}
                  'f:cpu': {}
                  'f:memory': {}
                'f:requests':
                  .: {}
                  'f:cpu': {}
                  'f:memory': {}
              'f:securityContext':
                .: {}
                'f:allowPrivilegeEscalation': {}
                'f:readOnlyRootFilesystem': {}
              'f:terminationMessagePath': {}
              'f:terminationMessagePolicy': {}
              'f:volumeMounts':
                .: {}
                'k:{"mountPath":"/tmp"}':
                  .: {}
                  'f:mountPath': {}
                  'f:name': {}
          'f:dnsPolicy': {}
          'f:enableServiceLinks': {}
          'f:imagePullSecrets':
            .: {}
            'k:{"name":"redacted"}':
              .: {}
              'f:name': {}
          'f:nodeSelector':
            .: {}
            'f:kubernetes.io/os': {}
          'f:restartPolicy': {}
          'f:schedulerName': {}
          'f:securityContext': {}
          'f:serviceAccount': {}
          'f:serviceAccountName': {}
          'f:terminationGracePeriodSeconds': {}
          'f:volumes':
            .: {}
            'k:{"name":"temp"}':
              .: {}
              'f:emptyDir': {}
              'f:name': {}
    - manager: kubelet
      operation: Update
      apiVersion: v1
      time: '2021-11-02T17:40:39Z'
      fieldsType: FieldsV1
      fieldsV1:
        'f:status':
          'f:conditions':
            'k:{"type":"ContainersReady"}':
              .: {}
              'f:lastProbeTime': {}
              'f:lastTransitionTime': {}
              'f:status': {}
              'f:type': {}
            'k:{"type":"Initialized"}':
              .: {}
              'f:lastProbeTime': {}
              'f:lastTransitionTime': {}
              'f:status': {}
              'f:type': {}
            'k:{"type":"Ready"}':
              .: {}
              'f:lastProbeTime': {}
              'f:lastTransitionTime': {}
              'f:status': {}
              'f:type': {}
          'f:containerStatuses': {}
          'f:hostIP': {}
          'f:phase': {}
          'f:podIP': {}
          'f:podIPs':
            .: {}
            'k:{"ip":"10.233.79.245"}':
              .: {}
              'f:ip': {}
          'f:startTime': {}
  selfLink: /api/v1/namespaces/flux-system/pods/helm-controller-f56848c5-gsd44
status:
  phase: Running
  conditions:
    - type: Initialized
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2021-11-02T11:08:39Z'
    - type: Ready
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2021-11-02T17:40:39Z'
    - type: ContainersReady
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2021-11-02T17:40:39Z'
    - type: PodScheduled
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2021-11-02T11:08:39Z'
  hostIP: 10.1.44.10
  podIP: 10.233.79.245
  podIPs:
    - ip: 10.233.79.245
  startTime: '2021-11-02T11:08:39Z'
  containerStatuses:
    - name: manager
      state:
        running:
          startedAt: '2021-11-02T17:40:30Z'
      lastState:
        terminated:
          exitCode: 137
          reason: OOMKilled
          startedAt: '2021-11-02T11:08:40Z'
          finishedAt: '2021-11-02T17:40:29Z'
          containerID: >-
            docker://d9a012aaadf8fc05ab30bcb1e18eb071ddc648a6e036c8e45a599e7583438b57
      ready: true
      restartCount: 1
      image: 'ghcr.io/fluxcd/helm-controller:v0.12.1'
      imageID: >-
        docker-pullable://ghcr.io/fluxcd/helm-controller@sha256:74b0442a90350b1de9fb34e3180c326d1d7814caa14bf5501750a71a1782d10d
      containerID: >-
        docker://cfd5ac78013a3fb1d80ed4ddff1ae3eb217b8be0dd2a0eff6b37922106ea372e
      started: true
  qosClass: Burstable
spec:
  volumes:
    - name: temp
      emptyDir: {}
    - name: kube-api-access-wzqn6
      projected:
        sources:
          - serviceAccountToken:
              expirationSeconds: 3607
              path: token
          - configMap:
              name: kube-root-ca.crt
              items:
                - key: ca.crt
                  path: ca.crt
          - downwardAPI:
              items:
                - path: namespace
                  fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.namespace
        defaultMode: 420
  containers:
    - name: manager
      image: 'ghcr.io/fluxcd/helm-controller:v0.12.1'
      args:
        - '--events-addr=http://notification-controller/'
        - '--watch-all-namespaces=true'
        - '--log-level=debug'
        - '--log-encoding=json'
        - '--enable-leader-election'
      ports:
        - name: http-prom
          containerPort: 8080
          protocol: TCP
        - name: healthz
          containerPort: 9440
          protocol: TCP
      env:
        - name: RUNTIME_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
      resources:
        limits:
          cpu: '1'
          memory: 1Gi
        requests:
          cpu: 100m
          memory: 64Mi
      volumeMounts:
        - name: temp
          mountPath: /tmp
        - name: kube-api-access-wzqn6
          readOnly: true
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      livenessProbe:
        httpGet:
          path: /healthz
          port: healthz
          scheme: HTTP
        timeoutSeconds: 1
        periodSeconds: 10
        successThreshold: 1
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /readyz
          port: healthz
          scheme: HTTP
        timeoutSeconds: 1
        periodSeconds: 10
        successThreshold: 1
        failureThreshold: 3
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      imagePullPolicy: IfNotPresent
      securityContext:
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
  restartPolicy: Always
  terminationGracePeriodSeconds: 600
  dnsPolicy: ClusterFirst
  nodeSelector:
    kubernetes.io/os: linux
  serviceAccountName: helm-controller
  serviceAccount: helm-controller
  nodeName: sigma01
  securityContext: {}
  imagePullSecrets:
    - name: redacted
  schedulerName: default-scheduler
  tolerations:
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
  priority: 0
  enableServiceLinks: true
  preemptionPolicy: PreemptLowerPriority
```
Steps to reproduce
Not sure how to reproduce; it probably depends on cluster and repository size. Most of the resources (about 20-30) are set to a 1-minute reconciliation interval, along the lines of the sketch below.
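For context, a HelmRelease with that kind of 1-minute interval would look roughly like the following; the names are hypothetical placeholders, not taken from the affected cluster:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: example-app            # hypothetical name
  namespace: default
spec:
  interval: 1m                 # reconcile every minute, as on this cluster
  chart:
    spec:
      chart: example-chart     # hypothetical chart name
      sourceRef:
        kind: HelmRepository
        name: example-repo     # hypothetical HelmRepository
      interval: 1m             # re-fetch the chart source every minute as well
```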
Expected behavior
The helm-controller should run for months without being OOM-killed.
Screenshots and recordings
No response
OS / Distro
Flatcar 2905.2.6
Flux version
flux version 0.21.1
Flux check
```
► checking prerequisites
✗ flux 0.20.1 <0.21.0 (new version is available, please upgrade)
✔ Kubernetes 1.21.5 >=1.19.0-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.12.1
✔ image-automation-controller: deployment ready
► ghcr.io/fluxcd/image-automation-controller:v0.16.0
✔ image-reflector-controller: deployment ready
► ghcr.io/fluxcd/image-reflector-controller:v0.13.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.16.0
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.18.1
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v0.17.1
✔ all checks passed
```
Git provider
No response
Container Registry provider
No response
Additional context
No response
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Possible duplicate of https://github.com/fluxcd/helm-controller/issues/345.
We are gathering details on this at present, as it looks like a recent change has introduced a serious increase in memory usage during operation. The controller itself has not seen any relevant changes besides dependency updates (Helm, K8s, kustomize, controller-runtime). If you happened to run an older Flux version before this that had a lower memory footprint (for some, 0.10.1 performed much better), it would be valuable to know which version; pinning the controller image, as sketched below, is one way to test this.
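For anyone who wants to compare versions: one way to temporarily pin helm-controller to an older release is a kustomize image override in the flux-system Kustomization. This is only a sketch, assuming a standard bootstrap layout, with v0.10.1 as the example tag:

```yaml
# flux-system/kustomization.yaml (standard bootstrap layout assumed)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
images:
  # Pin helm-controller to the release under test; remove when done.
  - name: ghcr.io/fluxcd/helm-controller
    newTag: v0.10.1
```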
Having looked a bit further into it just now, there are two changes that could be pointers:
- Waiting for jobs in v0.11.0: https://github.com/fluxcd/helm-controller/blob/main/CHANGELOG.md#0110
- ~~Change in Helm upgrade behavior introduced in Helm v3.7.0, which v0.12.0 updated to: https://github.com/fluxcd/helm-controller/blob/main/CHANGELOG.md#0120~~

If both of these versions appear to work fine, it will need a much deeper dive.
v0.11.2 seems to misbehave for people as well.
I'm running the latest release, 0.25.2, and have assigned the helm-controller a limit of 2Gi, and it still gets OOM-killed. This is with around 25 HelmReleases on the cluster, reconciling every 5 minutes.
Running helm-controller 0.15.0 with around 20 HelmReleases reconciling every 5 minutes and no resource limits, it reaches 3.5GB of memory and 1 CPU. We removed the limits because we were getting errors on the Helm side whenever the Pod was restarted mid-upgrade.
helm-controller v0.30.0 still seems to have this issue.
Upgrading to Flux 2.1 and configuring Helm index caching should fix this: https://fluxcd.io/flux/installation/configuration/vertical-scaling/#enable-helm-repositories-caching
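For reference, the linked guide enables the cache via extra helm-controller flags. A sketch of the kustomize patch with illustrative values (tune the cache size, TTL, and purge interval to your cluster):

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - target:
      kind: Deployment
      name: helm-controller
    patch: |
      # Keep up to 10 Helm repository indexes cached in memory instead of
      # loading them from scratch on every reconciliation.
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-max-size=10
      # Expire cached indexes after an hour.
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-ttl=60m
      # Scan for and drop expired entries every five minutes.
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-purge-interval=5m
```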