`helm-controller` Pod gets OOM-killed even with 1GB of RAM

zzvara opened this issue on Nov 02 '21

Describe the bug

The title says it all. Here is the Pod definition:

apiVersion: v1
kind: Pod
metadata:
  name: helm-controller-f56848c5-gsd44
  generateName: helm-controller-f56848c5-
  namespace: flux-system
  uid: 5959073e-cf82-4d65-8925-9ece92fb366c
  resourceVersion: '408070363'
  creationTimestamp: '2021-11-02T11:08:39Z'
  labels:
    app: helm-controller
    pod-template-hash: f56848c5
  annotations:
    prometheus.io/port: '8080'
    prometheus.io/scrape: 'true'
  ownerReferences:
    - apiVersion: apps/v1
      kind: ReplicaSet
      name: helm-controller-f56848c5
      uid: e748e195-06e5-411d-acbf-005c180a47ed
      controller: true
      blockOwnerDeletion: true
  managedFields:
    - manager: kube-controller-manager
      operation: Update
      apiVersion: v1
      time: '2021-11-02T11:08:39Z'
      fieldsType: FieldsV1
      fieldsV1:
        'f:metadata':
          'f:annotations':
            .: {}
            'f:prometheus.io/port': {}
            'f:prometheus.io/scrape': {}
          'f:generateName': {}
          'f:labels':
            .: {}
            'f:app': {}
            'f:pod-template-hash': {}
          'f:ownerReferences':
            .: {}
            'k:{"uid":"e748e195-06e5-411d-acbf-005c180a47ed"}':
              .: {}
              'f:apiVersion': {}
              'f:blockOwnerDeletion': {}
              'f:controller': {}
              'f:kind': {}
              'f:name': {}
              'f:uid': {}
        'f:spec':
          'f:containers':
            'k:{"name":"manager"}':
              .: {}
              'f:args': {}
              'f:env':
                .: {}
                'k:{"name":"RUNTIME_NAMESPACE"}':
                  .: {}
                  'f:name': {}
                  'f:valueFrom':
                    .: {}
                    'f:fieldRef':
                      .: {}
                      'f:apiVersion': {}
                      'f:fieldPath': {}
              'f:image': {}
              'f:imagePullPolicy': {}
              'f:livenessProbe':
                .: {}
                'f:failureThreshold': {}
                'f:httpGet':
                  .: {}
                  'f:path': {}
                  'f:port': {}
                  'f:scheme': {}
                'f:periodSeconds': {}
                'f:successThreshold': {}
                'f:timeoutSeconds': {}
              'f:name': {}
              'f:ports':
                .: {}
                'k:{"containerPort":8080,"protocol":"TCP"}':
                  .: {}
                  'f:containerPort': {}
                  'f:name': {}
                  'f:protocol': {}
                'k:{"containerPort":9440,"protocol":"TCP"}':
                  .: {}
                  'f:containerPort': {}
                  'f:name': {}
                  'f:protocol': {}
              'f:readinessProbe':
                .: {}
                'f:failureThreshold': {}
                'f:httpGet':
                  .: {}
                  'f:path': {}
                  'f:port': {}
                  'f:scheme': {}
                'f:periodSeconds': {}
                'f:successThreshold': {}
                'f:timeoutSeconds': {}
              'f:resources':
                .: {}
                'f:limits':
                  .: {}
                  'f:cpu': {}
                  'f:memory': {}
                'f:requests':
                  .: {}
                  'f:cpu': {}
                  'f:memory': {}
              'f:securityContext':
                .: {}
                'f:allowPrivilegeEscalation': {}
                'f:readOnlyRootFilesystem': {}
              'f:terminationMessagePath': {}
              'f:terminationMessagePolicy': {}
              'f:volumeMounts':
                .: {}
                'k:{"mountPath":"/tmp"}':
                  .: {}
                  'f:mountPath': {}
                  'f:name': {}
          'f:dnsPolicy': {}
          'f:enableServiceLinks': {}
          'f:imagePullSecrets':
            .: {}
            'k:{"name":"redacted"}':
              .: {}
              'f:name': {}
          'f:nodeSelector':
            .: {}
            'f:kubernetes.io/os': {}
          'f:restartPolicy': {}
          'f:schedulerName': {}
          'f:securityContext': {}
          'f:serviceAccount': {}
          'f:serviceAccountName': {}
          'f:terminationGracePeriodSeconds': {}
          'f:volumes':
            .: {}
            'k:{"name":"temp"}':
              .: {}
              'f:emptyDir': {}
              'f:name': {}
    - manager: kubelet
      operation: Update
      apiVersion: v1
      time: '2021-11-02T17:40:39Z'
      fieldsType: FieldsV1
      fieldsV1:
        'f:status':
          'f:conditions':
            'k:{"type":"ContainersReady"}':
              .: {}
              'f:lastProbeTime': {}
              'f:lastTransitionTime': {}
              'f:status': {}
              'f:type': {}
            'k:{"type":"Initialized"}':
              .: {}
              'f:lastProbeTime': {}
              'f:lastTransitionTime': {}
              'f:status': {}
              'f:type': {}
            'k:{"type":"Ready"}':
              .: {}
              'f:lastProbeTime': {}
              'f:lastTransitionTime': {}
              'f:status': {}
              'f:type': {}
          'f:containerStatuses': {}
          'f:hostIP': {}
          'f:phase': {}
          'f:podIP': {}
          'f:podIPs':
            .: {}
            'k:{"ip":"10.233.79.245"}':
              .: {}
              'f:ip': {}
          'f:startTime': {}
  selfLink: /api/v1/namespaces/flux-system/pods/helm-controller-f56848c5-gsd44
status:
  phase: Running
  conditions:
    - type: Initialized
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2021-11-02T11:08:39Z'
    - type: Ready
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2021-11-02T17:40:39Z'
    - type: ContainersReady
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2021-11-02T17:40:39Z'
    - type: PodScheduled
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2021-11-02T11:08:39Z'
  hostIP: 10.1.44.10
  podIP: 10.233.79.245
  podIPs:
    - ip: 10.233.79.245
  startTime: '2021-11-02T11:08:39Z'
  containerStatuses:
    - name: manager
      state:
        running:
          startedAt: '2021-11-02T17:40:30Z'
      lastState:
        terminated:
          exitCode: 137
          reason: OOMKilled
          startedAt: '2021-11-02T11:08:40Z'
          finishedAt: '2021-11-02T17:40:29Z'
          containerID: >-
            docker://d9a012aaadf8fc05ab30bcb1e18eb071ddc648a6e036c8e45a599e7583438b57
      ready: true
      restartCount: 1
      image: 'ghcr.io/fluxcd/helm-controller:v0.12.1'
      imageID: >-
        docker-pullable://ghcr.io/fluxcd/helm-controller@sha256:74b0442a90350b1de9fb34e3180c326d1d7814caa14bf5501750a71a1782d10d
      containerID: >-
        docker://cfd5ac78013a3fb1d80ed4ddff1ae3eb217b8be0dd2a0eff6b37922106ea372e
      started: true
  qosClass: Burstable
spec:
  volumes:
    - name: temp
      emptyDir: {}
    - name: kube-api-access-wzqn6
      projected:
        sources:
          - serviceAccountToken:
              expirationSeconds: 3607
              path: token
          - configMap:
              name: kube-root-ca.crt
              items:
                - key: ca.crt
                  path: ca.crt
          - downwardAPI:
              items:
                - path: namespace
                  fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.namespace
        defaultMode: 420
  containers:
    - name: manager
      image: 'ghcr.io/fluxcd/helm-controller:v0.12.1'
      args:
        - '--events-addr=http://notification-controller/'
        - '--watch-all-namespaces=true'
        - '--log-level=debug'
        - '--log-encoding=json'
        - '--enable-leader-election'
      ports:
        - name: http-prom
          containerPort: 8080
          protocol: TCP
        - name: healthz
          containerPort: 9440
          protocol: TCP
      env:
        - name: RUNTIME_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
      resources:
        limits:
          cpu: '1'
          memory: 1Gi
        requests:
          cpu: 100m
          memory: 64Mi
      volumeMounts:
        - name: temp
          mountPath: /tmp
        - name: kube-api-access-wzqn6
          readOnly: true
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      livenessProbe:
        httpGet:
          path: /healthz
          port: healthz
          scheme: HTTP
        timeoutSeconds: 1
        periodSeconds: 10
        successThreshold: 1
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /readyz
          port: healthz
          scheme: HTTP
        timeoutSeconds: 1
        periodSeconds: 10
        successThreshold: 1
        failureThreshold: 3
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      imagePullPolicy: IfNotPresent
      securityContext:
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
  restartPolicy: Always
  terminationGracePeriodSeconds: 600
  dnsPolicy: ClusterFirst
  nodeSelector:
    kubernetes.io/os: linux
  serviceAccountName: helm-controller
  serviceAccount: helm-controller
  nodeName: sigma01
  securityContext: {}
  imagePullSecrets:
    - name: redacted
  schedulerName: default-scheduler
  tolerations:
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
  priority: 0
  enableServiceLinks: true
  preemptionPolicy: PreemptLowerPriority

Steps to reproduce

Not sure how to reproduce; it probably depends on cluster and repository size. Most of the resources (about 20-30 HelmReleases) are set to a 1-minute reconciliation interval.
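
For illustration, the HelmReleases involved look roughly like the sketch below (the names, chart and values are hypothetical placeholders, not the actual resources); there are about 20-30 of these reconciling every minute:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: example-app            # hypothetical name
  namespace: default
spec:
  interval: 1m                 # 1-minute reconciliation, as described above
  chart:
    spec:
      chart: example-chart     # hypothetical chart name
      version: '1.x'
      sourceRef:
        kind: HelmRepository
        name: example-repo     # hypothetical HelmRepository
        namespace: flux-system
  values:
    replicaCount: 1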

Expected behavior

The helm-controller should run for months without being OOM-killed.

Screenshots and recordings

No response

OS / Distro

Flatcar 2905.2.6

Flux version

flux version 0.21.1

Flux check

► checking prerequisites
✗ flux 0.20.1 <0.21.0 (new version is available, please upgrade)
✔ Kubernetes 1.21.5 >=1.19.0-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.12.1
✔ image-automation-controller: deployment ready
► ghcr.io/fluxcd/image-automation-controller:v0.16.0
✔ image-reflector-controller: deployment ready
► ghcr.io/fluxcd/image-reflector-controller:v0.13.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.16.0
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.18.1
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v0.17.1
✔ all checks passed

Git provider

No response

Container Registry provider

No response

Additional context

No response

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

zzvara · Nov 02 '21 19:11

Possible duplicate of https://github.com/fluxcd/helm-controller/issues/345.

We are gathering details on this at present, as it looks like a recent change has introduced a serious increase in memory usage during operation. The controller itself has not seen any relevant changes besides dependency updates (Helm, Kubernetes, kustomize, controller-runtime). If you happened to run an older Flux version before this that had a lower memory footprint (for some, 0.10.1 performed much better), it would be valuable for me to know which version that was.

Having looked a bit further into it just now, there are two changes that could be pointers:

  • Waiting for jobs in v0.11.0 https://github.com/fluxcd/helm-controller/blob/main/CHANGELOG.md#0110
  • ~~Change in Helm upgrade behavior introduced in Helm v3.7.0 and adopted in v0.12.0 https://github.com/fluxcd/helm-controller/blob/main/CHANGELOG.md#0120~~

If both of these versions appear to work fine, it will need a much deeper dive.

v0.11.2 seems to misbehave for people as well.

hiddeco · Nov 02 '21 21:11

I'm running the latest release, 0.25.2, and have assigned the helm-controller a limit of 2Gi, and it's still getting OOM-killed. This is with around 25 HelmReleases on the cluster, each reconciled every 5 minutes.
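
For reference, a limit like that is typically raised with a patch in the flux-system kustomization; a minimal sketch, assuming the standard bootstrap layout (gotk-components.yaml / gotk-sync.yaml):

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: helm-controller
        namespace: flux-system
      spec:
        template:
          spec:
            containers:
              - name: manager
                resources:
                  limits:
                    memory: 2Gi   # raised memory limit for the manager container
    target:
      kind: Deployment
      name: helm-controller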

barrydobson · Jan 18 '22 19:01

We are running helm-controller 0.15.0 with around 20 HelmReleases, reconciled every 5 minutes, without resource limits, and it reaches 3.5GB of memory and 1 CPU. We removed the limits because we were getting errors on the Helm side if the pod was restarted while upgrading.

CosminBriscaru · Jan 25 '22 14:01

helm-controller v0.30.0 still seems to have this issue.

applike-ss · Mar 01 '23 10:03

Upgrading to Flux 2.1 and configuring Helm index caching should fix this: https://fluxcd.io/flux/installation/configuration/vertical-scaling/#enable-helm-repositories-caching
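
A minimal sketch of the patch described in that guide, assuming the caching flags documented there (--helm-cache-max-size, --helm-cache-ttl, --helm-cache-purge-interval) applied to the source-controller Deployment via the standard flux-system bootstrap kustomization; the values are the guide's examples and should be tuned per cluster:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - patch: |
      # Cache Helm repository indexes in memory instead of re-reading them on every reconciliation.
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-max-size=10
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-ttl=60m
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-purge-interval=5m
    target:
      kind: Deployment
      name: source-controller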

stefanprodan · Oct 11 '23 06:10