
image-reflector-controller restarts due to OOMKilled

Andrea-Gallicchio opened this issue 1 year ago • 4 comments

Describe the bug

I run Flux on AWS EKS 1.21.5. I've noticed that, since the last Flux update, the image-reflector-controller pod is sometimes restarted due to OOMKilled, even though it has relatively high CPU and memory requests/limits. The number of Helm releases is between 30 and 40.

  • CPU Request: 0.05
  • CPU Limit: 0.1
  • CPU Average Usage: 0.006
  • Memory Request: 384 MiB
  • Memory Limit: 640 MiB
  • Memory Average Usage: 187 MB
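For reference, the restart reason and the live usage can be cross-checked with commands along these lines (a sketch assuming the default flux-system namespace and the app=image-reflector-controller label shown in the deployment below; kubectl top requires metrics-server):

kubectl -n flux-system top pod -l app=image-reflector-controller
kubectl -n flux-system get pod -l app=image-reflector-controller \
  -o jsonpath='{.items[*].status.containerStatuses[*].lastState.terminated.reason}'

The second command prints OOMKilled when the container was killed for exceeding its memory limit.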

Steps to reproduce

N/A

Expected behavior

I expect the image-reflector-controller not to be restarted due to OOMKilled.

Screenshots and recordings

No response

OS / Distro

N/A

Flux version

v0.31.3

Flux check

► checking prerequisites
✔ Kubernetes 1.21.12-eks-a64ea69 >=1.20.6-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.21.0
✔ image-automation-controller: deployment ready
► ghcr.io/fluxcd/image-automation-controller:v0.22.1
✔ image-reflector-controller: deployment ready
► ghcr.io/fluxcd/image-reflector-controller:v0.18.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.25.0
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.23.5
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v0.24.4
► checking crds
✔ alerts.notification.toolkit.fluxcd.io/v1beta1
✔ buckets.source.toolkit.fluxcd.io/v1beta1
✔ gitrepositories.source.toolkit.fluxcd.io/v1beta1
✔ helmcharts.source.toolkit.fluxcd.io/v1beta1
✔ helmreleases.helm.toolkit.fluxcd.io/v2beta1
✔ helmrepositories.source.toolkit.fluxcd.io/v1beta1
✔ imagepolicies.image.toolkit.fluxcd.io/v1beta1
✔ imagerepositories.image.toolkit.fluxcd.io/v1beta1
✔ imageupdateautomations.image.toolkit.fluxcd.io/v1beta1
✔ kustomizations.kustomize.toolkit.fluxcd.io/v1beta1
✔ providers.notification.toolkit.fluxcd.io/v1beta1
✔ receivers.notification.toolkit.fluxcd.io/v1beta1
✔ all checks passed

Git provider

No response

Container Registry provider

No response

Additional context

No response

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

Andrea-Gallicchio · Jul 13 '22 10:07

The image-reflector-controller has nothing to do with Helm. Can you please post here the output of kubectl describe deployment for the controller that runs into OOM?
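For example (assuming the default flux-system namespace used by a standard bootstrap):

kubectl -n flux-system describe deployment image-reflector-controller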

stefanprodan · Jul 13 '22 10:07

Name:                   image-reflector-controller
Namespace:              flux-system
CreationTimestamp:      Thu, 23 Dec 2021 11:29:24 +0100
Labels:                 app.kubernetes.io/instance=flux-system
                        app.kubernetes.io/part-of=flux
                        app.kubernetes.io/version=v0.30.2
                        control-plane=controller
                        kustomize.toolkit.fluxcd.io/name=flux-system
                        kustomize.toolkit.fluxcd.io/namespace=flux-system
Annotations:            deployment.kubernetes.io/revision: 6
Selector:               app=image-reflector-controller
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app=image-reflector-controller
  Annotations:      prometheus.io/port: 8080
                    prometheus.io/scrape: true
  Service Account:  image-reflector-controller
  Containers:
   manager:
    Image:       ghcr.io/fluxcd/image-reflector-controller:v0.18.0
    Ports:       8080/TCP, 9440/TCP
    Host Ports:  0/TCP, 0/TCP
    Args:
      --events-addr=http://notification-controller.flux-system.svc.cluster.local./
      --watch-all-namespaces=true
      --log-level=info
      --log-encoding=json
      --enable-leader-election
    Limits:
      cpu:     100m
      memory:  640Mi
    Requests:
      cpu:      50m
      memory:   384Mi
    Liveness:   http-get http://:healthz/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:healthz/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      RUNTIME_NAMESPACE:   (v1:metadata.namespace)
    Mounts:
      /data from data (rw)
      /tmp from temp (rw)
  Volumes:
   temp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
   data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Progressing    True    NewReplicaSetAvailable
  Available      True    MinimumReplicasAvailable
OldReplicaSets:  <none>
NewReplicaSet:   image-reflector-controller-db97c765d (1/1 replicas created)
Events:          <none>

Andrea-Gallicchio · Jul 13 '22 12:07

@Andrea-Gallicchio can you confirm whether there was anything abnormal in the logs just before the OOM occurred?
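For example, the logs of the OOM-killed container instance can usually be retrieved with something like this (a sketch, assuming the pod restarted in place so the previous container's logs are still available):

kubectl -n flux-system logs deployment/image-reflector-controller --previous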

pjbgf · Aug 08 '22 16:08