image-reflector-controller
Image-reflector-controller restarts due to OOM Killed
Describe the bug
I run Flux on AWS EKS 1.21.5. Since the last Flux update, I've noticed that the image-reflector-controller pod is sometimes restarted because it gets OOM Killed, even though it has generous CPU and memory requests/limits. The number of Helm Releases is between 30 and 40.
- CPU Request: 0.05
- CPU Limit: 0.1
- CPU Average Usage: 0.006
- Memory Request: 384 MB
- Memory Limit: 640 MB
- Memory Average Usage: 187 MB
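For reference, raising these limits is usually done with a Kustomize patch in the flux-system overlay. Below is a minimal sketch assuming the standard flux bootstrap layout (gotk-components.yaml / gotk-sync.yaml); the file layout is an assumption, with the current values from the deployment filled in:

```yaml
# flux-system/kustomization.yaml (sketch only; assumes the standard flux bootstrap layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  # Strategic-merge patch overriding the controller's container resources
  - target:
      kind: Deployment
      name: image-reflector-controller
    patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: image-reflector-controller
        namespace: flux-system
      spec:
        template:
          spec:
            containers:
              - name: manager
                resources:
                  requests:
                    cpu: 50m
                    memory: 384Mi
                  limits:
                    cpu: 100m
                    memory: 640Mi   # raise this if the controller keeps getting OOM-killed
```

Increasing the memory value under limits in that patch and letting the flux-system Kustomization reconcile would apply a higher ceiling to the controller.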
Steps to reproduce
N/A
Expected behavior
I expect the image-reflector-controller not to restart due to OOM Killed.
Screenshots and recordings
No response
OS / Distro
N/A
Flux version
v0.31.3
Flux check
► checking prerequisites
✔ Kubernetes 1.21.12-eks-a64ea69 >=1.20.6-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.21.0
✔ image-automation-controller: deployment ready
► ghcr.io/fluxcd/image-automation-controller:v0.22.1
✔ image-reflector-controller: deployment ready
► ghcr.io/fluxcd/image-reflector-controller:v0.18.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.25.0
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.23.5
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v0.24.4
► checking crds
✔ alerts.notification.toolkit.fluxcd.io/v1beta1
✔ buckets.source.toolkit.fluxcd.io/v1beta1
✔ gitrepositories.source.toolkit.fluxcd.io/v1beta1
✔ helmcharts.source.toolkit.fluxcd.io/v1beta1
✔ helmreleases.helm.toolkit.fluxcd.io/v2beta1
✔ helmrepositories.source.toolkit.fluxcd.io/v1beta1
✔ imagepolicies.image.toolkit.fluxcd.io/v1beta1
✔ imagerepositories.image.toolkit.fluxcd.io/v1beta1
✔ imageupdateautomations.image.toolkit.fluxcd.io/v1beta1
✔ kustomizations.kustomize.toolkit.fluxcd.io/v1beta1
✔ providers.notification.toolkit.fluxcd.io/v1beta1
✔ receivers.notification.toolkit.fluxcd.io/v1beta1
✔ all checks passed
Git provider
No response
Container Registry provider
No response
Additional context
No response
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
The `image-reflector-controller` has nothing to do with Helm. Can you please post here the `kubectl describe deployment` output for the controller that runs into OOM?
Name: image-reflector-controller
Namespace: flux-system
CreationTimestamp: Thu, 23 Dec 2021 11:29:24 +0100
Labels: app.kubernetes.io/instance=flux-system
app.kubernetes.io/part-of=flux
app.kubernetes.io/version=v0.30.2
control-plane=controller
kustomize.toolkit.fluxcd.io/name=flux-system
kustomize.toolkit.fluxcd.io/namespace=flux-system
Annotations: deployment.kubernetes.io/revision: 6
Selector: app=image-reflector-controller
Replicas: 1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: app=image-reflector-controller
Annotations: prometheus.io/port: 8080
prometheus.io/scrape: true
Service Account: image-reflector-controller
Containers:
manager:
Image: ghcr.io/fluxcd/image-reflector-controller:v0.18.0
Ports: 8080/TCP, 9440/TCP
Host Ports: 0/TCP, 0/TCP
Args:
--events-addr=http://notification-controller.flux-system.svc.cluster.local./
--watch-all-namespaces=true
--log-level=info
--log-encoding=json
--enable-leader-election
Limits:
cpu: 100m
memory: 640Mi
Requests:
cpu: 50m
memory: 384Mi
Liveness: http-get http://:healthz/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:healthz/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
RUNTIME_NAMESPACE: (v1:metadata.namespace)
Mounts:
/data from data (rw)
/tmp from temp (rw)
Volumes:
temp:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
Conditions:
Type Status Reason
---- ------ ------
Progressing True NewReplicaSetAvailable
Available True MinimumReplicasAvailable
OldReplicaSets: <none>
NewReplicaSet: image-reflector-controller-db97c765d (1/1 replicas created)
Events: <none>
@Andrea-Gallicchio can you confirm whether there was anything abnormal in the logs just before the OOM occurred?