OOMKilled in vulnerability scan job

Open dirien opened this issue 2 years ago • 31 comments

What steps did you take and what happened:

Since we migrated from starboard-operator to trivy-operator, we now see many jobs terminating with OOMKilled in the trivy-operator log:

{"level":"error","ts":1658230025.0035832,"logger":"reconciler.vulnerabilityreport","msg":"Scan job container","job":"security/scan-vulnerabilityreport-5d76c6d6d8","container":"teleport","status.reason":"OOMKilled","status.message":"","stacktrace":"github.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport.(*WorkloadController).reconcileJobs.func1\n\t/home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller.go:363\nsigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/reconcile/reconcile.go:102\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234"}

We also increased the cpu and memory limit to:

resources:
  limits:
    memory: 2Gi
    cpu: 2

But that did not help.

We run Trivy in ClientServer mode.

What did you expect to happen:

That there is no OOM error in the job.

Environment:

  • Trivy-Operator version (use trivy-operator version): 0.1.3
  • Kubernetes version (use kubectl version): 1.20

dirien avatar Jul 19 '22 11:07 dirien

@dirien can you please share more info:

  • are the resource requests for the vulnerability scan job left at their defaults?
  • how many vulnerability scan jobs are running on the Node where the problem occurs?
  • how much memory is available on the Node?

chen-keinan avatar Jul 19 '22 11:07 chen-keinan

Hi @chen-keinan,

ofc!

  • the request is the default
  • just this one; all the others run through
  • 12GB

dirien avatar Jul 19 '22 12:07 dirien

Thanks. Can you please run kubectl describe pod [name] and post the output here?

chen-keinan avatar Jul 19 '22 12:07 chen-keinan

from the trivy-operator?

dirien avatar Jul 19 '22 12:07 dirien

On the pod (vulnerability scan job) that has the OOM issue.

There is a job that runs the vulnerability scan and went OOM. By default it runs in the trivy-system namespace. I need you to run the command above on that pod.

chen-keinan avatar Jul 19 '22 12:07 chen-keinan

Unfortunately it gets deleted too quickly :(

❯ k describe pod scan-vulnerabilityreport-55787d6c98-bq4zn -n security
Error from server (NotFound): pods "scan-vulnerabilityreport-55787d6c98-bq4zn" not found

dirien avatar Jul 19 '22 13:07 dirien

I hope you'll have better luck catching it next time; it would help to track down the root cause of the OOM. In the meantime I will try to look for it myself.
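Since the scan pod disappears quickly, one option is to capture `kubectl get pods -n <namespace> -o json` while the job is running and inspect the saved output afterwards. The following is only a minimal sketch under that assumption: the function name `oom_killed_in_pods` is hypothetical, and it relies on the standard Pod status fields (`containerStatuses`, `initContainerStatuses`, `state.terminated.reason`).

```python
from typing import Any, Dict, List, Tuple


def oom_killed_in_pods(pod_list: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (pod name, container name) pairs terminated with reason OOMKilled.

    `pod_list` is the parsed JSON of `kubectl get pods -o json` (a PodList).
    """
    hits: List[Tuple[str, str]] = []
    for pod in pod_list.get("items", []):
        pod_name = pod.get("metadata", {}).get("name", "")
        status = pod.get("status", {})
        # Check init containers as well as regular containers.
        statuses = (status.get("initContainerStatuses") or []) + (
            status.get("containerStatuses") or []
        )
        for cs in statuses:
            # A container may report the kill in its current or previous state.
            for key in ("state", "lastState"):
                terminated = (cs.get(key) or {}).get("terminated")
                if terminated and terminated.get("reason") == "OOMKilled":
                    hits.append((pod_name, cs.get("name", "")))
                    break
    return hits
```

To use it, load the saved output with `json.load` and pass the resulting dictionary to the function.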

chen-keinan avatar Jul 19 '22 13:07 chen-keinan

I think this has nothing to do with luck, we need this feature IMO: https://github.com/aquasecurity/trivy-operator/issues/228 😉

erikgb avatar Jul 19 '22 13:07 erikgb

@erikgb go ahead and pick it up; I assume it's more important than the other PRs. And let's make it configurable (whether to use it or not).

chen-keinan avatar Jul 19 '22 13:07 chen-keinan

Would be awesome!

dirien avatar Jul 19 '22 13:07 dirien

I would love to, but I am waiting for another PR that makes it easier to test things before suggesting new features. 😊

erikgb avatar Jul 19 '22 13:07 erikgb

I also encountered the same problem: the memory usage of the scan job did exceed the limit set in the K8s resource quotas. [Screenshot: Screen Shot 2022-08-12 at 2 18 34 PM]

smalltown avatar Aug 12 '22 06:08 smalltown

Hi there! I have the same OOMKilled problem, but in my case the scan-vulnerabilityreport-* pod gets killed almost every time, and each time it happens during the DB download:

kubectl -n trivy-system describe pod/scan-vulnerabilityreport-6ff5d9956f-7f9bt
...skipped...
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  58s   default-scheduler  Successfully assigned trivy-system/scan-vulnerabilityreport-6ff5d9956f-7f9bt to minikube
  Normal  Pulled     57s   kubelet            Container image "ghcr.io/aquasecurity/trivy:0.30.0" already present on machine
  Normal  Created    57s   kubelet            Created container 38b439b5-8f5d-409c-8211-c8fe84733bf9
  Normal  Started    57s   kubelet            Started container 38b439b5-8f5d-409c-8211-c8fe84733bf9
  Normal  Pulled     29s   kubelet            Container image "ghcr.io/aquasecurity/trivy:0.30.0" already present on machine
  Normal  Created    28s   kubelet            Created container hello-world
  Normal  Started    28s   kubelet            Started container hello-world

kubectl -n trivy-system logs -l app.kubernetes.io/name=trivy-operator
...skipped...
{"level":"error","ts":1662634462.1451323,"logger":"reconciler.vulnerabilityreport","msg":"Scan job container","job":"trivy-system/scan-vulnerabilityreport-6ff5d9956f","container":"hello-world","status.reason":"OOMKilled","status.message":"Killed\n","stacktrace":"github.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport.(*WorkloadController).reconcileJobs.func1\n\t/home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller.go:381\nsigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/reconcile/reconcile.go:102\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234"}

Loki query {pod="scan-vulnerabilityreport-6ff5d9956f-7f9bt"} |= ``
2022-09-08 13:53:29 2022-09-08T10:53:29.745Z	INFO	Need to update DB
2022-09-08 13:53:29 2022-09-08T10:53:29.745Z	INFO	DB Repository: ghcr.io/aquasecurity/trivy-db
2022-09-08 13:53:29 2022-09-08T10:53:29.745Z	INFO	Downloading DB...
2022-09-08 13:53:40 707.62 KiB / 33.81 MiB [->___________________________________________________________] 2.04% ? p/s ?
...skipped...
33.81 MiB / 33.81 MiB [--------------------------------------------------] 100.00% 3.78 MiB p/s 9.1s
2022-09-08 13:54:19 Killed

The pod consumes 477M before it gets killed. After increasing the memory limit to 1500M, the OOM kills went away.

kubectl -n trivy-system edit cm trivy-operator-trivy-config
apiVersion: v1
data:
  ...skipped...
  trivy.resources.limits.cpu: 1500m
  trivy.resources.limits.memory: 1500M
  ...skipped...
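For anyone who prefers a non-interactive change over `kubectl edit`, the same two keys could be sent as a merge patch. The following is only a sketch that builds the JSON payload; the helper name `trivy_limits_patch` is hypothetical, while the key names and values are the ones shown above.

```python
import json


def trivy_limits_patch(cpu: str, memory: str) -> str:
    """Build a JSON merge patch for the trivy-operator-trivy-config ConfigMap."""
    patch = {
        "data": {
            # Key names are taken from the ConfigMap excerpt above.
            "trivy.resources.limits.cpu": cpu,
            "trivy.resources.limits.memory": memory,
        }
    }
    return json.dumps(patch)


# Payload for the values that made the OOM kills go away in this report.
payload = trivy_limits_patch("1500m", "1500M")
print(payload)
```

The printed string can then be passed to `kubectl patch cm trivy-operator-trivy-config -n trivy-system --type merge -p '<payload>'`.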

I also noticed that each scan-vulnerabilityreport-* pod downloads the complete DB, so 10 parallel scans (the default) do a lot of redundant work.

P.S. I don't think that increasing the limit is a real solution in this case.

SergeyBear avatar Sep 08 '22 13:09 SergeyBear

@SergeyBear this is the first time an OOM issue has been reported during the DB download. It's strange, as the DB is relatively small.

Can you please share:

  • how many scan jobs are running on the specific node that has the OOM issue?
  • how much memory is defined for that node?

As a workaround I would suggest moving to client/server mode, where the trivy DB is downloaded only once, on the server side.

chen-keinan avatar Sep 08 '22 13:09 chen-keinan

@chen-keinan At the beginning I used a fresh minikube cluster with 3 CPUs and 6 gigs of RAM, then deployed trivy-operator and a dummy nodejs app on it and started to get OOMKilled, even when almost all 3 cores and 6 gigs were free.

Then I installed the prometheus and loki stack to catch the OOM, but there are still plenty of free resources:

  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                1450m (48%)  900m (30%)
  memory             816Mi (13%)  782Mi (13%)
  ephemeral-storage  0 (0%)       0 (0%)
  hugepages-2Mi      0 (0%)       0 (0%)

It would be nice if someone with the same OOMKilled issue could check whether the OOM appears during the DB download and try increasing the limit...

SergeyBear avatar Sep 08 '22 14:09 SergeyBear

@chen-keinan sorry, I forgot to mention: it appears even with a single scan job. After increasing the trivy limits the OOM is gone and the reports complete with no errors. P.S. It is strange that downloading could cause an OOM, but I checked more than ten times and it happened during the download every time...

SergeyBear avatar Sep 08 '22 14:09 SergeyBear

Thanks for sharing this info. We are investigating the scan job OOM issue (during the scanning process); I'll post an update shortly, once we have completed the investigation.

chen-keinan avatar Sep 08 '22 14:09 chen-keinan

@chen-keinan I deployed trivy-server and set trivy-operator to ClientServer mode:

  • with a 500M memory limit the scan-vulnerabilityreport-* pod gets killed with empty logs
  • with a 1500M memory limit the scan-vulnerabilityreport-* pod completes fine; no DB update is performed
  • with OPERATOR_LOG_DEV_MODE enabled, no OOMKilled shows up in the logs and scan-vulnerabilityreport-* is NOT killed even with a 500M memory limit... which is weird, but I switched it between true and false five times and the result was always the same: true (completed) / false (OOMKilled)

SergeyBear avatar Sep 09 '22 18:09 SergeyBear

Which version of Trivy are you guys using? Did anybody try v0.31.3? We added some improvements in v0.30.1.

knqyf263 avatar Sep 13 '22 14:09 knqyf263

@knqyf263 I'm using trivy-operator 0.1.9, which uses the trivy 0.30.0 image

SergeyBear avatar Sep 13 '22 15:09 SergeyBear

@SergeyBear I have upgraded the trivy version to 0.31.3; it will be released with the next trivy-operator version

chen-keinan avatar Sep 14 '22 18:09 chen-keinan

Hello there,

I think I am experiencing the same issue: pretty much all of the scan-vulnerabilityreport jobs fail, but after downloading the DB. I can't get any container logs.

:information_source: I have installed the trivy-operator helm chart v0.1.9 in Standalone mode, with the default resource definitions.

See the trivy-operator logs here:

{"level":"error","ts":1663233467.890931,"logger":"reconciler.vulnerabilityreport","msg":"Scan job container","job":"trivy-system/scan-vulnerabilityreport-f48c5d464","container":"prometheus","status.reason":"OOMKilled","status.message":"Killed\n","stacktrace":"github.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport.(*WorkloadController).reconcileJobs.func1\n\t/home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller.go:381\nsigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/reconcile/reconcile.go:102\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234"}
{"level":"error","ts":1663233467.8910558,"logger":"reconciler.vulnerabilityreport","msg":"Scan job container","job":"trivy-system/scan-vulnerabilityreport-f48c5d464","container":"thanos-sidecar","status.reason":"OOMKilled","status.message":"Killed\n","stacktrace":"github.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport.(*WorkloadController).reconcileJobs.func1\n\t/home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller.go:381\nsigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/reconcile/reconcile.go:102\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234"}
{"level":"error","ts":1663233470.1496108,"logger":"reconciler.vulnerabilityreport","msg":"Scan job container","job":"trivy-system/scan-vulnerabilityreport-75d8fc986c","container":"loki","status.reason":"OOMKilled","status.message":"Killed\n","stacktrace":"github.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport.(*WorkloadController).reconcileJobs.func1\n\t/home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller.go:381\nsigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/reconcile/reconcile.go:102\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234"}
{"level":"error","ts":1663233470.7312384,"logger":"reconciler.vulnerabilityreport","msg":"Scan job container","job":"trivy-system/scan-vulnerabilityreport-ffc4644dd","container":"alertmanager","status.reason":"OOMKilled","status.message":"Killed\n","stacktrace":"github.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport.(*WorkloadController).reconcileJobs.func1\n\t/home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller.go:381\nsigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/reconcile/reconcile.go:102\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234"}
{"level":"error","ts":1663233471.7152555,"logger":"reconciler.vulnerabilityreport","msg":"Scan job container","job":"trivy-system/scan-vulnerabilityreport-8ddcf4c66","container":"kube-prometheus-stack","status.reason":"OOMKilled","status.message":"Killed\n","stacktrace":"github.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport.(*WorkloadController).reconcileJobs.func1\n\t/home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller.go:381\nsigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/reconcile/reconcile.go:102\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234"}
{"level":"error","ts":1663233472.6912272,"logger":"reconciler.vulnerabilityreport","msg":"Scan job container","job":"trivy-system/scan-vulnerabilityreport-564bf7dc9c","container":"project-xxx","status.reason":"OOMKilled","status.message":"Killed\n","stacktrace":"github.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport.(*WorkloadController).reconcileJobs.func1\n\t/home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller.go:381\nsigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/reconcile/reconcile.go:102\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234"}
{"level":"error","ts":1663233477.8117092,"logger":"reconciler.vulnerabilityreport","msg":"Scan job container","job":"trivy-system/scan-vulnerabilityreport-74f98796c5","container":"bdd-postgis","status.reason":"OOMKilled","status.message":"Killed\n","stacktrace":"github.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport.(*WorkloadController).reconcileJobs.func1\n\t/home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller.go:381\nsigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/reconcile/reconcile.go:102\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234"}
{"level":"error","ts":1663233480.5189712,"logger":"reconciler.vulnerabilityreport","msg":"Scan job container","job":"trivy-system/scan-vulnerabilityreport-6d99dbdb57","container":"cert-manager","status.reason":"OOMKilled","status.message":"Killed\n","stacktrace":"github.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport.(*WorkloadController).reconcileJobs.func1\n\t/home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller.go:381\nsigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/reconcile/reconcile.go:102\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234"}
{"level":"error","ts":1663233488.7973578,"logger":"reconciler.vulnerabilityreport","msg":"Scan job container","job":"trivy-system/scan-vulnerabilityreport-766578d848","container":"promtail","status.reason":"OOMKilled","status.message":"Killed\n","stacktrace":"github.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport.(*WorkloadController).reconcileJobs.func1\n\t/home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller.go:381\nsigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/reconcile/reconcile.go:102\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234"}
{"level":"error","ts":1663233525.286571,"logger":"reconciler.vulnerabilityreport","msg":"Scan job container","job":"trivy-system/scan-vulnerabilityreport-5d7cb8455b","container":"kyverno","status.reason":"OOMKilled","status.message":"Killed\n","stacktrace":"github.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport.(*WorkloadController).reconcileJobs.func1\n\t/home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller.go:381\nsigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/reconcile/reconcile.go:102\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234"}
{"level":"error","ts":1663233525.2866511,"logger":"reconciler.vulnerabilityreport","msg":"Scan job container","job":"trivy-system/scan-vulnerabilityreport-5d7cb8455b","container":"kyverno-pre","status.reason":"OOMKilled","status.message":"Killed\n","stacktrace":"github.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport.(*WorkloadController).reconcileJobs.func1\n\t/home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller.go:381\nsigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/reconcile/reconcile.go:102\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234"}
{"level":"error","ts":1663233540.7202878,"logger":"reconciler.vulnerabilityreport","msg":"Scan job container","job":"trivy-system/scan-vulnerabilityreport-5489cbf97c","container":"cluster-register","status.reason":"OOMKilled","status.message":"Killed\n","stacktrace":"github.com/aquasecurity/trivy-operator/pkg/vulnerabilityreport.(*WorkloadController).reconcileJobs.func1\n\t/home/runner/work/trivy-operator/trivy-operator/pkg/vulnerabilityreport/controller.go:381\nsigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/reconcile/reconcile.go:102\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234"}
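With this many records, it can help to reduce the operator log to just the affected jobs and containers. The following is a minimal sketch, assuming the log is available as one JSON object per line with the `job`, `container` and `status.reason` fields seen above; the function name `oom_killed_containers` is hypothetical.

```python
import json
from typing import Iterable, Iterator, Tuple


def oom_killed_containers(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (job, container) for every log record with status.reason == OOMKilled."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # not a JSON log record
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or malformed record
        if record.get("status.reason") == "OOMKilled":
            yield record.get("job", ""), record.get("container", "")


# Shortened records modelled on the log lines above (stack traces omitted).
sample_log = [
    '{"level":"error","job":"trivy-system/scan-vulnerabilityreport-f48c5d464",'
    '"container":"prometheus","status.reason":"OOMKilled"}',
    '{"level":"error","job":"trivy-system/scan-vulnerabilityreport-f48c5d464",'
    '"container":"thanos-sidecar","status.reason":"OOMKilled"}',
    '{"level":"info","msg":"unrelated record"}',
]

for job, container in oom_killed_containers(sample_log):
    print(f"{job}\t{container}")
```

The same function accepts any iterable of lines, for example an open file holding the output of `kubectl -n trivy-system logs -l app.kubernetes.io/name=trivy-operator`.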

LucasVanHaaren avatar Sep 15 '22 09:09 LucasVanHaaren

@LucasVanHaaren you can try overriding the trivy version with:

kubectl patch cm trivy-operator-trivy-config -n trivy-system \
  --type merge \
  -p "$(cat <<EOF
{
  "data": {
    "trivy.imageRef": "ghcr.io/aquasecurity/trivy:0.31.3"
  }
}
EOF
)"

or wait for trivy-operator v0.2.0, which includes this version by default

chen-keinan avatar Sep 15 '22 09:09 chen-keinan

Thanks for your response!

I just tried upgrading to the trivy:0.31.3 image and the scan-vulnerabilityreport jobs are still OOMKilled... I also tried increasing the memory limit to 1GiB and it's the same, nothing changes.

Could it be that the nodes are too small? I run a managed cluster with 2 worker nodes with 4 CPUs and 14GiB of memory each. It doesn't host many apps and doesn't seem overwhelmed.

LucasVanHaaren avatar Sep 15 '22 12:09 LucasVanHaaren

It could be; it depends on the amount of workload running on your node. You need to check the sum of the memory limits of all of your workloads; it must not exceed the Node memory. Note that trivy-operator can by default run up to 10 (configurable) scan jobs in parallel, so that needs to be taken into consideration as well.
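The rule of thumb above can be checked with some quick arithmetic. The following is a rough sketch only: it handles a subset of Kubernetes quantity suffixes, assumes the worst case in which every parallel scan job lands on the same node, and uses a hypothetical figure of 10Gi for the limits of the other workloads. The function names are hypothetical as well.

```python
# Multipliers for the quantity suffixes that appear in this thread.
# Only a subset of the Kubernetes quantity grammar is handled (an assumption).
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "k": 10**3, "M": 10**6, "G": 10**9}


def parse_quantity(quantity: str) -> int:
    """Convert a memory quantity such as '1500M' or '14Gi' to bytes."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # plain byte count


def scan_jobs_worst_case(limit: str, concurrent_jobs: int) -> int:
    """Bytes needed if every parallel scan job reaches its memory limit."""
    return parse_quantity(limit) * concurrent_jobs


def fits_on_node(node_memory: str, other_limits: str, limit: str, concurrent_jobs: int) -> bool:
    """True if the other workloads' limits plus all scan jobs fit into node memory."""
    needed = parse_quantity(other_limits) + scan_jobs_worst_case(limit, concurrent_jobs)
    return needed <= parse_quantity(node_memory)


# 14Gi node as described above; "10Gi" for other workloads is a made-up figure.
print(fits_on_node("14Gi", "10Gi", "1500M", 10))
print(fits_on_node("14Gi", "10Gi", "1500M", 2))
```

Under these assumptions, lowering the number of parallel scan jobs has a much larger effect on the worst case than the exact per-job limit.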

chen-keinan avatar Sep 15 '22 12:09 chen-keinan

A 1 gig limit is too low. I only managed to get rid of OOMKilled with a 1.5 gig memory limit, provided you have enough free memory in the cluster, of course.

SergeyBear avatar Sep 15 '22 18:09 SergeyBear

Also try reducing the number of parallel scan jobs in the operator.

SergeyBear avatar Sep 15 '22 18:09 SergeyBear

@chen-keinan After checking it, I confirm that the sum of the memory limits exceeds the amount of memory of the node.

@SergeyBear Thanks for the tips. I will try it soon with a 1.5GiB memory limit and only 2 parallel scan jobs, because cluster reliability is much more important than the speed of image scanning!

I will also use a trivy server instance to avoid downloading the DB every time; maybe this will help too.

LucasVanHaaren avatar Sep 16 '22 12:09 LucasVanHaaren

Hey everybody, I applied all your tips (using a dedicated trivy server instance, setting a 1.5G memory limit and reducing the parallel scan jobs to 2) and now it works!

PS: I sometimes see more than 2 scan jobs running in parallel, but I no longer get any OOMKilled, so it's great.

Thanks a lot :smiley:

LucasVanHaaren avatar Sep 21 '22 15:09 LucasVanHaaren

I just checked the latest trivy-operator 0.3.0 on a fresh minikube cluster (3 CPUs, 6GB RAM) with only sealed-secrets and trivy-server installed. It still gets OOMKilled with the default 500M memory limit and OPERATOR_CONCURRENT_SCAN_JOBS_LIMIT set to 1. After increasing the memory limit to 1500M, the OOMKilled goes away. There is probably a short spike in memory consumption.

SergeyBear avatar Sep 26 '22 09:09 SergeyBear