Reconciler error: large VulnerabilityReports fail to be created and their scan jobs remain

Open mrtc0 opened this issue 2 years ago • 16 comments

What steps did you take and what happened:

Creation of VulnerabilityReports fails if the number of detected vulnerabilities is too large.

Steps to reproduce

  1. Create a Pod with the python:3.5.10-buster image on a cluster with trivy-operator installed

(python:3.5.10-buster has 1252 vulnerabilities.)

$ kubectl create deployment python -n default --image python:3.5.10-buster

  2. The scan finishes successfully, but creation of the VulnerabilityReport fails

# The scan completes successfully, but no VulnerabilityReports are created and the Job is not deleted
❯ kubectl get jobs -n trivy-system
NAME                                  COMPLETIONS   DURATION   AGE
scan-vulnerabilityreport-7d8db4fc84   1/1           73s        5m42s

❯ kubectl get pods -n trivy-system
NAME                                        READY   STATUS      RESTARTS   AGE
scan-vulnerabilityreport-7d8db4fc84-gsztw   0/1     Completed   0          4m12s
trivy-operator-54bc4769db-5djnd             1/1     Running     0          85m

❯ kubectl -n trivy-system logs scan-vulnerabilityreport-7d8db4fc84-gsztw | tr -d "\n" | base64 -d | bunzip2 | jq .ArtifactName
Defaulted container "python" out of: python, 95164937-7089-491c-ab58-7f52bc8a8cce (init)
"python:3.5.10-buster"

❯ kubectl get vulnerabilityreports -n default
No resources found in default namespace.

# A Reconciler error is logged by the operator
❯ kubectl logs -n trivy-system trivy-operator-54bc4769db-5djnd
...
{"level":"error","ts":1670231487.1708708,"msg":"Reconciler error","controller":"job","controllerGroup":"batch","controllerKind":"Job","Job":{"name":"scan-vulnerabilityreport-7d8db4fc84","namespace":"trivy-system"},"namespace":"trivy-system","name":"scan-vulnerabilityreport-7d8db4fc84","reconcileID":"7e1ccc8e-f6eb-4080-b1dd-25dcdb150e7c","error":"rpc error: code = ResourceExhausted desc = trying to send message larger than max (2633583 vs. 2097152)","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:326\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234"}

I am not familiar with the Kubernetes Operator implementation, but from the error message (trying to send message larger than max (2633583 vs. 2097152)) I would guess that the VulnerabilityReport is too large to be stored.
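The numbers in that error can be checked directly: 2097152 bytes is exactly 2 MiB, and the rejected message was 2633583 bytes. The following is a minimal Python sketch of the arithmetic only. The function name `over_limit_by` is illustrative and not part of trivy-operator, and which component enforces the 2 MiB cap is not established in this issue.

```python
MAX_MESSAGE_BYTES = 2 * 1024 * 1024  # 2097152, the "max" in the error message


def over_limit_by(message_bytes: int, limit: int = MAX_MESSAGE_BYTES) -> int:
    """Return how many bytes a message exceeds the limit by (0 if it fits)."""
    return max(0, message_bytes - limit)


if __name__ == "__main__":
    report_bytes = 2633583  # size reported in the Reconciler error
    excess = over_limit_by(report_bytes)
    share = excess / report_bytes
    print(f"limit={MAX_MESSAGE_BYTES} excess={excess} ({share:.1%} of the message)")
```

By this count the message would have to shrink by roughly a fifth to fit under the limit.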

Normally, reconcileJobs() should delete the Job, but it gets stuck because creating the VulnerabilityReport fails, so the completed Job is left behind. Once the number of such stuck Jobs reaches scanJobsConcurrentLimit, no new scans are performed.
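The blocking behaviour described above can be illustrated with a small simulation. This is a hedged Python model, not the operator's Go code: `Job`, `reconcile_job`, and `can_schedule` are hypothetical names, and the only behaviour taken from this issue is that a failed report write leaves the Job in place and that leftover Jobs count against scanJobsConcurrentLimit.

```python
from dataclasses import dataclass

MAX_MESSAGE_BYTES = 2 * 1024 * 1024  # limit seen in the error message


@dataclass
class Job:
    name: str
    report_bytes: int  # size of the report the scan produced


def reconcile_job(job: Job, active_jobs: list, reports: dict) -> None:
    """Model of the failing path: store the report, then delete the Job."""
    if job.report_bytes > MAX_MESSAGE_BYTES:
        # The write is rejected, so the cleanup below is never reached.
        raise RuntimeError(
            "trying to send message larger than max "
            f"({job.report_bytes} vs. {MAX_MESSAGE_BYTES})"
        )
    reports[job.name] = job.report_bytes
    active_jobs.remove(job)


def can_schedule(active_jobs: list, concurrent_limit: int) -> bool:
    """New scans start only while the Job count is below the limit."""
    return len(active_jobs) < concurrent_limit


if __name__ == "__main__":
    limit = 2  # stand-in for scanJobsConcurrentLimit
    jobs = [Job("scan-a", 2633583), Job("scan-b", 2633583)]
    reports = {}
    for job in list(jobs):
        try:
            reconcile_job(job, jobs, reports)
        except RuntimeError as err:
            print(f"{job.name}: Reconciler error: {err}")
    print("can schedule new scan:", can_schedule(jobs, limit))
```

In this model every oversized report permanently consumes one slot, which matches the observation that scanning stops once enough such Jobs accumulate.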

What did you expect to happen:

Either of the following outcomes would resolve this:

  • If creating the VulnerabilityReport fails, the Job is still deleted
  • The VulnerabilityReport is created successfully
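The first option can be sketched as follows: clean up the Job whether or not the report write succeeds, so a rejected report no longer holds a slot. This is an illustrative Python model with hypothetical names (`store_report`, `reconcile_and_cleanup`), not a patch against trivy-operator.

```python
MAX_MESSAGE_BYTES = 2 * 1024 * 1024  # limit seen in the error message


def store_report(name: str, report_bytes: int, reports: dict) -> None:
    """Hypothetical write that rejects oversized reports."""
    if report_bytes > MAX_MESSAGE_BYTES:
        raise RuntimeError(f"report for {name} is too large: {report_bytes} bytes")
    reports[name] = report_bytes


def reconcile_and_cleanup(name: str, report_bytes: int,
                          active_jobs: set, reports: dict) -> bool:
    """Try to store the report and delete the Job either way.

    Returns True if the report was stored, False if the write failed.
    """
    try:
        store_report(name, report_bytes, reports)
        return True
    except RuntimeError as err:
        print(f"{name}: report not stored: {err}")
        return False
    finally:
        # Runs on success and on failure, so the Job never lingers.
        active_jobs.discard(name)


if __name__ == "__main__":
    active = {"scan-a"}
    stored = reconcile_and_cleanup("scan-a", 2633583, active, {})
    print("stored:", stored, "remaining jobs:", len(active))
```

Under this option the scan result is discarded, so the workload still ends up without a report; only the second option addresses that.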

Environment:

  • Trivy-Operator version (use trivy-operator version): 0.7.1
  • Kubernetes version (use kubectl version): v1.25.0
  • OS (macOS 10.15, Windows 10, Ubuntu 19.10 etc):

mrtc0, Dec 05 '22 09:12