
Vulnerability scanning encounters "etcdserver: request is too large"

Open FrederikNJS opened this issue 3 years ago • 5 comments

What steps did you take and what happened:

I installed Starboard-operator using the helm chart and allowed it to run on my entire cluster. Some of the vulnerability scan jobs get stuck and the starboard-operator is logging messages about "etcdserver: request is too large". Here's a complete log line:

{"level":"error","ts":1641839401.717973,"logger":"controller.job","msg":"Reconciler error","reconciler group":"batch","reconciler kind":"Job","name":"scan-vulnerabilityreport-68cbdf566b","namespace":"starboard-system","error":"etcdserver: request is too large","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}

I suspect that I have some images with way too many vulnerabilities... So being able to store them so I can track them down would be really nice.

What did you expect to happen:

I expected Starboard to be able to store the vulnerability reports properly.

Anything else you would like to add:

This seems to have been discussed already in #208, where some information was stripped out of the VulnerabilityReport and the issue was then closed as "too unlikely", even though it still occurs for me.

My complete values for the Helm chart are:

targetNamespaces: ""
trivy:
  githubToken: <REDACTED>
  resources:
    limits:
      memory: 1000Mi
    requests:
      memory: 1000Mi

Environment:

  • Helm chart version: 0.8.2
  • Starboard version (use starboard version): 0.13.2
  • Kubernetes version (use kubectl version): 1.19.7

FrederikNJS, Jan 10 '22 18:01

Additionally I can see that these stuck vulnerability scans count against the scanJobsConcurrentLimit, and the operator doesn't give up on them either when the scanJobTimeout expires...

My scanJobTimeout is set to the default 5 minutes, and I have seen jobs stuck for more than an hour, clogging up the system, blocking other scans from starting.

FrederikNJS, Jan 10 '22 23:01

As a workaround, I have tried limiting trivy's severity to only include HIGH,CRITICAL, which of course cuts down the number of vulnerabilities written into the VulnerabilityReport and in turn makes the reports small enough to save to etcd. This seems to work nicely. It would, however, still be nice to be able to save all the vulnerabilities to get a complete overview.

FrederikNJS, Jan 10 '22 23:01

👋 @FrederikNS Thank you for the feedback. This is a well-known limitation of Starboard (and K8s with its default etcd storage) right now, and we do not implement any fallback strategy. Do you have any ideas about what we could do in such a case?

BTW, is it possible to share the image and image size or at least the number of all vulnerabilities found by Trivy that cause this error?

danielpacak, Jan 19 '22 08:01

In the specific case that the report is too large, I propose at least storing everything except the vulnerabilities list and adding an annotation such as starboard.aquasecurity.github.io/report-too-large=true, which one could filter on and monitor for.
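A minimal sketch of this fallback, assuming the report is handled as a plain dict shaped like a simplified VulnerabilityReport (`metadata`, `report.summary`, `report.vulnerabilities`). The 1.5 MiB figure is etcd's default `--max-request-bytes`; measuring the compact JSON size is only an approximation of what actually reaches etcd, and `fit_report` is a hypothetical helper, not part of Starboard.

```python
import copy
import json

# etcd's default request size limit (--max-request-bytes), 1.5 MiB.
ETCD_MAX_REQUEST_BYTES = 1_572_864
# Annotation name taken from the proposal above; Starboard does not define it.
TOO_LARGE_ANNOTATION = "starboard.aquasecurity.github.io/report-too-large"


def serialized_size(obj: dict) -> int:
    """Size in bytes of the object serialized as compact JSON (an approximation)."""
    return len(json.dumps(obj, separators=(",", ":")).encode("utf-8"))


def fit_report(report: dict, limit: int = ETCD_MAX_REQUEST_BYTES) -> dict:
    """Return the report unchanged if it fits under the limit.

    Otherwise return a copy with the vulnerabilities list emptied and the
    marker annotation set, keeping the summary and all other fields.
    """
    if serialized_size(report) <= limit:
        return report
    trimmed = copy.deepcopy(report)
    trimmed["report"]["vulnerabilities"] = []
    annotations = trimmed.setdefault("metadata", {}).setdefault("annotations", {})
    annotations[TOO_LARGE_ANNOTATION] = "true"
    return trimmed
```

Because the summary counts survive, a trimmed report would still show how many findings each severity had, and the annotation gives a single thing to alert on.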

More generally, I assume compressing the vulnerabilities field could do the trick (Helm went that way early on). Of course, this would require changes throughout the complete tooling stack.
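To illustrate the compression idea, here is a rough sketch that gzips the JSON-encoded vulnerabilities list and base64-encodes the result so it could be stored as a string field. The function names and the choice of gzip plus base64 are assumptions for illustration only; Starboard does not store reports this way.

```python
import base64
import gzip
import json


def compress_vulnerabilities(vulnerabilities: list) -> str:
    """Encode the list as compact JSON, gzip it, and return base64 text."""
    raw = json.dumps(vulnerabilities, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("ascii")


def decompress_vulnerabilities(blob: str) -> list:
    """Reverse compress_vulnerabilities and return the original list."""
    return json.loads(gzip.decompress(base64.b64decode(blob)))


if __name__ == "__main__":
    # Synthetic findings; real reports repeat field names and package names
    # in much the same way, which is why they tend to compress well.
    sample = [
        {"vulnerabilityID": f"CVE-2021-{i:05d}", "resource": "openssl", "severity": "LOW"}
        for i in range(5000)
    ]
    plain = json.dumps(sample, separators=(",", ":"))
    packed = compress_vulnerabilities(sample)
    print(f"plain: {len(plain)} bytes, compressed: {len(packed)} bytes")
```

The cost is the one mentioned above: anything that reads the field (kubectl output, dashboards, other tooling) would have to decode it first.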

A more "advanced" change would be to allow storing the reports in a database, at least for the operator deployments. Maybe something memcache-compatible, considering that the reports can be ephemeral. To be honest, abusing the etcd resource store for this kind of data sounds like malpractice altogether.

Another way might be to provide a report consumer that collects and stores the reports and serves them on request, e.g., via a web interface.

Arabus, Jan 24 '22 11:01

I like the current behavior of having the reports in the same namespace as the resource they relate to. It makes it possible to use RBAC to restrict access to those reports (which may contain sensitive information).

If the reports are stored outside of etcd, we need to make sure that no single set of credentials can access all the reports.

An idea would be to create a set of credentials in each namespace. Using the credentials from a namespace should only give access to the reports related to this namespace. This kind of behavior could work well with Minio where we could have one bucket per namespace. The credentials stored in a namespace then only give access to the bucket corresponding to this namespace.
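The one-bucket-per-namespace idea could look roughly like the sketch below, which builds an S3-style read-only policy document of the kind MinIO accepts. The bucket naming scheme (`starboard-reports-<namespace>`) and both helper functions are hypothetical; nothing like this exists in Starboard.

```python
def bucket_for_namespace(namespace: str, prefix: str = "starboard-reports") -> str:
    """Hypothetical naming scheme: one bucket per Kubernetes namespace.

    Note: bucket names are limited to 63 characters, so long namespace names
    would need to be shortened or hashed in a real implementation.
    """
    return f"{prefix}-{namespace}"


def read_only_policy(namespace: str) -> dict:
    """Policy document granting read access to a single namespace's bucket."""
    bucket = bucket_for_namespace(namespace)
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Reading applies to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }
```

Credentials bound to such a policy and stored as a Secret in the namespace would then only reach that namespace's reports, which keeps the access model close to what RBAC provides today.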

bgoareguer, Jan 31 '22 16:01