
Running operator on containerd cuts the logs in client/server mode

Open · 1003n40 opened this issue 2 years ago

What steps did you take and what happened:

I took the latest version of trivy-operator and started it in client/server mode. It works for most pods, but at random it starts to fail because the logs are cut and therefore cannot be parsed when retrieved. This does not happen when running on dockerd as the CRI. The same can happen when multiple pods run the same client command against the server.

What did you expect to happen:

I expect logs not to get cut when running on containerd as the CRI.

Anything else you would like to add:

One thing that can also be done is to revert the trivy image version to 0.29.2 (in values.yaml of the helm chart), as the new 0.30.* versions use significantly more memory, which causes the pods created by the scan job to be OOM-killed because the limits were not adjusted when the new release was created.

What probably causes the problem is that having multiple clients opens multiple RPC channels, and some of them are closed/garbage-collected in the middle of transferring the information.

  • Trivy-Operator version (use trivy-operator version): latest
  • Kubernetes version (use kubectl version): 1.21.14
  • OS (macOS 10.15, Windows 10, Ubuntu 19.10 etc): gardenlinux
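
For context, this is roughly the setup being described, as a minimal values.yaml sketch. The trivy.mode, trivy.serverURL and trivy.resources keys follow the helm chart's conventions; the server address and memory figure here are assumptions for illustration, not values taken from this issue:

# values.yaml (sketch)
trivy:
  mode: ClientServer                                  # scan jobs call `trivy client` against a central server
  serverURL: http://trivy-server.trivy-system:4954    # hypothetical in-cluster server address
  resources:
    limits:
      memory: 500M                                    # assumption: raise this if 0.30.* scan jobs get OOM-killed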

1003n40 avatar Aug 09 '22 08:08 1003n40

@1003n40 thank you for the input. Can you please add logging or additional info on the failure?

chen-keinan avatar Aug 16 '22 08:08 chen-keinan

@chen-keinan could this also be related to the Kubernetes log rotation configuration?

josedonizetti avatar Aug 22 '22 20:08 josedonizetti

@1003n40 you can change the trivy image tag by setting this value in the trivy-operator-trivy-config ConfigMap:

imageRef: ghcr.io/aquasecurity/trivy:0.29.1
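
A minimal sketch of that edit, assuming the operator is installed in the trivy-system namespace and the ConfigMap key carries a trivy. prefix (check your install for the exact names):

apiVersion: v1
kind: ConfigMap
metadata:
  name: trivy-operator-trivy-config
  namespace: trivy-system        # assumption: namespace the operator was installed into
data:
  trivy.imageRef: ghcr.io/aquasecurity/trivy:0.29.1   # pin the scanner image back to 0.29.x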

chen-keinan avatar Aug 23 '22 05:08 chen-keinan

This issue is stale because it has been labeled with inactivity.

github-actions[bot] avatar Nov 21 '22 00:11 github-actions[bot]

This issue is stale because it has been labeled with inactivity.

github-actions[bot] avatar Feb 22 '23 00:02 github-actions[bot]

Hi, I'm not sure if this is related, but we are seeing the same behavior when running trivy client/server in a k3d/k3s cluster. The trivy client runs in a Kubernetes Job, and sometimes the scan results are cut when we fetch the logs from the Job.

We are only able to reproduce this behavior on GitHub when using the default runners (https://github.com/statnett/image-scanner-operator/actions/runs/4419186007/jobs/7747301776#step:13:381) and when running the tests on an old Mac. And the behavior typically only occurs when the scan results are large.

We suspect that this is related to hardware constraints, since we are only able to reproduce this when running k3d/k3s on machines with limited CPU/memory resources.

bendikp avatar Mar 15 '23 10:03 bendikp

The corresponding log on the node looks like this:

2023-03-14T18:54:13.784447539Z stdout F   {
2023-03-14T18:54:13.784566339Z stdout F     "fixedVersion": "5.20.2-3+deb8u11",
2023-03-14T18:54:13.784572939Z stdout F     "installedVersion": "5.20.2-3+deb8u6",
2023-03-14T18:54:13.784576439Z stdout F     "pkgName": "perl",
2023-03-14T18:54:13.784580039Z stdout F     "primaryURL": "https://avd.aquasec.com/nvd/cve-2018-12015",
2023-03-14T18:54:13.784583439Z stdout F     "severity": "HIGH",
2023-03-14T18:54:13.784586939Z stdout F     "title": "perl: Directory traversal in Archive::Tar",
2023-03-14T18:54:13.784591039Z stdout P     "vulnerabilityID": "CVE-2018-120

Where stdout P indicates a partial log entry: in the CRI log format, F marks a complete line and P marks a partial line that should be continued by a subsequent entry.

bendikp avatar Mar 15 '23 10:03 bendikp

This could be related to container log rotation.

Workaround: increase the kubelet default --container-log-max-size.

trivy-operator supports compression of the scan-job log output to avoid this issue.
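
A sketch of the kubelet side of that workaround; containerLogMaxSize and containerLogMaxFiles are standard KubeletConfiguration fields, and the 50Mi figure is only an example:

# KubeletConfiguration fragment -- raise the per-container log size limit
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 50Mi      # default is 10Mi; large scan results need more headroom
containerLogMaxFiles: 5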

chen-keinan avatar Mar 15 '23 11:03 chen-keinan

Update: after switching to "larger runners" on GitHub we haven't seen this issue.

I don't think it's related to log rotation, as the log file was only ~2MB and there was no second log file available on the node.

bendikp avatar Mar 17 '23 13:03 bendikp

@chen-keinan see containerd/containerd#7289. It is indeed a containerd problem.

1003n40 avatar Mar 18 '23 15:03 1003n40

This issue is stale because it has been labeled with inactivity.

github-actions[bot] avatar Jun 28 '23 00:06 github-actions[bot]