trivy-operator vulnerabilityScannerScanOnlyCurrentRevisions does not delete old reports

What steps did you take and what happened:

Enable vulnerabilityScannerScanOnlyCurrentRevisions on starboard operator
See reports generated only for most current deployments
Deploy new revision of deployment
See report generated for new deployment
See old report is still there

What did you expect to happen:

old report being deleted as I only want reports for the current deployments

Anything else you would like to add:

The implementation in #870 simply didn't add that part of the feature

Environment:

Starboard version (use starboard version): 0.14.0
Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.2", GitCommit:"9d142434e3af351a628bffee3939e64c681afa4d", GitTreeState:"clean", BuildDate:"2022-01-19T17:27:51Z", GoVersion:"go1.17.6", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.3", GitCommit:"ca643a4d1f7bfe34773c74f79527be4afd95bf39", GitTreeState:"clean", BuildDate:"2021-07-15T20:59:07Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}

OS (macOS 10.15, Windows 10, Ubuntu 19.10 etc):

> sw_vers
ProductName:    macOS
ProductVersion: 11.6.2
BuildVersion:   20G314

Jul 03 '22 09:07 chen-keinan

I've had a look at this, and the issue seems to originate from https://github.com/aquasecurity/starboard/issues/926, which is referring to https://github.com/aquasecurity/starboard/pull/870. The title on https://github.com/aquasecurity/starboard/pull/870, "feat(vulnrebilityreports) Only show the latest vulnrebilityreports from a deployment", might fool you to believe that the change fixed this issue, but I don't think that ever was the plan. @NissesSenap, please correct me if I am wrong!

The name of the configuration property clearly states what is does, at least to me: vulnerabilityScannerScanOnlyCurrentRevisions. So I don't expect this to delete reports from inactive replicasets. There are also some interesting points in the discussion leading up to this change: https://github.com/aquasecurity/starboard/discussions/858.

I think we should think very carefully before changing the behavior here. Since the ownerReference is set on the reports, we should ideally leave deletion up to the Kubernetes garbage collection mechanism. We usually set ownerReference.blockOwnerDeletion=true on dependent resources in our internal controllers, but AFAIK this should be cleaned up by Kubernetes anyway when the parent is gone.

This issue could be be solved by configuration, though not ideal. One option is to set vulnerabilityScannerReportTTL, causing reports to be deleted the configured duration after creation. Since vulnerabilityScannerScanOnlyCurrentRevisions set to true will skip inactive replicasets, this should achieve the desired result.

I am also eager to find a better solution to the desire behind this issue, but I think it is more complex than to just adjust the behavior of the existing vulnerabilityScannerScanOnlyCurrentRevisions config property.....

Jul 10 '22 20:07 erikgb

Thanks for the tag @erikgb , there was a long time ago so my memory is a bit hazy of our thinking. I know that we spoke about it but I think our general feeling was to keep it as simple as possible for starters.

But looking at it now I feel that it would be a nice feature to add auto cleanup of the old reports as well when creating a new RS, this way ensuring quicker metric correctness even though the TTL feature helps out with this.

As you point out if the RS is removed the report will also be deleted since it's the owner of it the resource.

The current solution is far from optimal. We went with this direction to be able to solve the issue without doing any breaking changes. Ideally I think we should change the default behavior and only scan on deployment level instead. Then we could reuse the same report and just update it over and over again. There are probably a bunch of reasons why we don't want to do this but i think it would lower the code complexity. If this is something that we are willing to discuss again I can create a separate discussion about it.

Taking the risk of starting to talk about different things but ideally I think we also should refactor the TTL feature completely and build it in to the controller instead of having a secondary controller to just set TTL.

My guess is that is something that the community don't want, so in summary about this specific issue I think we should add the feature of auto cleanup of existing reports when a new report have been generated when using the vulnerabilityScannerScanOnlyCurrentRevisions flag.

Jul 11 '22 08:07 NissesSenap

Thanks for your input @NissesSenap! Always good to get some background on changes. Thinking a bit more about it, I think we should introduce a new configuration parameter, which could be named something in the lines of vulnerabilityScannerRevisonHistoryLimit, which will allow us to not be "slaves" of deployments revisionHistoryLimit. At the same time we should deprecate vulnerabilityScannerScanOnlyCurrentRevisions and eventually remove after a couple of releases. Until it is removed, we could make the properties "connected" in a way that makes this work about the same as now. WDYT?

Jul 12 '22 21:07 erikgb

Sounds reasonable @erikgb. What would the default value of vulnerabilityScannerRevisonHistoryLimit be do you think?

Jul 13 '22 06:07 NissesSenap

Sounds reasonable @erikgb. What would the default value of vulnerabilityScannerRevisonHistoryLimit be do you think?

I would say the default value probably should be 0 when the transition period has ended, as I think a majority of users prefer to only keep reports for the active replicaset.

Jul 13 '22 06:07 erikgb

the same for chart version: 0.10.6 appVersion: 0.15.6

Jul 13 '22 15:07 melnikovpetr123

After going through the discussions here and in aquasecurity/starboard#858 and the associated proposal from @erikgb, I propose the following solution -

Introduce new configuration parameter vulnerabilityScannerRevisonHistoryLimit with the default value of 0. This parameter will control the number of old vulnerability reports to retain. TTLReportReconciler (https://github.com/aquasecurity/trivy-operator/blob/main/pkg/vulnerabilityreport/ttl_report.go) will be extended to support deleting older vulnerability reports in addition to deleting TTL expired reports when vulnerabilityScannerReportTTL configuration parameter is set.

This makes the vulnerabilityScannerScanOnlyCurrentRevisions configuration parameter redundant and harder to use along with the above new parameter vulnerabilityScannerRevisonHistoryLimit. I propose we remove this parameter (vulnerabilityScannerScanOnlyCurrentRevisions) to keep things simple. What this means in practice is when vulnerabilityScannerRevisonHistoryLimit is unset (= set to 0), only the latest revision is scanned and report is retained for that revision. If set to 1, then the 2 last revisions are scanned and reports retained for those 2 revisions.

If the above proposal sounds good, then I will start with the following tasks -

Start with unit tests for ttl_report.go so that I don’t break any existing functionality
Extend the TTL reconciler to support both TTL and vulnerabilityScannerRevisonHistoryLimit along with removing/deprecating vulnerabilityScannerScanOnlyCurrentRevisions parameter.

Comments?

Jul 14 '22 11:07 padlar

@padlar sorry for coming late on this , however I have to understand the suggested solution before going forward. removing vulnerabilityScannerRevisonHistoryLimit mean that we will continue to create reports for all previous revisions but now we only garbage collect it with vulnerabilityScannerReportTTL and vulnerabilityScannerRevisonHistoryLimit params ?

don't we want to delete as little historical reports, as we reconcile ReplicaSets and VulnerabilityReports in the same loop ,meaning every report deletion will trigger re-scan , wdyt ?

also please elaborate why vulnerabilityScannerReportTTL (if it set by default) only without any other change is not sufficient for deleting previous revisions ?

Jul 15 '22 18:07 chen-keinan

@padlar sorry for coming late on this , however I have to understand the suggested solution before going forward. removing vulnerabilityScannerRevisonHistoryLimit mean that we will continue to create reports for all previous revisions but now we only garbage collect it with vulnerabilityScannerReportTTL and vulnerabilityScannerRevisonHistoryLimit params ?

vulnerabilityScannerRevisonHistoryLimit default value will be 0, so by default report shall only be generated for the latest revision.

don't we want to delete as little historical reports, as we reconcile ReplicaSets and VulnerabilityReports in the same loop ,meaning every report deletion will trigger re-scan , wdyt ?

Logic for the new param vulnerabilityScannerRevisonHistoryLimit would be controlled from the main loop and wont trigger unncessary rescans.

also please elaborate why vulnerabilityScannerReportTTL (if it set by default) only without any other change is not sufficient for deleting previous revisions ?

Just setting VulnerabilityScannerReportTTL alone is not sufficient to delete the old reports. For example if there are 8 replica sets and the TTL is set to 24hrs, vulnerability reports for all 8 replica sets will be regenerated post TTL expiry. So to delete old reports without regenerating them, user has to set vulnerabilityScannerScanOnlyCurrentRevisions along with VulnerabilityScannerReportTTL. This restricts the scanning and the report generation to the latest revision.

Our proposal to replace vulnerabilityScannerScanOnlyCurrentRevisions with vulnerabilityScannerRevisonHistoryLimit allows the users to control the number of revisions to scan (default will be set to 0 – scan only the latest revision). This is a nice extension to the deployment clean pod policy Deployments | Kubernetes.

Jul 18 '22 08:07 padlar

this is what we want to restricts the scanning and the report generation to the latest revision trivy operator is runtime scanner it should only present what is running now on the cluster.

so vulnerabilityScannerScanOnlyCurrentRevisions along with VulnerabilityScannerReportTTL should solve the issue

Jul 18 '22 11:07 chen-keinan

IMO is keeping a scan report around, but not (re-)scan it, not optimal. And this issue really shows what users expects out of the vulnerabilityScannerScanOnlyCurrentRevisions property. It just doesn't do what users expect/want! Even if the property-name should give you an idea. I think we'll just wait for https://github.com/aquasecurity/trivy-operator/issues/331 to revisit this issue. If someone else like to contribute an improvement using the existing properties, I would be happy for that. But from our point of view, the current properties (vulnerabilityScannerScanOnlyCurrentRevisions and vulnerabilityScannerReportTTL ), and the correlation between, should be reconsidered. And I think a new property like vulnerabilityScannerRevisonHistoryLimit will fit well in the standard K8s controller pattern with spec/status as suggested in https://github.com/aquasecurity/trivy-operator/issues/331, and combined will obsolete both existing properties.

Jul 18 '22 13:07 erikgb

I can add that in our internal image-scanner operator we use spec/status actively: We add a lastScanTimestamp to the resource status, and perform a re-scan whenever we are past a configurable duration since lastScanTimestamp. So no need for for TTL, which performs an imperative operation (delete) - just to get a re-scan of your workload image. The suggested vulnerabilityScannerRevisonHistoryLimit will just ensure you don't scan AND keep old reports for old replicasets - like a "sliding window" of image scans for workloads.

Jul 18 '22 14:07 erikgb

IMO vulnerabilityScannerScanOnlyCurrentRevisions and vulnerabilityScannerReportTTL both should be activated by default , it will give us the same result. I do not see a need for introducing additional logic on top of it.

Jul 18 '22 14:07 chen-keinan

We have a bunch of clusters which get cluttered with old reports fairly fast. I honestly expected from the name vulnerabilityScannerScanOnlyCurrentRevisions that it would only keep new reports and remove old ones. I feel like the name is misleading.

In any case, we use the grafana dashboards from starboard-exporter and having old versions floating around will balloon the amount of found CVEs. Which means the numbers in that dashboards are useless, because they don't represent the current cluster-state. To fix this we have to manually delete old reports which is tedious.

Jul 29 '22 13:07 MPritsch

I also think the var name vulnerabilityScannerScanOnlyCurrentRevisions is misleading. I was expecting not to see new reports for old replica sets that are not active.

I have a few days old replica set, but I see vulnerability scans for all of them, and the reports themselves are 1hr old.

Oct 20 '22 15:10 komljen

@komljen in the prev. version of trivy-operator , vulnerabilityScannerScanOnlyCurrentRevisions and TTL has been enabled by default , you should not see reports older then 1 day

Oct 20 '22 16:10 chen-keinan

I don't see old reports, but those reports scan for images in the old replica sets which are not active anymore. So, unless I remove the old replica set I would still see the vulnerability reported, even though the image from the old replica sets are no longer running.

Oct 20 '22 16:10 komljen

If I have those 3 replica sets:

inalytics-595c655c75                  0         0         0       3d1h
inalytics-5bbb6c96d4                  0         0         0       2d1h
inalytics-6575c95c48                  6         6         6       26h

Only the last one has active/running containers. But the scanner will check for all of those. And currently, there isn't an option to run a scan only on the latest one.

Oct 20 '22 16:10 komljen

Which trivy-operator version are you using? Or Did you set vulnerabilityScannerScanOnlyCurrentRevisions to false ?

Oct 20 '22 16:10 chen-keinan

It is set to true, just checked in the pod:

    - name: OPERATOR_VULNERABILITY_SCANNER_SCAN_ONLY_CURRENT_REVISIONS
      value: "true"

This is the version I'm running https://github.com/aquasecurity/starboard/releases/tag/v0.15.6.

Oct 20 '22 16:10 komljen

can you find me on slack , I'll explain better

On Thu, Oct 20, 2022, 7:24 PM Alen Komljen @.***> wrote:

If I have those 3 replica sets:

inalytics-595c655c75 0 0 0 3d1h inalytics-5bbb6c96d4 0 0 0 2d1h inalytics-6575c95c48 6 6 6 26h

Only the last one has active/running containers. But the scanner will check for all of those. And currently, there isn't an option to run a scan only on the latest one.

— Reply to this email directly, view it on GitHub https://github.com/aquasecurity/trivy-operator/issues/237#issuecomment-1285835486, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACKUI6NZ2TNDVPLRJIACPVTWEFW45ANCNFSM52QMRHBQ . You are receiving this because you authored the thread.Message ID: @.***>

Oct 20 '22 16:10 chen-keinan

I see , you are using starboard not trivy-operator

Oct 20 '22 16:10 chen-keinan

What is the difference? Cannot join slack, it reports a server error. Will try again later.

Oct 20 '22 16:10 komljen

These is a huge different between starboard to trivy-operator , as I mention in trivy-operator you should not see this issue , starboard is in maintenance mode in last six months , no improvments or advance features

Oct 20 '22 16:10 chen-keinan

I see, I will try replacing it with trivy operator and post here for results. Thanks so much for the quick answers.

Oct 20 '22 16:10 komljen

trivy-operator trivy-operator copied to clipboard

vulnerabilityScannerScanOnlyCurrentRevisions does not delete old reports

trivy-operator
trivy-operator copied to clipboard