trivy-operator
trivy-operator copied to clipboard
vulnerabilityScannerScanOnlyCurrentRevisions does not delete old reports
What steps did you take and what happened:
- Enable vulnerabilityScannerScanOnlyCurrentRevisions on starboard operator
- See reports generated only for most current deployments
- Deploy new revision of deployment
- See report generated for new deployment
- See old report is still there
What did you expect to happen:
- old report being deleted as I only want reports for the current deployments
Anything else you would like to add:
- The implementation in #870 simply didn't add that part of the feature
Environment:
- Starboard version (use
starboard version
):0.14.0
- Kubernetes version (use
kubectl version
):
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.2", GitCommit:"9d142434e3af351a628bffee3939e64c681afa4d", GitTreeState:"clean", BuildDate:"2022-01-19T17:27:51Z", GoVersion:"go1.17.6", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.3", GitCommit:"ca643a4d1f7bfe34773c74f79527be4afd95bf39", GitTreeState:"clean", BuildDate:"2021-07-15T20:59:07Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
- OS (macOS 10.15, Windows 10, Ubuntu 19.10 etc):
> sw_vers
ProductName: macOS
ProductVersion: 11.6.2
BuildVersion: 20G314
I've had a look at this, and the issue seems to originate from https://github.com/aquasecurity/starboard/issues/926, which is referring to https://github.com/aquasecurity/starboard/pull/870. The title on https://github.com/aquasecurity/starboard/pull/870, "feat(vulnrebilityreports) Only show the latest vulnrebilityreports from a deployment", might fool you to believe that the change fixed this issue, but I don't think that ever was the plan. @NissesSenap, please correct me if I am wrong!
The name of the configuration property clearly states what is does, at least to me: vulnerabilityScannerScanOnlyCurrentRevisions. So I don't expect this to delete reports from inactive replicasets. There are also some interesting points in the discussion leading up to this change: https://github.com/aquasecurity/starboard/discussions/858.
I think we should think very carefully before changing the behavior here. Since the ownerReference is set on the reports, we should ideally leave deletion up to the Kubernetes garbage collection mechanism. We usually set ownerReference.blockOwnerDeletion=true
on dependent resources in our internal controllers, but AFAIK this should be cleaned up by Kubernetes anyway when the parent is gone.
This issue could be be solved by configuration, though not ideal. One option is to set vulnerabilityScannerReportTTL
, causing reports to be deleted the configured duration after creation. Since vulnerabilityScannerScanOnlyCurrentRevisions
set to true will skip inactive replicasets, this should achieve the desired result.
I am also eager to find a better solution to the desire behind this issue, but I think it is more complex than to just adjust the behavior of the existing vulnerabilityScannerScanOnlyCurrentRevisions
config property.....
Thanks for the tag @erikgb , there was a long time ago so my memory is a bit hazy of our thinking. I know that we spoke about it but I think our general feeling was to keep it as simple as possible for starters.
But looking at it now I feel that it would be a nice feature to add auto cleanup of the old reports as well when creating a new RS, this way ensuring quicker metric correctness even though the TTL feature helps out with this.
As you point out if the RS is removed the report will also be deleted since it's the owner of it the resource.
The current solution is far from optimal. We went with this direction to be able to solve the issue without doing any breaking changes. Ideally I think we should change the default behavior and only scan on deployment level instead. Then we could reuse the same report and just update it over and over again. There are probably a bunch of reasons why we don't want to do this but i think it would lower the code complexity. If this is something that we are willing to discuss again I can create a separate discussion about it.
Taking the risk of starting to talk about different things but ideally I think we also should refactor the TTL feature completely and build it in to the controller instead of having a secondary controller to just set TTL.
My guess is that is something that the community don't want, so in summary about this specific issue I think we should add the feature of auto cleanup of existing reports when a new report have been generated when using the vulnerabilityScannerScanOnlyCurrentRevisions
flag.
Thanks for your input @NissesSenap! Always good to get some background on changes. Thinking a bit more about it, I think we should introduce a new configuration parameter, which could be named something in the lines of vulnerabilityScannerRevisonHistoryLimit
, which will allow us to not be "slaves" of deployments revisionHistoryLimit. At the same time we should deprecate vulnerabilityScannerScanOnlyCurrentRevisions
and eventually remove after a couple of releases. Until it is removed, we could make the properties "connected" in a way that makes this work about the same as now. WDYT?
Sounds reasonable @erikgb. What would the default value of vulnerabilityScannerRevisonHistoryLimit be do you think?
Sounds reasonable @erikgb. What would the default value of vulnerabilityScannerRevisonHistoryLimit be do you think?
I would say the default value probably should be 0 when the transition period has ended, as I think a majority of users prefer to only keep reports for the active replicaset.
the same for chart version: 0.10.6 appVersion: 0.15.6
After going through the discussions here and in aquasecurity/starboard#858 and the associated proposal from @erikgb, I propose the following solution -
Introduce new configuration parameter vulnerabilityScannerRevisonHistoryLimit with the default value of 0. This parameter will control the number of old vulnerability reports to retain. TTLReportReconciler (https://github.com/aquasecurity/trivy-operator/blob/main/pkg/vulnerabilityreport/ttl_report.go) will be extended to support deleting older vulnerability reports in addition to deleting TTL expired reports when vulnerabilityScannerReportTTL configuration parameter is set.
This makes the vulnerabilityScannerScanOnlyCurrentRevisions configuration parameter redundant and harder to use along with the above new parameter vulnerabilityScannerRevisonHistoryLimit. I propose we remove this parameter (vulnerabilityScannerScanOnlyCurrentRevisions) to keep things simple. What this means in practice is when vulnerabilityScannerRevisonHistoryLimit is unset (= set to 0), only the latest revision is scanned and report is retained for that revision. If set to 1, then the 2 last revisions are scanned and reports retained for those 2 revisions.
If the above proposal sounds good, then I will start with the following tasks -
- Start with unit tests for ttl_report.go so that I don’t break any existing functionality
- Extend the TTL reconciler to support both TTL and vulnerabilityScannerRevisonHistoryLimit along with removing/deprecating vulnerabilityScannerScanOnlyCurrentRevisions parameter.
Comments?
@padlar sorry for coming late on this , however I have to understand the suggested solution before going forward.
removing vulnerabilityScannerRevisonHistoryLimit
mean that we will continue to create reports for all previous revisions
but now we only garbage collect it with vulnerabilityScannerReportTTL
and vulnerabilityScannerRevisonHistoryLimit
params ?
don't we want to delete as little historical reports, as we reconcile ReplicaSets
and VulnerabilityReports
in the same loop ,meaning every report deletion will trigger re-scan , wdyt ?
also please elaborate why vulnerabilityScannerReportTTL
(if it set by default) only without any other change is not sufficient for deleting previous revisions ?
@padlar sorry for coming late on this , however I have to understand the suggested solution before going forward. removing
vulnerabilityScannerRevisonHistoryLimit
mean that we will continue to create reports forall previous revisions
but now we only garbage collect it withvulnerabilityScannerReportTTL
andvulnerabilityScannerRevisonHistoryLimit
params ?
vulnerabilityScannerRevisonHistoryLimit
default value will be 0, so by default report shall only be generated for the latest revision.
don't we want to delete as little historical reports, as we reconcile
ReplicaSets
andVulnerabilityReports
in the same loop ,meaning every report deletion will trigger re-scan , wdyt ?
Logic for the new param vulnerabilityScannerRevisonHistoryLimit
would be controlled from the main loop and wont trigger unncessary rescans.
also please elaborate why
vulnerabilityScannerReportTTL
(if it set by default) only without any other change is not sufficient for deleting previous revisions ?
Just setting VulnerabilityScannerReportTTL alone is not sufficient to delete the old reports. For example if there are 8 replica sets and the TTL is set to 24hrs, vulnerability reports for all 8 replica sets will be regenerated post TTL expiry. So to delete old reports without regenerating them, user has to set vulnerabilityScannerScanOnlyCurrentRevisions along with VulnerabilityScannerReportTTL. This restricts the scanning and the report generation to the latest revision.
Our proposal to replace vulnerabilityScannerScanOnlyCurrentRevisions with vulnerabilityScannerRevisonHistoryLimit allows the users to control the number of revisions to scan (default will be set to 0 – scan only the latest revision). This is a nice extension to the deployment clean pod policy Deployments | Kubernetes.
this is what we want to restricts the scanning and the report generation to the latest revision
trivy operator is runtime scanner it should only present what is running now on the cluster.
so vulnerabilityScannerScanOnlyCurrentRevisions
along with VulnerabilityScannerReportTTL
should solve the issue
IMO is keeping a scan report around, but not (re-)scan it, not optimal. And this issue really shows what users expects out of the vulnerabilityScannerScanOnlyCurrentRevisions
property. It just doesn't do what users expect/want! Even if the property-name should give you an idea. I think we'll just wait for https://github.com/aquasecurity/trivy-operator/issues/331 to revisit this issue. If someone else like to contribute an improvement using the existing properties, I would be happy for that. But from our point of view, the current properties (vulnerabilityScannerScanOnlyCurrentRevisions
and vulnerabilityScannerReportTTL
), and the correlation between, should be reconsidered. And I think a new property like vulnerabilityScannerRevisonHistoryLimit
will fit well in the standard K8s controller pattern with spec/status as suggested in https://github.com/aquasecurity/trivy-operator/issues/331, and combined will obsolete both existing properties.
I can add that in our internal image-scanner operator we use spec/status actively: We add a lastScanTimestamp
to the resource status, and perform a re-scan whenever we are past a configurable duration since lastScanTimestamp
. So no need for for TTL, which performs an imperative operation (delete) - just to get a re-scan of your workload image. The suggested vulnerabilityScannerRevisonHistoryLimit
will just ensure you don't scan AND keep old reports for old replicasets - like a "sliding window" of image scans for workloads.
IMO vulnerabilityScannerScanOnlyCurrentRevisions
and vulnerabilityScannerReportTTL
both should be activated by default , it will give us the same result.
I do not see a need for introducing additional logic on top of it.
We have a bunch of clusters which get cluttered with old reports fairly fast. I honestly expected from the name vulnerabilityScannerScanOnlyCurrentRevisions
that it would only keep new reports and remove old ones. I feel like the name is misleading.
In any case, we use the grafana dashboards from starboard-exporter and having old versions floating around will balloon the amount of found CVEs. Which means the numbers in that dashboards are useless, because they don't represent the current cluster-state. To fix this we have to manually delete old reports which is tedious.
I also think the var name vulnerabilityScannerScanOnlyCurrentRevisions
is misleading.
I was expecting not to see new reports for old replica sets that are not active.
I have a few days old replica set, but I see vulnerability scans for all of them, and the reports themselves are 1hr old.
@komljen in the prev. version of trivy-operator
, vulnerabilityScannerScanOnlyCurrentRevisions
and TTL
has been enabled by default , you should not see reports older then 1 day
I don't see old reports, but those reports scan for images in the old replica sets which are not active anymore. So, unless I remove the old replica set I would still see the vulnerability reported, even though the image from the old replica sets are no longer running.
If I have those 3 replica sets:
inalytics-595c655c75 0 0 0 3d1h
inalytics-5bbb6c96d4 0 0 0 2d1h
inalytics-6575c95c48 6 6 6 26h
Only the last one has active/running containers. But the scanner will check for all of those. And currently, there isn't an option to run a scan only on the latest one.
Which trivy-operator version are you using?
Or Did you set vulnerabilityScannerScanOnlyCurrentRevisions
to false ?
It is set to true, just checked in the pod:
- name: OPERATOR_VULNERABILITY_SCANNER_SCAN_ONLY_CURRENT_REVISIONS
value: "true"
This is the version I'm running https://github.com/aquasecurity/starboard/releases/tag/v0.15.6.
can you find me on slack , I'll explain better
On Thu, Oct 20, 2022, 7:24 PM Alen Komljen @.***> wrote:
If I have those 3 replica sets:
inalytics-595c655c75 0 0 0 3d1h inalytics-5bbb6c96d4 0 0 0 2d1h inalytics-6575c95c48 6 6 6 26h
Only the last one has active/running containers. But the scanner will check for all of those. And currently, there isn't an option to run a scan only on the latest one.
— Reply to this email directly, view it on GitHub https://github.com/aquasecurity/trivy-operator/issues/237#issuecomment-1285835486, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACKUI6NZ2TNDVPLRJIACPVTWEFW45ANCNFSM52QMRHBQ . You are receiving this because you authored the thread.Message ID: @.***>
I see , you are using starboard
not trivy-operator
What is the difference? Cannot join slack, it reports a server error. Will try again later.
These is a huge different between starboard
to trivy-operator
, as I mention in trivy-operator you should not see this issue , starboard is in maintenance mode in last six months , no improvments or advance features
I see, I will try replacing it with trivy operator and post here for results. Thanks so much for the quick answers.