trivy-operator
Faulty scan jobs blocking further scans from being executed
What steps did you take and what happened:
Due to the error reported in https://github.com/aquasecurity/trivy-operator/issues/206, scan jobs get stuck.
When this happens, other Pods will no longer be scanned: once the OPERATOR_CONCURRENT_SCAN_JOBS_LIMIT
is reached, no new scan Pods are spawned, because trivy-operator is still waiting for the stuck ones to finish.
Example (due to the error in https://github.com/aquasecurity/trivy-operator/issues/206):
scan-vulnerabilityreport-5759f44647--1-qf7sh 0/1 Completed 0 7m49s
scan-vulnerabilityreport-7d57cffd5f--1-47vds 0/1 Completed 0 2m58s
scan-vulnerabilityreport-849fffd5c7--1-p9fdt 0/1 Completed 0 6m58s
scan-vulnerabilityreport-dc5fb6cf--1-xq5kw 0/1 Completed 0 7m28s
scan-vulnerabilityreport-f49679dcc--1-cvd8x 0/1 Completed 0 118s
What did you expect to happen: Even if jobs get stuck due to an unforeseen error, they should be released after some time, to make sure that scanning continues with other repositories/registries. Otherwise, no further scanning happens at all.
Anything else you would like to add:
If the Job/Pod gets deleted manually, it is likely that trivy-operator
picks up any other remaining deployment to scan, and then
scanning continues; but once it comes back to the deployment that triggers the error, the Pod gets stuck again.
So to get all deployments scanned, you need to increase OPERATOR_CONCURRENT_SCAN_JOBS_LIMIT
to a high value,
and you need to frequently delete all hung jobs/pods, to give trivy-operator the freedom to spawn new scans.
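A minimal sketch of that manual cleanup, assuming the operator runs in the trivy-system namespace and scan job names start with scan-vulnerabilityreport- (as in the output above):

# Delete all scan jobs (their pods go with them via cascading deletion),
# so trivy-operator is free to spawn fresh scan jobs.
kubectl get jobs -n trivy-system -o name \
  | grep scan-vulnerabilityreport \
  | xargs -r kubectl delete -n trivy-system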
Environment:
- Trivy-Operator version: 0.1.0
- Kubernetes version: 1.22
@VF-mbrauer thanks for reporting this, I will review it and update.
I'd suggest fixing this by making the limiter not count finished (completed/failed) jobs. That will also be a requirement for https://github.com/aquasecurity/trivy-operator/issues/228 when used together with the limiting feature, which I want to do. WDYT @chen-keinan? I can work on it if you agree with the suggested approach.
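To illustrate the proposed semantics (a sketch of the idea only, not trivy-operator code): only jobs that still have active pods would count against OPERATOR_CONCURRENT_SCAN_JOBS_LIMIT, so finished jobs could no longer block new scans. Roughly:

# Count the scan jobs that are still active; under the proposal, only
# this number would be compared against the concurrency limit.
kubectl get jobs -n trivy-system \
  -o jsonpath='{range .items[?(@.status.active>0)]}{.metadata.name}{"\n"}{end}' \
  | wc -l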
@erikgb, we need to be careful, because it will also lead to higher resource consumption, as the completed
ones will still occupy vCPU and MEM at that time. Therefore, we need to calculate, and mention, a slight increase in resources even if you limit them with OPERATOR_CONCURRENT_SCAN_JOBS_LIMIT
set to a specific value.
@chen-keinan cc
I think you mean completed ones in
"lead to resource consumption, as the not completed ones will still occupy vCPU and MEM at that time."
?
Yes, you are right, corrected my statement already. Thanks for that.
Something like activeDeadlineSeconds
would make sense, to remove jobs/pods after some time and make room again for new scans to be initiated. This is just to prevent the stoppage; we should still drive the fix for https://github.com/aquasecurity/trivy-operator/issues/206 as well.
@erikgb sure, pick it up. I agree with @VF-mbrauer that there is a concern about completed jobs
piling up before cleanup has taken place. We can't count on the opt-in TTL mechanism,
as it will probably not be available by default in all k8s versions.
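For reference, the opt-in mechanism in question is the Kubernetes TTL-after-finished controller. Assuming it is enabled on the cluster, a finished job can be garbage-collected automatically by setting ttlSecondsAfterFinished, e.g. patched onto an existing job (illustrative only; the job name is a placeholder):

# Mark an already-finished scan job for automatic deletion 60s from now.
kubectl patch job <scan-job-name> -n trivy-system \
  --type=merge -p '{"spec":{"ttlSecondsAfterFinished":60}}'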
@VF-mbrauer You can actually set activeDeadlineSeconds
on scan jobs by configuring OPERATOR_SCAN_JOB_TIMEOUT.
It seems to have a default value of 5m.
I don't have an environment where I can reproduce your problem. Maybe you can try and see if it helps?
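A sketch of how that setting could be changed on a running installation, assuming the operator is deployed as deployment/trivy-operator in the trivy-system namespace (names may differ per install):

# Raise the scan job timeout; per the comment above, trivy-operator
# propagates this value to the jobs' activeDeadlineSeconds.
kubectl set env deployment/trivy-operator -n trivy-system \
  OPERATOR_SCAN_JOB_TIMEOUT=10m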
@erikgb The default of 5 minutes also does not seem to help in this case, as jobs were sitting there for hours/days. So even lowering that value to 1 minute or similar seems useless. We should also check the meaning of this setting and where it really has an effect.
Reproducing the issue is not necessary, as the result is already in my statement above.
Hmmm, interesting! Can you check your scan Job YAML, whether it includes activeDeadlineSeconds,
and what the value is?
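One quick way to check (the job name is a placeholder):

# Print the job's activeDeadlineSeconds, if set.
kubectl get job <scan-job-name> -n trivy-system \
  -o jsonpath='{.spec.activeDeadlineSeconds}{"\n"}'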
This is an extract of the Job YAML:
spec:
  activeDeadlineSeconds: 300
  backoffLimit: 0
  completionMode: NonIndexed
  completions: 1
  parallelism: 1
It contains activeDeadlineSeconds, set to 5 minutes.
But the Age is over 14 hours:
#kubectl get job -n trivy-system
NAME COMPLETIONS DURATION AGE
scan-vulnerabilityreport-6f4658c97 1/1 112s 14h
#kubectl get pod -n trivy-system
NAME READY STATUS RESTARTS AGE
scan-vulnerabilityreport-6f4658c97--1-slxns 0/1 Completed 0 14h
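A possible explanation, as far as I understand the Kubernetes Job API: activeDeadlineSeconds only terminates a job that is still running past the deadline. This job completed within 112s, and a Completed job is never cleaned up by that setting; it stays until something deletes it. The job's timestamps should confirm this:

# Show when the job started and finished; a gap well under 300s means
# activeDeadlineSeconds had nothing to terminate.
kubectl get job scan-vulnerabilityreport-6f4658c97 -n trivy-system \
  -o jsonpath='{.status.startTime}{" -> "}{.status.completionTime}{"\n"}'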
Output from one of the installations, showing that it blocks completely:
kubectl get pod -n trivy-system
NAME READY STATUS RESTARTS AGE
scan-vulnerabilityreport-547874795b--1-pfdb7 0/1 Completed 0 41h
scan-vulnerabilityreport-b7d7f6874--1-9dj5j 0/1 Completed 0 41h
trivy-exporter-77cdf45fc6-79v7s 1/1 Running 0 41h
trivy-operator-895486674-vm686 1/1 Running 0 41h
kubectl get job -n trivy-system
NAME COMPLETIONS DURATION AGE
scan-vulnerabilityreport-547874795b 1/1 52s 41h
scan-vulnerabilityreport-b7d7f6874 1/1 3m56s 41h
So no vulnerabilities were scanned at all:
kubectl get vuln -A
No resources found
@erikgb @chen-keinan Any news about this one? Until this gets finally solved, I hesitate to roll out further. Jobs are still in a stuck state.
@VF-mbrauer this issue is under investigation, I will update you once we have a solid solution.
@chen-keinan any news on this one? Independent of any issue related to trivy-operator or the trivy scanner, the job should be properly released and not get stuck forever.
@VF-mbrauer you mention at the top that scan jobs get stuck due to error #206 (meaning the scan job is completed but trivy-operator
is unable to process the report). I assume this is not the case now, am I right?
@chen-keinan Yes, that is correct. That has been fixed with compression, and it will be slimmed down further with the work you are doing on splitting the CRDs and reducing unnecessary data.
But if some new issue blocks a job in the future, we should be prepared. That is why this ticket is still open.
trivy-operator
has logic that knows to delete jobs on completion (once the report has been processed) or on failure. The case where a scan has completed but trivy-operator
is unable to process the report (example: #206) should be fixed immediately as reported; working around it with a generic solution might lead to data loss or a job overflow, wdyt?