
Faulty scan jobs blocking further scans from being executed

Open VF-mbrauer opened this issue 2 years ago • 18 comments

What steps did you take and what happened:

Due to the error reported in https://github.com/aquasecurity/trivy-operator/issues/206, scan jobs get stuck. When this happens, no other pods are scanned anymore: once the OPERATOR_CONCURRENT_SCAN_JOBS_LIMIT is reached, no new scan pods are spawned because trivy-operator is still waiting for the stuck ones to finish.

Example (due to the error in https://github.com/aquasecurity/trivy-operator/issues/206) :

scan-vulnerabilityreport-5759f44647--1-qf7sh   0/1     Completed   0          7m49s
scan-vulnerabilityreport-7d57cffd5f--1-47vds   0/1     Completed   0          2m58s
scan-vulnerabilityreport-849fffd5c7--1-p9fdt   0/1     Completed   0          6m58s
scan-vulnerabilityreport-dc5fb6cf--1-xq5kw     0/1     Completed   0          7m28s
scan-vulnerabilityreport-f49679dcc--1-cvd8x    0/1     Completed   0          118s

What did you expect to happen: Even if jobs get stuck due to an unforeseen error, they should be released after some time so that scanning can continue with other repositories/registries. Otherwise, no further scanning happens at all.

Anything else you would like to add:

If the Job/Pod is deleted manually, trivy-operator will likely pick up another remaining deployment to scan and scanning continues, but as soon as it comes back to the deployment that triggers the error, the pod gets stuck again. So to get all deployments scanned you have to increase OPERATOR_CONCURRENT_SCAN_JOBS_LIMIT to a high value and frequently delete all hung jobs/pods to give trivy-operator the freedom to spawn new scans. A rough sketch of this manual workaround is shown below.
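For illustration only, a hedged sketch of that manual workaround (the namespace and the operator deployment name are assumptions taken from listings later in this thread; the env var name is the one quoted in this issue):

# delete completed/stuck scan jobs so the operator can spawn new ones
kubectl get jobs -n trivy-system -o name \
  | grep scan-vulnerabilityreport \
  | xargs -r kubectl delete -n trivy-system

# and, if needed, raise the concurrency limit on the operator deployment
kubectl set env deployment/trivy-operator -n trivy-system OPERATOR_CONCURRENT_SCAN_JOBS_LIMIT=10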

Environment:

  • Trivy-Operator version: 0.1.0
  • Kubernetes version: 1.22

VF-mbrauer avatar Jul 12 '22 07:07 VF-mbrauer

@VF-mbrauer thanks for reporting this, I will review it and update.

chen-keinan avatar Jul 12 '22 07:07 chen-keinan

I'd suggest fixing this by making the limiter not count finished (completed/failed) jobs. That would also be a requirement for https://github.com/aquasecurity/trivy-operator/issues/228 when used together with the limiting feature, which I want to work on. WDYT @chen-keinan? I can work on it if you agree with the suggested approach. The sketch below shows what the proposed counting change would mean in practice.
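To make the suggestion concrete, a hedged illustration (not operator code) of the proposed counting semantics: only jobs that still have active pods would count against the limit. The namespace is an assumption based on listings later in this thread.

# count only scan jobs that still have active pods; completed/failed jobs
# would no longer count against OPERATOR_CONCURRENT_SCAN_JOBS_LIMIT
kubectl get jobs -n trivy-system -o json \
  | jq '[.items[] | select((.status.active // 0) > 0)] | length'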

erikgb avatar Jul 12 '22 10:07 erikgb

@erikgb, we need to be careful, because it will also lead to higher resource consumption, as the completed jobs will still occupy vCPU and MEM during that time. Therefore we need to calculate, and mention, a slight increase in resources even if you limit them with OPERATOR_CONCURRENT_SCAN_JOBS_LIMIT set to a specific value.

@chen-keinan cc

VF-mbrauer avatar Jul 12 '22 10:07 VF-mbrauer

I think you mean completed ones in

lead to resource consumption, as the not completed ones will still occupy vCPU and MEM at that time.

?

erikgb avatar Jul 12 '22 10:07 erikgb

Yes, you are right, corrected my statement already. Thanks for that.

VF-mbrauer avatar Jul 12 '22 10:07 VF-mbrauer

Something like activeDeadlineSeconds would make sense, to remove jobs/pods after some time and free up space for new scans to be initiated. This is just to prevent the stoppage; we should still drive the fix for https://github.com/aquasecurity/trivy-operator/issues/206 as well.

VF-mbrauer avatar Jul 12 '22 10:07 VF-mbrauer

@erikgb sure, pick it up. I agree with @VF-mbrauer that there is a concern around completed jobs piling up before cleanup has taken place. We can't count on the opt-in TTL, as it probably won't be available by default in all k8s versions.

chen-keinan avatar Jul 12 '22 11:07 chen-keinan

@VF-mbrauer You can actually set activeDeadlineSeconds on scan jobs by configuring OPERATOR_SCAN_JOB_TIMEOUT. It seems to have a default value of 5m. I don't have an environment where I can reproduce your problem. Maybe you can try it and see if it helps? A hedged example of applying the setting follows.
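For illustration, a hedged sketch of adjusting that timeout (the env var name is the one mentioned above; the deployment and namespace names are assumptions based on listings later in this thread, and a Helm-managed install may manage this value differently):

# raise the scan job timeout on the operator; scan jobs then get their
# activeDeadlineSeconds derived from this value
kubectl set env deployment/trivy-operator -n trivy-system OPERATOR_SCAN_JOB_TIMEOUT=10m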

erikgb avatar Jul 12 '22 19:07 erikgb

@erikgb The default of 5 minutes does not seem to help here either, as jobs were sitting there for hours/days. So lowering that value to 1 minute or similar also seems useless. We should also check what this setting actually means and what it really influences.

Reproducing the issue is not necessary as the result is already in my statement above.

VF-mbrauer avatar Jul 12 '22 20:07 VF-mbrauer

@erikgb The default of 5 minutes does not seem to help here either, as jobs were sitting there for hours/days. So lowering that value to 1 minute or similar also seems useless. We should also check what this setting actually means and what it really influences.

Hmmm, interesting! Can you check your scan Job YAML, whether it includes activeDeadlineSeconds, and what the value is?

erikgb avatar Jul 12 '22 21:07 erikgb

@erikgb The default of 5 minutes does not seem to help here either, as jobs were sitting there for hours/days. So lowering that value to 1 minute or similar also seems useless. We should also check what this setting actually means and what it really influences.

Hmmm, interesting! Can you check your scan Job YAML, whether it includes activeDeadlineSeconds, and what the value is?

This is an extract of the Job YAML:

spec:
  activeDeadlineSeconds: 300
  backoffLimit: 0
  completionMode: NonIndexed
  completions: 1
  parallelism: 1

It contains activeDeadlineSeconds, which is set to 5 minutes (300 seconds).

But the Age is over 14 hours:

#kubectl get job -n trivy-system                                                                               
NAME                                 COMPLETIONS   DURATION   AGE
scan-vulnerabilityreport-6f4658c97   1/1           112s       14h

#kubectl get pod -n trivy-system                                                                            
NAME                                          READY   STATUS      RESTARTS   AGE
scan-vulnerabilityreport-6f4658c97--1-slxns   0/1     Completed   0          14h
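
Note that activeDeadlineSeconds only terminates a Job that is still running; it does not clean up a Job that has already Completed, which is why these jobs can sit for hours. Automatically removing finished Jobs would rely on the Kubernetes TTL-after-finished feature (the opt-in TTL mentioned earlier), roughly as in this hedged sketch:

spec:
  activeDeadlineSeconds: 300
  ttlSecondsAfterFinished: 600   # illustrative value only; not set by trivy-operator in this setup
  backoffLimit: 0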

VF-mbrauer avatar Jul 13 '22 07:07 VF-mbrauer

One of the installations, showing that it blocks completely:

kubectl get pod -n trivy-system                                                                                                                 
NAME                                           READY   STATUS      RESTARTS   AGE
scan-vulnerabilityreport-547874795b--1-pfdb7   0/1     Completed   0          41h
scan-vulnerabilityreport-b7d7f6874--1-9dj5j    0/1     Completed   0          41h
trivy-exporter-77cdf45fc6-79v7s                1/1     Running     0          41h
trivy-operator-895486674-vm686                 1/1     Running     0          41h

kubectl get job -n trivy-system                                                                                                               
NAME                                  COMPLETIONS   DURATION   AGE
scan-vulnerabilityreport-547874795b   1/1           52s        41h
scan-vulnerabilityreport-b7d7f6874    1/1           3m56s      41h

So no vulnerabilities were reported at all:

kubectl get vuln -A                                                                                                                           
No resources found

VF-mbrauer avatar Jul 14 '22 07:07 VF-mbrauer

@erikgb @chen-keinan Any news on this one? Until this is finally solved I hesitate to roll out further. Jobs are still in a stuck state.

VF-mbrauer avatar Jul 26 '22 16:07 VF-mbrauer

@VF-mbrauer this issue is under investigation, I will update you once we have a solid solution.

chen-keinan avatar Jul 26 '22 19:07 chen-keinan

@chen-keinan any news on this one? Independent of any issue related to trivy-operator or the trivy scanner, the job should be properly released and not get stuck forever.

VF-mbrauer avatar Sep 11 '22 16:09 VF-mbrauer

@VF-mbrauer you mention at the top that scan jobs get stuck due to error #206 (meaning the scan job is completed but trivy-operator is unable to process the report). I assume this is not the case now, am I right?

chen-keinan avatar Sep 11 '22 16:09 chen-keinan

@chen-keinan Yes, that is correct. That has been fixed with compression, and it will be slimmed down further with the work you are doing on splitting the CRDs and removing unnecessary data.

But if in the future some new issue blocks a job, we should be prepared. That is why this ticket is still open.

VF-mbrauer avatar Sep 11 '22 16:09 VF-mbrauer

@chen-keinan Yes, that is correct. That has been fixed with compression, and it will be slimmed down further with the work you are doing on splitting the CRDs and removing unnecessary data.

But if in the future some new issue blocks a job, we should be prepared. That is why this ticket is still open.

trivy-operator has logic that knows to delete jobs on completion (once the report has been processed) or on failure. The case where a scan has completed but trivy-operator is unable to process the report (example: #206) should be fixed immediately when it is reported; bypassing it with a generic solution might lead to data loss or a jobs overflow, wdyt?

chen-keinan avatar Sep 11 '22 16:09 chen-keinan