
Vulnerability report 404 error due to timeout

Open hackehackspett opened this issue 2 years ago • 2 comments

We have encountered a problem with vulnerability scanning, specifically when fetching the Trivy vulnerability reports, which fails with this error: [error]Unexpected response status: 404 Not Found [error]{"errors":[{"code":"NOT_FOUND","message":"report not found for project/repo@artifact"}]}

After some debugging, it seems that in order to fetch reports Harbor first checks whether the artifact is scannable by the scanner. If the ping to the scanner times out (due to high load, for example), Harbor decides that the scanner is not capable of scanning the artifact even though it is. The timeout is defined as:

Timeout: time.Second * 5,

in src/pkg/scan/rest/v1/client.go, which is used by src/controller/scan/base_controller.go when getting the metadata for a scanner, including the capabilities of the scanner. This timeout causes errors when the scanner is under high load and/or has limited resources and takes longer to respond. The resulting errors can also be misleading, since they surface as a 404 Not Found error ("report not found for X") originating on the lines:

if !scannable {
    return nil, errors.NotFoundError(nil).WithMessage("report not found for %s@%s", artifact.RepositoryName, artifact.Digest)
}

in src/controller/scan/base_controller.go. This makes it seem like the report is missing, when in fact the scanner simply did not respond in time to the metadata request that Harbor uses to check its capabilities and decide whether the artifact is scannable. The question is: how can this problem be mitigated other than by reducing load or increasing resources for Trivy? We have some ideas for changes in the Harbor codebase:

The cache expiration: bc.cache, _ = cache.New(cache.Memory, cache.Expiration(time.Second*30)) in src/controller/scanner/base_controller.go could be made configurable (through env variable) so that it can be increased to avoid having to request metadata from the scanner and instead getting it from cache.

The scannable or hasCapability check could be changed or skipped? Not sure what the implications of this would be or why it works like this currently.

The request timeout could be made configurable (through env variable) so that it can be increased if necessary.
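As a rough illustration of the "configurable through env variable" idea in the suggestions above, here is a minimal sketch of reading a duration from the environment with a fallback to the current hard-coded defaults (5 seconds and 30 seconds). The variable names `SCANNER_CLIENT_TIMEOUT` and `SCANNER_METADATA_CACHE_EXPIRATION` and both helper functions are hypothetical; none of them exist in Harbor.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// parseDurationOr parses strings such as "30s" or "10m" and returns def when
// raw is empty, malformed, or not a positive duration.
func parseDurationOr(raw string, def time.Duration) time.Duration {
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return def
	}
	return d
}

// durationFromEnv looks up name in the environment and falls back to def
// when the variable is unset or holds an unusable value.
func durationFromEnv(name string, def time.Duration) time.Duration {
	return parseDurationOr(os.Getenv(name), def)
}

func main() {
	// Hypothetical variable names, used only for illustration.
	timeout := durationFromEnv("SCANNER_CLIENT_TIMEOUT", 5*time.Second)
	expiration := durationFromEnv("SCANNER_METADATA_CACHE_EXPIRATION", 30*time.Second)
	fmt.Println("client timeout:", timeout, "cache expiration:", expiration)
}
```

Falling back to the default on a malformed value keeps existing deployments unchanged unless the variable is set deliberately.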

Would any of these changes make sense or have we missed or misunderstood something?

(Harbor version 2.6.0, deployed using Bitnami helm chart 15.2.2)

hackehackspett avatar Oct 05 '22 11:10 hackehackspett

can you provide the scanner log?

wy65701436 avatar Oct 10 '22 06:10 wy65701436

I’ve also experienced this issue when querying the vulnerability report for an image and based on examination of the source code I agree with the description above.

Looking at controller/scan/base_controller.go, both Scan() and GetReport() make the following call:

artifacts, scannable, err := bc.collectScanningArtifacts(ctx, r, artifact)

It makes sense to confirm that an artifact is “scannable” before trying to start a scan, but when it comes to fetching the report it is less obvious: either the report exists or it doesn’t, so why first check whether it is scannable? Maybe there is a good reason for this, due to some other design constraint that I didn’t grasp.

Anyway, what I have seen is that sometimes scannable = false for regular image artifacts that have already been scanned. As described above, this happens only because the hasCapability() check fails due to a cache miss (>30 seconds after the last update) in combination with a slow (>5 seconds) response from the Trivy service to the GET request for /api/v1/metadata.
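The failure mode described here, a slow metadata response being treated the same as a missing capability, can be reproduced in isolation. The sketch below is not Harbor code: `pingScanner`, `slowScanner` and `errScannerUnavailable` are hypothetical names, and the test server only stands in for a scanner that answers late. It shows one way a caller could tell a client-side timeout apart from other failures, so that the timeout is not reported as "not found".

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// errScannerUnavailable is a hypothetical sentinel for "the scanner did not
// answer in time", kept distinct from "the scanner lacks the capability".
var errScannerUnavailable = errors.New("scanner did not respond in time")

// pingScanner issues a GET with the given client timeout and maps timeout
// errors to errScannerUnavailable instead of a generic failure.
func pingScanner(url string, timeout time.Duration) error {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(url)
	if err != nil {
		var netErr net.Error
		if errors.As(err, &netErr) && netErr.Timeout() {
			return errScannerUnavailable
		}
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return nil
}

// slowScanner starts a local test server that waits for delay before
// answering, standing in for a scanner under load.
func slowScanner(delay time.Duration) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(delay)
		w.WriteHeader(http.StatusOK)
	}))
}

func main() {
	srv := slowScanner(200 * time.Millisecond)
	defer srv.Close()

	// The client gives up after 50 ms, well before the server answers.
	err := pingScanner(srv.URL, 50*time.Millisecond)
	fmt.Println(errors.Is(err, errScannerUnavailable))
}
```

With a distinct error for the timeout case, a caller could report something like "scanner unavailable" rather than a 404 for a report that actually exists.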

The logs from the Trivy service don’t add any information, as nothing erroneous has occurred there. The service was just a bit slow in responding to the API request. This typically happens when the CPU load is quite high (but not saturated) in combination with concurrent scanning of larger images (not necessarily the same image as the one for which a vulnerability report is requested).

Given that the capability of the Trivy adapter is semi-constant (it could theoretically change after an upgrade, I guess), it would be much appreciated if the 30-second cache retention could be prolonged. A configurable parameter, with default=30, would really help us, as we would then increase this value (a lot) to minimize the risk of having to query the scanner’s capabilities at a time of slow responsiveness.
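To make the effect of a longer cache retention concrete, here is a small, self-contained sketch of a time-based cache driven by a fake clock. It is not Harbor's cache implementation; `metadataCache` and `countFetches` are hypothetical names, and the 45-second request interval is an assumed example. It counts how often the scanner would have to be queried for a given expiration.

```go
package main

import (
	"fmt"
	"time"
)

// metadataCache holds a single value and refetches it once ttl has elapsed.
type metadataCache struct {
	ttl     time.Duration
	now     func() time.Time // injected clock, so the example is deterministic
	value   string
	fetched time.Time
	valid   bool
}

// get returns the cached value while it is fresh and calls fetch otherwise.
func (c *metadataCache) get(fetch func() (string, error)) (string, error) {
	if c.valid && c.now().Sub(c.fetched) < c.ttl {
		return c.value, nil
	}
	v, err := fetch()
	if err != nil {
		return "", err
	}
	c.value, c.fetched, c.valid = v, c.now(), true
	return v, nil
}

// countFetches simulates report requests arriving every interval and returns
// how many of them had to query the scanner for its metadata.
func countFetches(ttl, interval time.Duration, requests int) int {
	clock := time.Unix(0, 0)
	cache := &metadataCache{ttl: ttl, now: func() time.Time { return clock }}
	fetches := 0
	for i := 0; i < requests; i++ {
		_, _ = cache.get(func() (string, error) {
			fetches++
			return "capabilities", nil
		})
		clock = clock.Add(interval)
	}
	return fetches
}

func main() {
	fmt.Println(countFetches(30*time.Second, 45*time.Second, 4)) // prints 4
	fmt.Println(countFetches(10*time.Minute, 45*time.Second, 4)) // prints 1
}
```

With a 30-second expiration every request in this simulation hits the scanner, so each one is exposed to the 5-second timeout; with a 10-minute expiration only the first one is.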

Regarding the 5-second timeout in the client making the API-request, it could be beneficial to also allow for adjustments of that. In many of our problematic situations we have noticed that the Trivy API responsiveness is just above 5 seconds but usually it is way faster – maybe 100 ms, so this is just about rare occasions.

A workaround for this problem is to allocate a lot of CPU resources to our Trivy pods to minimize the risk of slow API responses that cause subsequent errors in the core service when trying to access vulnerability reports. But I would prefer to let the Trivy scanning run with fewer resources for a little longer and use the CPU for other tasks.

uivraeus avatar Nov 11 '22 15:11 uivraeus

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

github-actions[bot] avatar Jan 11 '23 09:01 github-actions[bot]

I have also seen problems in Harbor that are caused by the issues described here. Is this something that is being worked on?

Danielkem avatar Jan 27 '23 07:01 Danielkem

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

github-actions[bot] avatar Mar 28 '23 09:03 github-actions[bot]

This issue was closed because it has been stalled for 30 days with no activity. If this issue is still relevant, please re-open a new issue.

github-actions[bot] avatar Apr 28 '23 09:04 github-actions[bot]