perf: Handle Trivy scan results asynchronously
Use CompletableFutures with a FixedThreadPool Executor to asynchronously handle each scan result from Trivy
Description
This change allows the TrivyAnalysisTask to handle scan results asynchronously by borrowing the existing `CompletableFuture` / `CountDownLatch` pattern used in the SnykAnalysisTask. The fixed thread pool size can be configured via `TRIVY_THREAD_POOL_SIZE` (default: 10).
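For illustration, a minimal sketch of the pattern (names like `TrivyResultDispatcher`, `ScanResult`, and `handleResult` are placeholders, not the task's actual API):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class TrivyResultDispatcher {

    // Pool size taken from TRIVY_THREAD_POOL_SIZE, defaulting to 10.
    private static final int POOL_SIZE =
            Integer.parseInt(System.getenv().getOrDefault("TRIVY_THREAD_POOL_SIZE", "10"));

    private final ExecutorService executor = Executors.newFixedThreadPool(POOL_SIZE);

    record ScanResult(String target) {}

    void handleResults(final List<ScanResult> results) throws InterruptedException {
        final CountDownLatch latch = new CountDownLatch(results.size());
        for (final ScanResult result : results) {
            CompletableFuture
                    .runAsync(() -> handleResult(result), executor)
                    .whenComplete((ignored, error) -> latch.countDown());
        }
        latch.await(); // The task only completes once every result is handled.
    }

    private void handleResult(final ScanResult result) {
        // Persist vulnerabilities, update findings, etc.
    }
}
```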
Addressed Issue
fixes #3570
Additional Details
- The `SCANNER_TRIVY_IGNORE_UNFIXED` setting is no longer checked for each vulnerability, as this is expensive and seemingly unnecessary. Instead, it is checked once per analysis task.
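Roughly the shape of the change (`isIgnoreUnfixedEnabled`, `ParsedVuln`, and `process` are stand-ins for the actual task internals):

```java
// Fetch the setting once when the analysis task starts (one config lookup)...
final boolean ignoreUnfixed = isIgnoreUnfixedEnabled(); // wraps the DB-backed isEnabled check

// ...then reuse the cached boolean inside the per-vulnerability loop.
for (final ParsedVuln vuln : parsedVulns) {
    if (ignoreUnfixed && vuln.fixedVersion() == null) {
        continue; // cheap in-memory check, no query per iteration
    }
    process(vuln);
}
```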
Checklist
- [x] I have read and understand the contributing guidelines
- ~[ ] This PR fixes a defect, and I have provided tests to verify that the fix is effective~
- ~[ ] This PR implements an enhancement, and I have provided tests to verify that it works as intended~
- ~[ ] This PR introduces changes to the database model, and I have added corresponding update logic~
- ~[ ] This PR introduces new or alters existing behavior, and I have updated the documentation accordingly~
Coverage summary from Codacy
See diff coverage on Codacy
| Coverage variation | Diff coverage |
|---|---|
| :white_check_mark: +0.01% (target: -1.00%) | :white_check_mark: 87.10% (target: 70.00%) |
Coverage variation details
| | Coverable lines | Covered lines | Coverage |
|---|---|---|---|
| Common ancestor commit (3b52284f98526f533bd8b7b705e707f6f5847102) | 21573 | 15971 | 74.03% |
| Head commit (28d1d42e68bbf2fe75474c7da063ea7f51d4d3f1) | 21597 (+24) | 15991 (+20) | 74.04% (+0.01%) |
Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: `<coverage of head commit> - <coverage of common ancestor commit>`
Diff coverage details
| | Coverable lines | Covered lines | Diff coverage |
|---|---|---|---|
| Pull request (#3571) | 31 | 27 | 87.10% |
Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: `<covered lines added or modified> / <coverable lines added or modified> * 100%`
Thanks for the PR @robert-blackman! Very much appreciate folks looking into unreleased code, it's the best opportunity to tweak and fix things!
How is the performance holding up when uploading many BOMs? Processing records in a separate thread pool can potentially have the opposite effect. For example, with a worker thread pool size of 20, DT could be executing the VulnerabilityAnalysisTask up to 20 times concurrently. But now the Trivy thread pool acts as a bottleneck, as all 20 worker threads shove tasks into it.
Good point @nscuro! Maybe a better approach (perhaps a naive one 😂) would be to initialise a new executor with each TrivyAnalysisTask, terminating it after each analysis? This simple approach wouldn't allow for finer control over the total number of concurrent executions (not to mention the memory overhead), but it would prevent tasks from blocking one another for the most part. Any thoughts?
> Maybe a better approach (perhaps a naive one 😂) would be to initialise a new executor with each TrivyAnalysisTask, terminating it after each analysis?
In that case we may end up creating more contention as threads compete for database connections. One TrivyAnalysisTask alone could hog all database connections in the connection pool, depending on how many threads we're allocating per task. You could test this by building a container image with your changes locally, and using it in the Docker Compose dev setup with monitoring. The Grafana dashboard includes metrics for the database connection pools.
I believe throwing concurrency at the problem might not solve it without creating new issues. Instead, we may want to look at making the code more efficient. One example would be here:
https://github.com/DependencyTrack/dependency-track/blob/35c7f862b2e30815465dde8cb6f691405304472c/src/main/java/org/dependencytrack/tasks/scanners/TrivyAnalysisTask.java#L447
Where we fetch a Component from the database, for each reported vulnerability. If multiple vulnerabilities are reported for the same component, we're doing redundant queries. By grouping the vulnerabilities by affected component, we should be able to minimize those queries.
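A sketch of that grouping idea (`ReportedFinding`, `componentUuid`, and `handle` are hypothetical names; `qm.getObjectByUuid` follows the usual QueryManager lookup pattern):

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.stream.Collectors;

// Group the reported findings by affected component up front, so each
// Component is fetched from the database once, not once per vulnerability.
void processFindings(final List<ReportedFinding> findings) {
    final Map<UUID, List<ReportedFinding>> byComponent = findings.stream()
            .collect(Collectors.groupingBy(ReportedFinding::componentUuid));

    byComponent.forEach((uuid, componentFindings) -> {
        final Component component = qm.getObjectByUuid(Component.class, uuid); // single query
        componentFindings.forEach(finding -> handle(component, finding));
    });
}
```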
~Similarly, `super.isEnabled` is executed once for each vulnerability:~ (Edit: You already found and fixed that)
https://github.com/DependencyTrack/dependency-track/blob/35c7f862b2e30815465dde8cb6f691405304472c/src/main/java/org/dependencytrack/tasks/scanners/TrivyAnalysisTask.java#L319
~The method executes a DB query behind the scenes, which again can be expensive. Really it only needs to be executed once when the task is started.~
There's also logic here that checks whether a vulnerability exists, and creates it if necessary:
https://github.com/DependencyTrack/dependency-track/blob/35c7f862b2e30815465dde8cb6f691405304472c/src/main/java/org/dependencytrack/tasks/scanners/TrivyAnalysisTask.java#L450-L456
This is executed for every Component<->Vulnerability pair. So if the same vulnerability is reported for multiple components, we're doing redundant work.
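A sketch of what de-duplicating that work could look like, using a simple per-task map keyed by source and vuln ID (`getOrSynchronize` and `syncedVulns` are illustrative names; `synchronizeVulnerability` is assumed to follow the existing QueryManager method):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Vulnerabilities already looked up or created during this analysis task,
// keyed by "SOURCE|VULN-ID". Each unique vulnerability is synchronized once,
// no matter how many components it affects.
private final Map<String, Vulnerability> syncedVulns = new ConcurrentHashMap<>();

private Vulnerability getOrSynchronize(final Vulnerability parsed) {
    final String key = parsed.getSource() + "|" + parsed.getVulnId();
    return syncedVulns.computeIfAbsent(key,
            ignored -> qm.synchronizeVulnerability(parsed, false));
}
```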
So, to summarize, there are multiple opportunities here to structure the code in such a way that:
- Each component is only fetched once
- Each vulnerability is only synchronized once
- Configuration (`isEnabled`) is only fetched once
Thanks for the feedback @nscuro 🙂 My original approach really revolved around making the smallest change possible in order to meet the release window, but I agree the code in general could do with some optimisation. I'll have a go at implementing some form of local cache to handle the proposed optimisations. Did you have an idea of how that would best be implemented? A simple map would do the trick, but I'm not sure whether something more robust would be warranted, for example caching vulnerabilities across scans 🤔
I do think (from our usage at least) that concurrency might still be necessary: for example, we are seeing some scans drop from 10–12 minutes down to 1.5–2.5 minutes. There are likely thread pools out there better suited to this use case, where fairness can be taken into account. Did you have any thoughts on how concurrency might best be achieved here? A rough strawman is sketched below.
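Purely as a strawman for the fairness idea (all names hypothetical): a single pool shared across all analysis tasks plus a per-task `Semaphore`, so one large BOM can't occupy every slot:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

class FairDispatch {

    // Shared across all analysis tasks (sized via TRIVY_THREAD_POOL_SIZE in this PR).
    private static final ExecutorService SHARED_POOL = Executors.newFixedThreadPool(10);

    // Caps how many pool slots a single task may hold at once, so one large
    // BOM upload cannot starve other concurrently running tasks.
    private final Semaphore perTaskPermits = new Semaphore(4);

    void dispatch(final List<Runnable> resultHandlers) throws InterruptedException {
        for (final Runnable handler : resultHandlers) {
            perTaskPermits.acquire(); // blocks the submitting task, not the pool
            SHARED_POOL.submit(() -> {
                try {
                    handler.run();
                } finally {
                    perTaskPermits.release();
                }
            });
        }
    }
}
```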
New plan:
- do fewer things
- (only then) do things faster 😉