perf: Handle Trivy scan results asynchronously
Use CompletableFutures with a FixedThreadPool Executor to asynchronously handle each scan result from Trivy
Description
This change allows the TrivyAnalysisTask to handle scan results asynchronously by borrowing the existing `CompletableFuture` / `CountDownLatch` pattern used in the SnykAnalysisTask. The fixed thread pool size can be configured via `TRIVY_THREAD_POOL_SIZE` (default: 10).
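For illustration, a minimal sketch of the pattern (names like `TrivyResultDispatcher`, `ScanResult`, and `handleResult` are placeholders, not the task's actual API):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class TrivyResultDispatcher {

    // Pool size taken from TRIVY_THREAD_POOL_SIZE, defaulting to 10.
    private static final int POOL_SIZE =
            Integer.parseInt(System.getenv().getOrDefault("TRIVY_THREAD_POOL_SIZE", "10"));

    private final ExecutorService executor = Executors.newFixedThreadPool(POOL_SIZE);

    record ScanResult(String target) {}

    void handleResults(final List<ScanResult> results) throws InterruptedException {
        final CountDownLatch latch = new CountDownLatch(results.size());
        for (final ScanResult result : results) {
            CompletableFuture
                    .runAsync(() -> handleResult(result), executor)
                    .whenComplete((ignored, error) -> latch.countDown());
        }
        latch.await(); // The task only completes once every result is handled.
    }

    private void handleResult(final ScanResult result) {
        // Persist vulnerabilities, update findings, etc.
    }
}
```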
Addressed Issue
fixes #3570
Additional Details
- The `SCANNER_TRIVY_IGNORE_UNFIXED` setting is no longer checked for each vulnerability, as this is expensive and seemingly unnecessary. Instead, it is checked once per analysis task.
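Roughly the shape of the change (`isIgnoreUnfixedEnabled`, `ParsedVuln`, and `process` are stand-ins for the actual task internals):

```java
// Fetch the setting once when the analysis task starts (one config lookup)...
final boolean ignoreUnfixed = isIgnoreUnfixedEnabled(); // wraps the DB-backed isEnabled check

// ...then reuse the cached boolean inside the per-vulnerability loop.
for (final ParsedVuln vuln : parsedVulns) {
    if (ignoreUnfixed && vuln.fixedVersion() == null) {
        continue; // cheap in-memory check, no query per iteration
    }
    process(vuln);
}
```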
Checklist
- [x] I have read and understand the contributing guidelines
- ~[ ] This PR fixes a defect, and I have provided tests to verify that the fix is effective~
- ~[ ] This PR implements an enhancement, and I have provided tests to verify that it works as intended~
- ~[ ] This PR introduces changes to the database model, and I have added corresponding update logic~
- ~[ ] This PR introduces new or alters existing behavior, and I have updated the documentation accordingly~
Coverage summary from Codacy
See diff coverage on Codacy
| Coverage variation | Diff coverage |
|---|---|
| :white_check_mark: +0.01% (target: -1.00%) | :white_check_mark: 87.10% (target: 70.00%) |
Coverage variation details
| | Coverable lines | Covered lines | Coverage |
|---|---|---|---|
| Common ancestor commit (3b52284f98526f533bd8b7b705e707f6f5847102) | 21573 | 15971 | 74.03% |
| Head commit (28d1d42e68bbf2fe75474c7da063ea7f51d4d3f1) | 21597 (+24) | 15991 (+20) | 74.04% (+0.01%) |
Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: `<coverage of head commit> - <coverage of common ancestor commit>`
Diff coverage details
| | Coverable lines | Covered lines | Diff coverage |
|---|---|---|---|
| Pull request (#3571) | 31 | 27 | 87.10% |
Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: `<covered lines added or modified> / <coverable lines added or modified> * 100%`
Thanks for the PR @robert-blackman! Very much appreciate folks looking into unreleased code, it's the best opportunity to tweak and fix things!
How is the performance holding up when uploading many BOMs? Processing records in a separate thread pool can potentially have the opposite effect. For example, with a worker thread pool size of 20, DT could be executing the VulnerabilityAnalysisTask up to 20 times concurrently. But now the Trivy thread pool acts as a bottleneck, as all 20 worker threads shove tasks into it.
Good point @nscuro! Maybe a better approach (perhaps a naive one 😂) would be to initialise a new executor with each TrivyAnalysisTask, terminating it after each analysis? This simple approach wouldn't allow for finer control over the total number of concurrent executions (not to mention the memory overhead), but it would prevent tasks from blocking one another for the most part. Any thoughts?
> Maybe a better approach (perhaps a naive one 😂) would be to initialise a new executor with each TrivyAnalysisTask, terminating it after each analysis?
In that case we may end up creating more contention as threads compete for database connections. One TrivyAnalysisTask alone could hog all database connections in the connection pool, depending on how many threads we're allocating per task. You could test this by building a container image with your changes locally, and using it in the Docker Compose dev setup with monitoring. The Grafana dashboard includes metrics for the database connection pools.
I believe throwing concurrency at the problem might not solve it without creating new issues. Instead, we may want to look at making the code more efficient. One example would be here:
https://github.com/DependencyTrack/dependency-track/blob/35c7f862b2e30815465dde8cb6f691405304472c/src/main/java/org/dependencytrack/tasks/scanners/TrivyAnalysisTask.java#L447
Where we fetch a Component from the database, for each reported vulnerability. If multiple vulnerabilities are reported for the same component, we're doing redundant queries. By grouping the vulnerabilities by affected component, we should be able to minimize those queries.
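A sketch of that grouping idea (`ReportedFinding`, `componentUuid`, and `handle` are hypothetical names; `qm.getObjectByUuid` follows the usual QueryManager lookup pattern):

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.stream.Collectors;

// Group the reported findings by affected component up front, so each
// Component is fetched from the database once, not once per vulnerability.
void processFindings(final List<ReportedFinding> findings) {
    final Map<UUID, List<ReportedFinding>> byComponent = findings.stream()
            .collect(Collectors.groupingBy(ReportedFinding::componentUuid));

    byComponent.forEach((uuid, componentFindings) -> {
        final Component component = qm.getObjectByUuid(Component.class, uuid); // single query
        componentFindings.forEach(finding -> handle(component, finding));
    });
}
```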
~Similarly, `super.isEnabled` is executed once for each vulnerability:~ (Edit: You already found and fixed that)
https://github.com/DependencyTrack/dependency-track/blob/35c7f862b2e30815465dde8cb6f691405304472c/src/main/java/org/dependencytrack/tasks/scanners/TrivyAnalysisTask.java#L319
~The method executes a DB query behind the scenes, which again can be expensive. Really it only needs to be executed once when the task is started.~
There's also logic here that checks whether a vulnerability exists, and creates it if necessary:
https://github.com/DependencyTrack/dependency-track/blob/35c7f862b2e30815465dde8cb6f691405304472c/src/main/java/org/dependencytrack/tasks/scanners/TrivyAnalysisTask.java#L450-L456
This is executed for every Component<->Vulnerability pair. So if the same vulnerability is reported for multiple components, we're doing redundant work.
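A sketch of what de-duplicating that work could look like, using a simple per-task map keyed by source and vuln ID (`getOrSynchronize` and `syncedVulns` are illustrative names; `synchronizeVulnerability` is assumed to follow the existing QueryManager method):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Vulnerabilities already looked up or created during this analysis task,
// keyed by "SOURCE|VULN-ID". Each unique vulnerability is synchronized once,
// no matter how many components it affects.
private final Map<String, Vulnerability> syncedVulns = new ConcurrentHashMap<>();

private Vulnerability getOrSynchronize(final Vulnerability parsed) {
    final String key = parsed.getSource() + "|" + parsed.getVulnId();
    return syncedVulns.computeIfAbsent(key,
            ignored -> qm.synchronizeVulnerability(parsed, false));
}
```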
So, to summarize, there are multiple opportunities here to structure the code in such a way that:
- Each component is only fetched once
- Each vulnerability is only synchronized once
- Configuration (`isEnabled`) is only fetched once
Thanks for the feedback @nscuro 🙂 My original approach really revolved around making the smallest change possible in order to meet the release window, but I agree the code in general could do with some optimisation. I'll have a go at implementing some form of local cache to handle the proposed optimisations. Did you have an idea of how that would best be implemented? A simple map would do the trick, but I'm not sure whether something more robust would be warranted, for example caching vulnerabilities across scans 🤔
I do think (from our usage at least) that concurrency might still be necessary: for example, we are seeing some scans drop from 10–12 minutes down to 1.5–2.5 minutes. There are likely thread pools out there better suited to this use case, where fairness can be taken into account. Did you have any thoughts on how concurrency might best be achieved here? A rough strawman is sketched below.
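Purely as a strawman for the fairness idea (all names hypothetical): a single pool shared across all analysis tasks plus a per-task `Semaphore`, so one large BOM can't occupy every slot:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

class FairDispatch {

    // Shared across all analysis tasks (sized via TRIVY_THREAD_POOL_SIZE in this PR).
    private static final ExecutorService SHARED_POOL = Executors.newFixedThreadPool(10);

    // Caps how many pool slots a single task may hold at once, so one large
    // BOM upload cannot starve other concurrently running tasks.
    private final Semaphore perTaskPermits = new Semaphore(4);

    void dispatch(final List<Runnable> resultHandlers) throws InterruptedException {
        for (final Runnable handler : resultHandlers) {
            perTaskPermits.acquire(); // blocks the submitting task, not the pool
            SHARED_POOL.submit(() -> {
                try {
                    handler.run();
                } finally {
                    perTaskPermits.release();
                }
            });
        }
    }
}
```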
New plan:
- do fewer things
- (only then) do things faster 😉