The MavenMetaAnalyzer task fails due to invalid URLs
Current Behavior
In the logs, I can see that the MavenMetaAnalyzer task fails due to invalid URLs built from parts of a component's PURL:
compose-dtrack-apiserver-1 | 2024-03-18 08:51:29,133 INFO [InternalAnalysisTask] Starting internal analysis task
compose-dtrack-apiserver-1 | 2024-03-18 08:51:29,133 INFO [InternalAnalysisTask] Analyzing 171 component(s)
compose-dtrack-apiserver-1 | [Fatal Error] :1:10: DOCTYPE is disallowed when the feature "http://apache.org/xml/features/disallow-doctype-decl" set to true.
compose-dtrack-apiserver-1 | 2024-03-18 08:51:31,639 ERROR [MavenMetaAnalyzer] Request failure
compose-dtrack-apiserver-1 | org.xml.sax.SAXParseException: DOCTYPE is disallowed when the feature "http://apache.org/xml/features/disallow-doctype-decl" set to true.
compose-dtrack-apiserver-1 | at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
compose-dtrack-apiserver-1 | at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
compose-dtrack-apiserver-1 | at java.xml/javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
compose-dtrack-apiserver-1 | at org.dependencytrack.tasks.repositories.MavenMetaAnalyzer.analyze(MavenMetaAnalyzer.java:86)
compose-dtrack-apiserver-1 | at org.dependencytrack.tasks.repositories.RepositoryMetaAnalyzerTask.analyze(RepositoryMetaAnalyzerTask.java:177)
compose-dtrack-apiserver-1 | at org.dependencytrack.tasks.repositories.RepositoryMetaAnalyzerTask.lambda$analyze$0(RepositoryMetaAnalyzerTask.java:121)
compose-dtrack-apiserver-1 | at io.github.resilience4j.retry.Retry.lambda$decorateCallable$5(Retry.java:237)
compose-dtrack-apiserver-1 | at io.github.resilience4j.retry.Retry.executeCallable(Retry.java:373)
compose-dtrack-apiserver-1 | at org.dependencytrack.util.CacheStampedeBlocker.readThroughOrPopulateCache(CacheStampedeBlocker.java:201)
compose-dtrack-apiserver-1 | at org.dependencytrack.tasks.repositories.RepositoryMetaAnalyzerTask.analyze(RepositoryMetaAnalyzerTask.java:126)
compose-dtrack-apiserver-1 | at org.dependencytrack.tasks.repositories.RepositoryMetaAnalyzerTask.inform(RepositoryMetaAnalyzerTask.java:91)
compose-dtrack-apiserver-1 | at alpine.event.framework.BaseEventService.lambda$publish$0(BaseEventService.java:110)
compose-dtrack-apiserver-1 | at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
compose-dtrack-apiserver-1 | at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
compose-dtrack-apiserver-1 | at java.base/java.lang.Thread.run(Unknown Source)
(NB: The DOCTYPE probably stems from a plain HTTP response for a 404 page, but this is just a guess since the URL isn't logged)
However, it is impossible to know which component(s) cause this, since the component name isn't logged in the analyze() method. Had it been logged, one could have inspected and corrected the component's PURL in the DB, and traced the chain that led to the invalid PURL.
My suggestion is that:
- Something along the lines of "Analyzing component " + component gets logged in the analyze() method, for traceability
- The URL is validated before it gets passed to processHttpRequest
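The two suggestions above could look roughly like the following. This is a minimal sketch, not the actual MavenMetaAnalyzer code: the class `MetaUrlCheck`, the helper `parseHttpUrl`, and the simplified `analyze(String purl, String url)` signature are hypothetical, and "valid" is assumed to mean an absolute http(s) URL with a host that parses as a `java.net.URI`.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;
import java.util.logging.Logger;

// Hypothetical helper, for illustration only; not part of Dependency-Track.
final class MetaUrlCheck {
    private static final Logger LOGGER = Logger.getLogger(MetaUrlCheck.class.getName());

    private MetaUrlCheck() {
    }

    /** Returns the parsed URI if url is an absolute http(s) URL with a host, otherwise empty. */
    public static Optional<URI> parseHttpUrl(String url) {
        if (url == null) {
            return Optional.empty();
        }
        try {
            URI uri = new URI(url); // throws on illegal characters such as spaces
            String scheme = uri.getScheme();
            boolean isHttp = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
            return (isHttp && uri.getHost() != null) ? Optional.of(uri) : Optional.empty();
        } catch (URISyntaxException e) {
            return Optional.empty();
        }
    }

    /**
     * Simplified stand-in for analyze(): logs the component first, then
     * refuses to issue a request if the generated URL is invalid.
     * Returns true if the request would have been made.
     */
    public static boolean analyze(String purl, String url) {
        LOGGER.info("Analyzing component " + purl);
        if (!parseHttpUrl(url).isPresent()) {
            LOGGER.warning("Invalid url: " + url + " (component: " + purl + ")");
            return false;
        }
        // The real analyzer would pass the URL to processHttpRequest here.
        return true;
    }
}
```

With something like this in place, the log would name the offending component and URL before any request is attempted, instead of only surfacing a SAXParseException afterwards.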
Steps to Reproduce
Hard to specify, since DTrack doesn't log which component triggers the failure.
Expected Behavior
- The generated URL gets validated before it gets used. If invalid, a warning along the lines of "Invalid url: " + url gets logged
- Each time the MavenMetaAnalyzer.analyze() method is called, "Analyzing " + component is logged for traceability
Dependency-Track Version
4.10.1
Dependency-Track Distribution
Container Image
Database Server
PostgreSQL
Database Server Version
13.13
Browser
N/A
Checklist
- [X] I have read and understand the contributing guidelines
- [X] I have checked the existing issues for whether this defect was already reported
Related to #3234.
I already added MDC usage to the new BomUploadProcessingTaskV2, we merely need to continue adding MDC wherever it makes sense.
https://github.com/DependencyTrack/dependency-track/blob/333c56d44a7db3447bb1e7126a05b8df6ea717b1/src/main/java/org/dependencytrack/tasks/BomUploadProcessingTaskV2.java#L148-L151
The benefit of using MDC is that it will attach the context variables to all log statements within its scope.
I'm thinking that, specifically for the repository meta analysis, we also want to include the name of the repository to which the request is made.