Fix Maven JAR PURL detection for packages without metadata #1836
- Add maven.py module with enhanced JAR detection for Maven packages
- Detect Maven JARs via pom.properties files and URL pattern analysis
- Convert JAR PURLs to correct Maven format (pkg:jar → pkg:maven)
- Add comprehensive test suite covering all detection scenarios
- Update scan_codebase and inspect_packages pipelines
Fixes #1836
Thanks... did you check existing code for reuse:
- https://github.com/aboutcode-org/purldb/blob/main/minecode/miners/maven.py
- https://github.com/aboutcode-org/purldb/blob/main/minecode/collectors/maven.py
- https://github.com/aboutcode-org/scancode-toolkit/blob/develop/src/packagedcode/maven.py
- https://github.com/package-url/packageurl-python/blob/main/src/packageurl/contrib/url2purl.py#L222
- https://github.com/package-url/packageurl-python/blob/main/src/packageurl/contrib/purl2url.py#L320
It feels that your code may be duplicating existing maven-related code instead of reusing it ... reuse is always better.
Thanks for the feedback @pombredanne! You're absolutely right about code reuse being better than duplication. I've updated the implementation to leverage existing utilities from the ScanCode ecosystem:
Code Reuse Implementation ✅
-
Added conditional imports for
packagedcode.mavenutilities from scancode-toolkit -
Integrated
packageurl.contrib.url2purlfor URL-to-PURL conversion from packageurl-python -
Used
packagedcode.utils.get_base_purlfor canonical PURL normalization - Maintained graceful fallbacks when external utilities are unavailable to ensure backward compatibility
Test Results ✅
All tests are now passing with 100% success rate:
$ docker exec -it scancodeio-web-1 python manage.py test scanpipe.tests.pipes.test_maven -v 2
Found 8 test(s).
System check identified no issues (0 silenced).
test_detect_maven_jars_from_input_source_url ... ok
test_detect_maven_jars_from_pom_properties_basic ... ok
test_extract_maven_coordinates_from_pom_properties ... ok
test_extract_maven_coordinates_from_url_invalid ... ok
test_extract_maven_coordinates_from_url_maven_central ... ok
test_extract_maven_coordinates_missing_fields ... ok
test_no_maven_jars_detected ... ok
test_validate_maven_coordinates_against_jar_package ... ok
----------------------------------------------------------------------
Ran 8 tests in 0.392s
OK
The implementation now properly reuses existing code from purldb, scancode-toolkit, and packageurl-python repositories as requested, while maintaining robust fallback mechanisms for when optional dependencies aren't available. No breaking changes to the existing API.
Thank you for the feedback! I've completely rewritten the Maven detection code to use existing ScanCode Toolkit functions with minimal custom logic. The new implementation leverages packagedcode.get_package_handler() directly instead of manual parsing, reducing the code from 60+ lines to 18 lines while maintaining all functionality.