Fix Maven JAR PURL detection for packages without metadata #1836

Open • sarafarajnasardi opened this issue 4 months ago • 3 comments

  • Add maven.py module with enhanced JAR detection for Maven packages
  • Detect Maven JARs via pom.properties files and URL pattern analysis
  • Convert JAR PURLs to correct Maven format (pkg:jar → pkg:maven)
  • Add comprehensive test suite covering all detection scenarios
  • Update scan_codebase and inspect_packages pipelines
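As a rough illustration of the pom.properties path described above, here is a minimal standard-library sketch. It is not the code in this PR: the function names `parse_pom_properties` and `maven_purl_from_pom_properties` are hypothetical, and the PURL string is built by hand without percent-encoding, which a real implementation would presumably delegate to packageurl-python.

```python
def parse_pom_properties(text):
    """Return the key/value pairs found in pom.properties content."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the comment lines Maven writes at the top.
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def maven_purl_from_pom_properties(text):
    """Build a pkg:maven PURL string, or return None if a coordinate is missing.

    Hypothetical helper: assumes the coordinates need no percent-encoding.
    """
    props = parse_pom_properties(text)
    group_id = props.get("groupId")
    artifact_id = props.get("artifactId")
    version = props.get("version")
    if not (group_id and artifact_id and version):
        return None
    return f"pkg:maven/{group_id}/{artifact_id}@{version}"


sample = """#Created by Apache Maven
groupId=org.apache.commons
artifactId=commons-lang3
version=3.12.0
"""
print(maven_purl_from_pom_properties(sample))
```

With all three coordinates present, the JAR can be reported as `pkg:maven/...` rather than `pkg:jar/...`; when any coordinate is missing the helper returns None so the caller can keep the original PURL.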

Fixes #1836

sarafarajnasardi, Sep 09 '25 11:09

Thanks... did you check the existing code for possible reuse? For example:

  • https://github.com/aboutcode-org/purldb/blob/main/minecode/miners/maven.py
  • https://github.com/aboutcode-org/purldb/blob/main/minecode/collectors/maven.py
  • https://github.com/aboutcode-org/scancode-toolkit/blob/develop/src/packagedcode/maven.py
  • https://github.com/package-url/packageurl-python/blob/main/src/packageurl/contrib/url2purl.py#L222
  • https://github.com/package-url/packageurl-python/blob/main/src/packageurl/contrib/purl2url.py#L320

It feels like your code may be duplicating existing Maven-related code instead of reusing it... reuse is always better.

pombredanne, Sep 09 '25 17:09

Thanks for the feedback @pombredanne! You're absolutely right about code reuse being better than duplication. I've updated the implementation to leverage existing utilities from the ScanCode ecosystem:

Code Reuse Implementation ✅

  • Added conditional imports for packagedcode.maven utilities from scancode-toolkit
  • Integrated packageurl.contrib.url2purl for URL-to-PURL conversion from packageurl-python
  • Used packagedcode.utils.get_base_purl for canonical PURL normalization
  • Maintained graceful fallbacks when external utilities are unavailable to ensure backward compatibility
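The conditional-import pattern above could look roughly like the sketch below. This is an assumption about the shape of the code, not a copy of it: `fallback_maven_purl`, `purl_from_url`, and the `MAVEN_PATH` regex are hypothetical names, and the fallback only recognizes `.jar` URLs on repo1.maven.org. The only external call used is `packageurl.contrib.url2purl.get_purl`.

```python
import re
from urllib.parse import urlparse

try:
    # Optional dependency: packageurl-python.
    from packageurl.contrib import url2purl
except ImportError:
    url2purl = None

# Hypothetical pattern for Maven Central JAR paths:
# /maven2/<group/as/path>/<artifact>/<version>/<file>.jar
MAVEN_PATH = re.compile(
    r"^/maven2/(?P<group>.+)/(?P<artifact>[^/]+)/(?P<version>[^/]+)/[^/]+\.jar$"
)


def fallback_maven_purl(url):
    """Derive a pkg:maven PURL string from a Maven Central JAR URL, else None."""
    parsed = urlparse(url)
    if parsed.netloc != "repo1.maven.org":
        return None
    match = MAVEN_PATH.match(parsed.path)
    if not match:
        return None
    group = match.group("group").replace("/", ".")
    return f"pkg:maven/{group}/{match.group('artifact')}@{match.group('version')}"


def purl_from_url(url):
    """Prefer url2purl when it is installed; otherwise use the local fallback."""
    if url2purl is not None:
        purl = url2purl.get_purl(url)
        if purl is not None:
            return str(purl)
    return fallback_maven_purl(url)
```

Keeping the fallback in its own function means it can be tested without packageurl-python installed, while `purl_from_url` stays the single entry point for callers.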

Test Results ✅

All 8 tests now pass:

$ docker exec -it scancodeio-web-1 python manage.py test scanpipe.tests.pipes.test_maven -v 2

Found 8 test(s).
System check identified no issues (0 silenced).

test_detect_maven_jars_from_input_source_url ... ok
test_detect_maven_jars_from_pom_properties_basic ... ok  
test_extract_maven_coordinates_from_pom_properties ... ok
test_extract_maven_coordinates_from_url_invalid ... ok
test_extract_maven_coordinates_from_url_maven_central ... ok
test_extract_maven_coordinates_missing_fields ... ok
test_no_maven_jars_detected ... ok
test_validate_maven_coordinates_against_jar_package ... ok

----------------------------------------------------------------------
Ran 8 tests in 0.392s
OK

The implementation now properly reuses existing code from purldb, scancode-toolkit, and packageurl-python repositories as requested, while maintaining robust fallback mechanisms for when optional dependencies aren't available. No breaking changes to the existing API.

sarafarajnasardi, Sep 10 '25 10:09

Thank you for the feedback! I've completely rewritten the Maven detection code to use existing ScanCode Toolkit functions with minimal custom logic. The new implementation leverages packagedcode.get_package_handler() directly instead of manual parsing, reducing the code from 60+ lines to 18 lines while maintaining all functionality.

sarafarajnasardi, Sep 11 '25 15:09