BUG: Packages being created with inadequate PURL data

Open DennisClark opened this issue 10 months ago • 2 comments

This problem affects multiple AboutCode projects, but its impact is most apparent to the DejaCode user. A recent import of an SBOM to a product in DejaCode resulted in the creation of 3 different package definitions for pkg:github/pypa/pip@20.3.1, each with a different download URL. A subsequent search for pip@20.3.1 turned up 2 older package definitions for pkg:pypi/pip@20.3.1, each with a different download URL. These are not duplicate packages, but the PURLs are not well defined and should contain additional details to differentiate them:

  • The 2 pypi packages should have a file_name qualifier.
  • The 3 github packages should have a subpath value.

Screenshot of the 5 pip@20.3.1 packages attached.

DennisClark, Mar 05 '25 17:03

Further reference: https://github.com/package-url/purl-spec/blob/main/PURL-TYPES.rst#pypi and https://github.com/package-url/purl-spec/blob/main/PURL-TYPES.rst#github

DennisClark, Mar 05 '25 17:03

This needs a bit of discussion and design, as I'm assuming you expect the following "missing" values to be automated:

The 2 pypi packages should have a file_name qualifier.

This would mean changing the url2purl library to always return qualifiers such as file_name for pretty much all generated PURLs, which would have quite an impact.

Currently:

>>> from packageurl.contrib import url2purl
>>> url = "https://files.pythonhosted.org/packages/ab/11/2dc62c5263d9eb322f2f028f7b56cd9d096bb8988fcf82d65fa2e4057afe/pip-20.3.1-py2.py3-none-any.whl"
>>> url2purl.get_purl(url)
PackageURL(type='pypi', namespace=None, name='pip', version='20.3.1', qualifiers={}, subpath=None)

Expected?

>>> url2purl.get_purl(url)
PackageURL(type='pypi', namespace=None, name='pip', version='20.3.1', qualifiers={'file_name': 'pip-20.3.1-py2.py3-none-any.whl'}, subpath=None)
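
To make the "Expected?" output concrete, here is a minimal sketch of how a file_name qualifier could be derived from a download URL. It uses only the Python standard library. `with_file_name_qualifier` is a hypothetical helper, not part of packageurl-python, and it works on plain PURL strings rather than PackageURL objects. It does not sort qualifier keys as a canonical PURL would, and its check for an existing qualifier is a naive substring match.

```python
from posixpath import basename
from urllib.parse import quote, urlparse


def with_file_name_qualifier(purl: str, download_url: str) -> str:
    """Hypothetical helper: append a file_name qualifier taken from the URL path."""
    file_name = basename(urlparse(download_url).path)
    # Leave the PURL alone if there is no file name or one is already set.
    if not file_name or "file_name=" in purl:
        return purl
    # Qualifiers sit between the version and the optional "#subpath".
    base, sep, subpath = purl.partition("#")
    joiner = "&" if "?" in base else "?"
    return f"{base}{joiner}file_name={quote(file_name, safe='')}{sep}{subpath}"


url = (
    "https://files.pythonhosted.org/packages/ab/11/"
    "2dc62c5263d9eb322f2f028f7b56cd9d096bb8988fcf82d65fa2e4057afe/"
    "pip-20.3.1-py2.py3-none-any.whl"
)
print(with_file_name_qualifier("pkg:pypi/pip@20.3.1", url))
```

Applied to the two pypi packages, a step like this would yield two distinct PURLs as long as their download URLs end in different file names.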

The 3 github packages should have a subpath value.

The PURL data was likely imported "as provided" in the SBOMs. I'm not sure we want to tamper with the provided PURL values during SBOM import.

For those URLs, the url2purl library already provides the subpath value:

>>> url2purl.get_purl("https://github.com/pypa/pip/blob/20.3.1/src/pip/_internal/models/wheel.py")
PackageURL(type='github', namespace='pypa', name='pip', version='20.3.1', qualifiers={}, subpath='src/pip/_internal/models/wheel.py')

However, when a PURL is provided in the SBOM, DejaCode imports it as-is and does not try to override the provided data, since overriding it could be unexpected behavior.
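
As a sketch of the trade-off described here, the snippet below reports which fields a URL-derived PURL has that the SBOM-provided PURL lacks, without modifying either one. `missing_purl_details` is a hypothetical helper, not DejaCode or packageurl-python code, and it assumes both PURLs are available as plain dicts shaped like the output of `PackageURL.to_dict()`.

```python
def missing_purl_details(provided: dict, derived: dict) -> dict:
    """Return fields set in `derived` but empty in `provided`; neither dict is changed."""
    return {
        field: value
        for field, value in derived.items()
        if value and not provided.get(field)
    }


# PURL as provided in the SBOM (assumed example values from this thread).
provided = {
    "type": "github",
    "namespace": "pypa",
    "name": "pip",
    "version": "20.3.1",
    "qualifiers": {},
    "subpath": None,
}
# PURL derived from the download URL, here with the subpath filled in.
derived = {**provided, "subpath": "src/pip/_internal/models/wheel.py"}

print(missing_purl_details(provided, derived))
```

A report like this could be shown to the user at import time, leaving the decision to enrich the provided PURL to them.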

tdruez, Apr 21 '25 07:04