BUG: Packages being created with inadequate PURL data
This problem is actually associated with multiple AboutCode projects, but the impact is most apparent to the DejaCode user. A recent import of an SBOM to a product in DejaCode resulted in the creation of 3 different package definitions for pkg:github/pypa/pip@20.3.1, each with a different download URL. A subsequent search for pip@20.3.1 turned up 2 older package definitions for pkg:pypi/pip@20.3.1, each with a different download URL. We don't have a problem of duplicate packages here, but the PURLs are not well defined and should contain additional details to differentiate them:
- The 2 pypi packages should have a file_name qualifier.
- The 3 github packages should have a subpath value.
Screenshot of the 5 pip@20.3.1 packages attached.
Further reference: https://github.com/package-url/purl-spec/blob/main/PURL-TYPES.rst#pypi and https://github.com/package-url/purl-spec/blob/main/PURL-TYPES.rst#github
This needs a bit of discussion and design, as I'm assuming you expect the following "missing" values to be automated:
The 2 pypi packages should have a file_name qualifier.
This would mean changing the url2purl library to always return qualifiers such as file_name for pretty much all of the generated PURLs, which would have quite an impact.
Currently:
>>> from packageurl.contrib import url2purl
>>> url = "https://files.pythonhosted.org/packages/ab/11/2dc62c5263d9eb322f2f028f7b56cd9d096bb8988fcf82d65fa2e4057afe/pip-20.3.1-py2.py3-none-any.whl"
>>> url2purl.get_purl(url)
PackageURL(type='pypi', namespace=None, name='pip', version='20.3.1', qualifiers={}, subpath=None)
Expected?
>>> url2purl.get_purl(url)
PackageURL(type='pypi', namespace=None, name='pip', version='20.3.1', qualifiers={'file_name': 'pip-20.3.1-py2.py3-none-any.whl'}, subpath=None)
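To make the idea concrete, here is a minimal sketch of how a file_name qualifier could be derived from the download URL. It uses only the standard library and works on PURL strings; `file_name_from_url` and `with_file_name_qualifier` are hypothetical helper names and are not part of packageurl-python.

```python
from posixpath import basename
from urllib.parse import quote, urlsplit


def file_name_from_url(url):
    """Return the last path segment of a download URL, or None if it is empty."""
    return basename(urlsplit(url).path) or None


def with_file_name_qualifier(purl, url):
    """Add a file_name qualifier to a PURL string that has no qualifiers yet.

    Hypothetical helper: PURLs that already carry qualifiers are returned
    unchanged, and an existing subpath (after '#') is preserved.
    """
    file_name = file_name_from_url(url)
    base, sep, subpath = purl.partition("#")
    if not file_name or "?" in base:
        return purl
    # Qualifier values are percent-encoded in the PURL string form.
    return f"{base}?file_name={quote(file_name, safe='')}{sep}{subpath}"


url = (
    "https://files.pythonhosted.org/packages/ab/11/"
    "2dc62c5263d9eb322f2f028f7b56cd9d096bb8988fcf82d65fa2e4057afe/"
    "pip-20.3.1-py2.py3-none-any.whl"
)
print(with_file_name_qualifier("pkg:pypi/pip@20.3.1", url))
# pkg:pypi/pip@20.3.1?file_name=pip-20.3.1-py2.py3-none-any.whl
```

With something like this, the 2 pypi packages would get distinct PURLs as long as their download URLs end in different file names.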
The 3 github packages should have a subpath value.
The PURL data was likely imported "as provided" in the SBOMs. I'm not sure we want to tamper with the provided PURL values during SBOM import.
For those URLs, the url2purl library already provides the subpath value:
>>> url2purl.get_purl("https://github.com/pypa/pip/blob/20.3.1/src/pip/_internal/models/wheel.py")
PackageURL(type='github', namespace='pypa', name='pip', version='20.3.1', qualifiers={}, subpath='src/pip/_internal/models/wheel.py')
but when a PURL is provided in the SBOM, DejaCode imports it as-is and does not try to override the provided data, since overriding it could be unexpected behavior.
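One possible middle ground, sketched here only as a discussion aid, is to leave the imported PURL untouched and merely report the fields that could be filled in from the download URL. `suggest_missing_fields` is a hypothetical helper that works on plain dicts of PURL components. It is not existing DejaCode or packageurl-python behavior.

```python
def suggest_missing_fields(provided, inferred):
    """Return inferred PURL fields that the provided PURL leaves empty.

    Both arguments are plain dicts of PURL components; neither is modified.
    """
    return {
        field: value
        for field, value in inferred.items()
        if value and not provided.get(field)
    }


# Components of the PURL as given in the SBOM (no subpath) ...
provided = {
    "type": "github",
    "namespace": "pypa",
    "name": "pip",
    "version": "20.3.1",
    "qualifiers": {},
    "subpath": None,
}
# ... and the components url2purl reports for the blob URL quoted above.
inferred = dict(provided, subpath="src/pip/_internal/models/wheel.py")

print(suggest_missing_fields(provided, inferred))
# {'subpath': 'src/pip/_internal/models/wheel.py'}
```

The suggestions could then be shown to the user for confirmation instead of being applied silently during import.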