BUG: Improve package from PurlDB fails due to parameter in PURL
Describe the bug
Working with DejaCode in a build of https://github.com/aboutcode-org/dejacode/commit/925d4045897da9d7b3de98b8ff3eda3c75b6833d I noticed that several Python packages were not being assigned download URLs when using "Improve Package from PurlDB". The product was created by importing an SBOM, and the populate_purldb pipeline was manually added in ScanCode.io to the project that was imported.
The following was noticed when checking the package that did not get assigned:
- The tab for PurlDB was greyed-out, indicating that no matching package was found
- Searching for the package name in DejaCode revealed several entries, but all of them with a parameter for the filename attached to the PURL
Examples:
- PURL of package in product:
  pkg:pypi/boto3@1.37.26
- Entries in PurlDB:
  pkg:pypi/boto3@1.37.26?file_name=boto3-1.37.26-py3-none-any.whl
  pkg:pypi/boto3@1.37.26?file_name=boto3-1.37.26.tar.gz
It seems that either several files can be referenced by the same PURL and the mapping between them is unclear, or DejaCode only considers exact matches and does not ignore the parameters. Could this be an issue whenever different distribution formats, target platforms, or processor architectures exist?
To Reproduce
Steps to reproduce the behavior:
- Deploy DejaCode, ScanCode.io, and PurlDB
- Create a product
- Import an SBOM containing package from pypi, such as boto3
- Run populate_purldb in ScanCode.io for the SBOM
- Use Improve Package from PurlDB and notice that some packages still do not have a download URL
- Check that there are entries in the PurlDB but they do not get matched to the package
Expected behavior
If entries for the package exist, then DejaCode should be able to assign the download URL and other metadata to the package.
Context (OS, Browser, Device, etc.): n.a.
This might just as well be an issue on the PurlDB side, but it seems to me that the entries are technically correct, so it should be up to DejaCode to apply them correctly. I can see why this is currently not working, and there might not be a correct solution if the mapping is not clear based on the PURL alone. However, there also does not seem to be a way to manually fix the association in DejaCode, unless I'm mistaken.
@ghsa-retrieval Thanks for reporting this with such great details 👍
A solution was implemented in https://github.com/aboutcode-org/dejacode/pull/308 and is available in the latest main branch and the new v5.3.0 release.
From https://github.com/aboutcode-org/dejacode/pull/308:
The recent changes included in #304 were a bit too restrictive in the PURL comparison. This PR excludes the PURL qualifiers and subpaths for the PURL comparison when matching data returned by the PURLDB.
- Those 2 PurlDB entries will now be available in the PurlDB tab.
- When using the "Improve Package from PurlDB" feature, the common data from those 2 PurlDB entries will be set on the package.
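A minimal sketch of the comparison described in the PR text, assuming plain string handling is enough for illustration. The `plain_purl` and `same_package` helpers are hypothetical names, not DejaCode's implementation, and the boto3 PURLs are written out from the file names in the example above.

```python
def plain_purl(purl: str) -> str:
    """Return the PURL without its subpath ("#...") and qualifiers ("?...")."""
    # In a PURL, the subpath follows "#" and the qualifiers follow "?".
    without_subpath = purl.split("#", 1)[0]
    return without_subpath.split("?", 1)[0]


def same_package(purl_a: str, purl_b: str) -> bool:
    """Compare two PURLs while ignoring qualifiers and subpaths."""
    return plain_purl(purl_a) == plain_purl(purl_b)


product_purl = "pkg:pypi/boto3@1.37.26"
purldb_purls = [
    "pkg:pypi/boto3@1.37.26?file_name=boto3-1.37.26-py3-none-any.whl",
    "pkg:pypi/boto3@1.37.26?file_name=boto3-1.37.26.tar.gz",
    "pkg:pypi/botocore@1.37.26",  # different package, must not match
]

# Both boto3 entries match the generic product PURL; botocore does not.
matches = [purl for purl in purldb_purls if same_package(purl, product_purl)]
```

Stripping the qualifiers on both sides keeps the comparison symmetric, so it does not matter whether the product PURL or the PurlDB PURL carries the extra detail.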
Now, we still have the problem where the download URL will not be set, as there is a choice to make between the 2 entries available in the PurlDB:
- boto3-1.37.26-py3-none-any.whl
- boto3-1.37.26.tar.gz
To help make that decision, DejaCode should leverage the new package_content field, recently added on the PurlDB side.
See https://github.com/aboutcode-org/purldb/blob/main/packagedb/models.py#L436 and https://github.com/aboutcode-org/purldb/blob/main/packagedb/models.py#L520 for implementation details.
The idea is to store the type of package distributions, such as source or binary. The various types have a priority that could be used to select a default record from the PurlDB to be used on the DejaCode side when a generic PURL is provided.
For our boto3 example:
PURL in a DejaCode product: pkg:pypi/boto3@1.37.26
Entries in PurlDB:
- pkg:pypi/boto3@1.37.26?file_name=boto3-1.37.26-py3-none-any.whl (package_content=BINARY)
- pkg:pypi/boto3@1.37.26?file_name=boto3-1.37.26.tar.gz (package_content=SOURCE_ARCHIVE)
Since the source package has a higher priority than a binary, the imported download URL would be for the boto3-1.37.26.tar.gz entry.
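The selection idea can be sketched as follows. This is only an illustration: `CONTENT_PRIORITY`, `select_default_entry`, the entry dictionaries, and the `example.invalid` URLs are all hypothetical, and the real priority values live in the PurlDB model linked above. Returning nothing when several entries tie is an assumption of this sketch.

```python
# Hypothetical ranking, lower number wins. Only the two values named in
# this thread are listed; PurlDB defines the authoritative ordering.
CONTENT_PRIORITY = {"SOURCE_ARCHIVE": 0, "BINARY": 1}


def select_default_entry(entries):
    """Return the single best-ranked entry, or None if there is no clear winner."""
    ranked = [e for e in entries if e.get("package_content") in CONTENT_PRIORITY]
    if not ranked:
        return None
    best_rank = min(CONTENT_PRIORITY[e["package_content"]] for e in ranked)
    best = [e for e in ranked if CONTENT_PRIORITY[e["package_content"]] == best_rank]
    # Several entries with the same rank (e.g. many wheels) stay ambiguous.
    return best[0] if len(best) == 1 else None


entries = [
    {
        "package_content": "BINARY",
        "download_url": "https://example.invalid/boto3-1.37.26-py3-none-any.whl",
    },
    {
        "package_content": "SOURCE_ARCHIVE",
        "download_url": "https://example.invalid/boto3-1.37.26.tar.gz",
    },
]

selected = select_default_entry(entries)
```

With these two entries the source archive is selected, so its download URL would be the one copied to the package.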
@ghsa-retrieval Let me know if that logic makes sense on your side.
The package_content values are available for most package types but not yet for PyPI, entered as https://github.com/aboutcode-org/purldb/issues/619
Also, some API enhancements are required, entered as https://github.com/aboutcode-org/purldb/issues/620
Next, I will implement the package_content selection logic in DejaCode.
@tdruez thank you so much for your work.
Yes, your proposal for the download URL makes sense to me. Also prioritizing the source archive is an excellent idea, as this will probably yield much better results in the license scans, if available. It is also a good default because it is platform agnostic. Note: I think your example entries for the PurlDB have BINARY and SOURCE_ARCHIVE swapped.
How would you handle the case where there is no source archive and only various binary distributions (different processor architectures and/or Python versions)? I guess in that case we would simply not have a download URL because there is no clear mapping possible.
If this is implemented, it will be a really great improvement!
> Note: I think your example entries for the PurlDB have BINARY and SOURCE_ARCHIVE swapped.
Good catch, I've updated the comment:
boto3-1.37.26-py3-none-any.whl -> package_content=BINARY
boto3-1.37.26.tar.gz -> package_content=SOURCE_ARCHIVE
> How would you handle the case where there is no source archive and only various binary distributions (different processor architectures and/or Python versions)? I guess in that case we would simply not have a download URL because there is no clear mapping possible.
In that case we would need more context to select the proper binary (OS, Python version, etc.). Currently, the data available on all PurlDB records is merged and the common values are used on the package, but no download URL is set. That should not be a major issue though, as most Python packages should have either a source package or a single wheel.
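A rough sketch of what "merging and keeping the common values" could look like, assuming each PurlDB record is a plain dictionary. The `merge_common_values` helper, the field names, and the `example.invalid` URLs are illustrative assumptions, not DejaCode's actual code.

```python
def merge_common_values(entries):
    """Keep only the fields whose value is identical in every entry."""
    if not entries:
        return {}
    first, *rest = entries
    return {
        key: value
        for key, value in first.items()
        if all(key in other and other[key] == value for other in rest)
    }


# Two hypothetical records for the same release, differing only per file.
wheel = {
    "name": "boto3",
    "version": "1.37.26",
    "declared_license_expression": "apache-2.0",
    "download_url": "https://example.invalid/boto3-1.37.26-py3-none-any.whl",
}
sdist = {
    "name": "boto3",
    "version": "1.37.26",
    "declared_license_expression": "apache-2.0",
    "download_url": "https://example.invalid/boto3-1.37.26.tar.gz",
}

merged = merge_common_values([wheel, sdist])
```

Because the download URLs differ between the records, `download_url` drops out of the merged result, which matches the behavior described above.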
@tdruez In tests with DejaCode 5.4.0 I see issues with mappings from packages to PurlDB entries, that may be related to the changes made in this ticket. I'm not sure if it already happened with the 5.3.0 release and I just missed it.
There is a package pkg:maven/com.fasterxml.jackson.core/[email protected]?type=jar, among others, for which DejaCode is unable to find a matching PurlDB entry despite an exact match existing alongside the source package. As a consequence, "Improve Packages from PurlDB" also does not apply any changes.
I've created a separate issue for this so these can be tracked independently in case the cause is not related.
The root cause is in the comparison as the qualifier also has to be removed from PurlDB's PURL. Details are in issue #383
@tdruez It looks like multiple commits have been made in PurlDB for the prerequisites you mentioned in https://github.com/aboutcode-org/dejacode/issues/307#issuecomment-2916380214
Are these now met so the selection between multiple matching PurlDB entries to copy a suitable download URL can be made?
I noticed that the exact mapping should already be possible. The issue with the Python packages is that after they have been imported from the SBOM, the only hash populated is SHA-256. The function to retrieve entries from PurlDB only considers the PURL, SHA-1, and download URL.
https://github.com/aboutcode-org/dejacode/blob/b6a09b0852ad47a139a3d0fe9a176f99acf0749a/component_catalog/models.py#L2544-L2550
If the other hashes are added as well, it should be working properly. This approach has upsides and downsides. It is 100% correct, as it will also use the download URL to the exact package used in the product. The downside for Python packages in particular is that the binary .whl packages will not have any license information to analyze unlike the source package.
The suggested fix unfortunately does not work. The SBOM import leads to the package in DejaCode only having a SHA-256 assigned but no download URL. For some reason the PurlDB entry for the package only has an MD5 hash assigned. Given that the two sides have different hash types, they cannot be matched. Additionally, even if they could, the suggested approach would stop working once DejaCode has more hashes populated than PurlDB, since the query parameters are combined with a logical AND.
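The AND problem can be illustrated with a small model of the filtering. This sketch does not call PurlDB: `matches_all` is a hypothetical stand-in for a filter where every query parameter must match, and the hash strings are placeholders rather than real digests.

```python
def matches_all(entry, query):
    """AND semantics: every query parameter must equal the entry's value."""
    return all(entry.get(field) == value for field, value in query.items())


# PurlDB side: only MD5 is populated (placeholder values, not real digests).
purldb_entry = {"md5": "md5-placeholder", "sha256": ""}

# DejaCode side: more hash types are populated than on the PurlDB side.
package_hashes = {"md5": "md5-placeholder", "sha256": "sha256-placeholder"}

# Sending all hashes in one query fails, because the empty SHA-256 on the
# PurlDB side can never equal the value held by DejaCode.
combined_hit = matches_all(purldb_entry, package_hashes)

# Querying one hash type at a time still finds the entry through MD5.
per_hash_hits = {
    field: matches_all(purldb_entry, {field: value})
    for field, value in package_hashes.items()
}
```

In this model the combined query misses the entry while the MD5-only query finds it, which is the motivation for querying per hash type.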
The proper way to solve this would be the following:
- Patch PurlDB to pull SHA-256 as well for PyPI (additionally check whether other package managers can pull more hash types)
  - https://github.com/aboutcode-org/purldb/blob/10081dd502dcfc0953de333fe8afb399db5f2a88/minecode/miners/pypi.py#L134
  - https://github.com/aboutcode-org/purldb/blob/10081dd502dcfc0953de333fe8afb399db5f2a88/minecode/miners/pypi.py#L270
- Patch DejaCode's get_purldb_entries to selectively query PurlDB per hash type with find_packages, then combine the results
- Patch other areas where DejaCode retrieves data from PurlDB for display using the same logic (e.g. for data display in tab)
Perhaps not ready for a pull request, but it seems this is working significantly better. Alternatively one could also go just by PURL and filter based on hashes locally.
def get_purldb_entries(self, user, max_request_call=0, timeout=10):
    """
    Return the PurlDB entries that correspond to this Package instance.

    Matching on the following fields order:
    - Hash (SHA-256, SHA-1, MD5)
    - Download URL
    - Package URL

    A `max_request_call` integer can be provided to limit the number of
    HTTP requests made to the PackageURL server.
    By default, one request will be made per field until a match is found.
    Providing max_request_call=1 will stop after the first request, even
    if nothing was found.
    """
    payloads = []
    purldb_entries = []
    package_url = self.package_url

    # One payload per field, so that each lookup is an independent request
    # and the fields are not combined with a logical AND.
    if self.sha256:
        payloads.append({"sha256": self.sha256})
    if self.sha1:
        payloads.append({"sha1": self.sha1})
    if self.md5:
        payloads.append({"md5": self.md5})
    if self.download_url:
        payloads.append({"download_url": self.download_url})
    if package_url:
        payloads.append({"purl": package_url})

    purldb = PurlDB(user.dataspace)
    for index, payload in enumerate(payloads):
        if max_request_call and index >= max_request_call:
            # Request budget used up without a match; return an empty list
            # so that the return type stays consistent.
            return []
        if purldb_entries := purldb.find_packages(payload, timeout):
            break

    if not purldb_entries:
        return []

    # Cleanup the PurlDB entries:
    # Packages with different "plain" PURL are excluded. The qualifiers and
    # subpaths are not involved in this comparison.
    if package_url:
        purldb_entries = [
            entry
            for entry in purldb_entries
            if get_plain_purl(entry.get("purl", "")) == get_plain_purl(package_url)
        ]

    return purldb_entries
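The alternative mentioned above, querying by PURL only and filtering on hashes locally, could look roughly like this. It is a sketch under assumptions: `filter_entries_by_hash` and `HASH_FIELDS` are hypothetical names, the entries are plain dictionaries, and the hash strings are placeholders.

```python
HASH_FIELDS = ("sha512", "sha256", "sha1", "md5")


def filter_entries_by_hash(package, entries):
    """Keep entries that agree with the package on every shared hash type."""
    matching = []
    for entry in entries:
        # Only compare hash types that are populated on both sides.
        shared = [f for f in HASH_FIELDS if package.get(f) and entry.get(f)]
        if shared and all(package[f] == entry[f] for f in shared):
            matching.append(entry)
    return matching


# Placeholder hash values, not real digests.
package = {"sha256": "sha256-of-sdist", "md5": "md5-of-sdist"}
entries_for_purl = [
    {"filename": "boto3-1.37.26.tar.gz", "md5": "md5-of-sdist"},
    {"filename": "boto3-1.37.26-py3-none-any.whl", "md5": "md5-of-wheel"},
]

exact = filter_entries_by_hash(package, entries_for_purl)
```

Note that this still returns nothing when the two sides share no hash type at all, for example SHA-256 only in DejaCode and MD5 only in PurlDB, so it does not remove the need to collect more hash types on the PurlDB side.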