BUG: Improve package from PurlDB fails due to parameter in PURL

Open rogu-beta opened this issue 7 months ago • 12 comments

Describe the bug Working with DejaCode in a build of https://github.com/aboutcode-org/dejacode/commit/925d4045897da9d7b3de98b8ff3eda3c75b6833d I noticed that several Python packages were not being assigned download URLs when using "Improve Package from PurlDB". The product was created by importing an SBOM, and the populate_purldb pipeline was manually added in ScanCode.io on the imported project.

The following was noticed when checking the package that did not get assigned:

  • The tab for PurlDB was greyed-out, indicating that no matching package was found
  • Searching for the package name in DejaCode revealed several entries, but all of them with a parameter for the filename attached to the PURL

Examples:

It seems like there are multiple potential files that could be referenced by the PURL and the mapping is unclear or DejaCode only considers exact matches without considering parameters. This seems to be an issue if different distribution formats or target platforms / processor architectures exist?

To Reproduce Steps to reproduce the behavior:

  1. Deploy DejaCode, ScanCode.io, and PurlDB
  2. Create a product
  3. Import an SBOM containing a package from PyPI, such as boto3
  4. Run populate_purldb in ScanCode.io for the SBOM
  5. Use Improve Package from PurlDB and notice that some packages still do not have a download URL
  6. Check that there are entries in the PurlDB but they do not get matched to the package

Expected behavior If entries for the package exist, then DejaCode should be able to assign the download URL and other metadata to the package.

Context (OS, Browser, Device, etc.): n.a.

rogu-beta avatar May 15 '25 08:05 rogu-beta

This might as well be an issue on the PurlDB side, but it seems to me that technically the entries are correct so it should be on DejaCode's side to correctly apply them. I can see why this is currently not working and there might not be a correct solution if the mapping is not clear based on just the PURL. However, there also does not seem to be a way to manually fix the association in DejaCode, unless I'm mistaken

rogu-beta avatar May 15 '25 08:05 rogu-beta

@ghsa-retrieval Thanks for reporting this with such great details 👍

A solution was implemented in https://github.com/aboutcode-org/dejacode/pull/308 and is available in the latest main branch and the new v5.3.0 release.

From https://github.com/aboutcode-org/dejacode/pull/308:

The recent changes included in #304 were a bit too restrictive in the PURL comparison. This PR excludes the PURL qualifiers and subpaths for the PURL comparison when matching data returned by the PURLDB.

  1. Those 2 PurlDB entries will now be available in the PurlDB tab.
  2. When using the "Improve Package from PurlDB" feature, the common data from those 2 PurlDB entries will be set on the package.
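
The comparison described above can be sketched as follows. This is a minimal illustration, not DejaCode's actual code: `plain_purl` is a hypothetical helper, and the third PurlDB entry (version 1.37.25) is made up to show a non-match.

```python
def plain_purl(purl: str) -> str:
    """Drop the qualifiers ("?...") and the subpath ("#...") from a PURL string."""
    # In a PURL the subpath follows the qualifiers, so cut at "#" and then at "?".
    return purl.split("#", 1)[0].split("?", 1)[0]


product_purl = "pkg:pypi/boto3@1.37.26"
purldb_purls = [
    "pkg:pypi/boto3@1.37.26?file_name=boto3-1.37.26-py3-none-any.whl",
    "pkg:pypi/boto3@1.37.26?file_name=boto3-1.37.26.tar.gz",
    "pkg:pypi/boto3@1.37.25?file_name=boto3-1.37.25.tar.gz",  # made-up non-match
]

# Both sides are normalized before comparing, so qualifiers never block a match.
matches = [p for p in purldb_purls if plain_purl(p) == plain_purl(product_purl)]
print(len(matches))  # 2
```

Normalizing both the product PURL and the PurlDB PURL keeps the comparison symmetric, whichever side carries the qualifiers.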

Now, we still have the problem where the download URL will not be set, as there is a choice to make between the 2 entries available in the PurlDB:

  • boto3-1.37.26-py3-none-any.whl
  • boto3-1.37.26.tar.gz

To help make that decision, DejaCode should leverage the new package_content field, recently added on the PurlDB side. See https://github.com/aboutcode-org/purldb/blob/main/packagedb/models.py#L436 and https://github.com/aboutcode-org/purldb/blob/main/packagedb/models.py#L520 for implementation details.

The idea is to store the type of package distributions, such as source or binary. The various types have a priority that could be used to select a default record from the PurlDB to be used on the DejaCode side when a generic PURL is provided.

For our boto3 example, the PURL in a DejaCode product is pkg:pypi/boto3@1.37.26. Entries in PurlDB:

  • pkg:pypi/boto3@1.37.26?file_name=boto3-1.37.26-py3-none-any.whl (package_content=BINARY)
  • pkg:pypi/boto3@1.37.26?file_name=boto3-1.37.26.tar.gz (package_content=SOURCE_ARCHIVE)

Since the source package has a higher priority than a binary, the imported download URL would be for the boto3-1.37.26.tar.gz entry.
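
A rough sketch of that selection, under stated assumptions: the `CONTENT_PRIORITY` numbers, the `select_default_entry` helper, and the dict shape of the entries are illustrative and are not taken from PurlDB or DejaCode. Only the "source before binary" order comes from the description above. Returning nothing when several entries share the best rank is also an assumption.

```python
# Hypothetical ranking: the lower number wins. Only the relative order
# (source archive before binary) comes from the discussion.
CONTENT_PRIORITY = {"SOURCE_ARCHIVE": 0, "BINARY": 1}


def select_default_entry(entries):
    """Return the single best-ranked entry, or None when the choice is ambiguous."""
    ranked = [e for e in entries if e.get("package_content") in CONTENT_PRIORITY]
    if not ranked:
        return None
    best_rank = min(CONTENT_PRIORITY[e["package_content"]] for e in ranked)
    best = [e for e in ranked if CONTENT_PRIORITY[e["package_content"]] == best_rank]
    # Several entries with the same rank (e.g. many wheels): no clear default.
    return best[0] if len(best) == 1 else None


entries = [
    {"filename": "boto3-1.37.26-py3-none-any.whl", "package_content": "BINARY"},
    {"filename": "boto3-1.37.26.tar.gz", "package_content": "SOURCE_ARCHIVE"},
]
chosen = select_default_entry(entries)
print(chosen["filename"])  # boto3-1.37.26.tar.gz
```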

@ghsa-retrieval Let me know if that logic makes sense on your side.

The package_content values are available for most package types but not yet for PyPI; this is tracked in https://github.com/aboutcode-org/purldb/issues/619. Some API enhancements are also required, tracked in https://github.com/aboutcode-org/purldb/issues/620.

Next, I will implement the package_content selection logic in DejaCode.

tdruez avatar May 28 '25 13:05 tdruez

@tdruez thank you so much for your work.

Yes, your proposal for the download URL makes sense to me. Also prioritizing the source archive is an excellent idea, as this will probably yield much better results in the license scans, if available. It is also a good default because it is platform agnostic. Note: I think your example entries for the PurlDB have BINARY and SOURCE_ARCHIVE swapped.

How would you handle the case where there is no source archive and only various binary distributions (different processor architectures and/or Python versions)? I guess in that case we would simply not have a download URL because there is no clear mapping possible.

If this is implemented, it will be a really great improvement!

rogu-beta avatar May 28 '25 14:05 rogu-beta

Note: I think your example entries for the PurlDB have BINARY and SOURCE_ARCHIVE swapped.

Good catch, I've updated the comment:

  • boto3-1.37.26-py3-none-any.whl -> package_content=BINARY
  • boto3-1.37.26.tar.gz -> package_content=SOURCE_ARCHIVE

How would you handle the case where there is no source archive and only various binary distributions (different processor architectures and/or Python versions)? I guess in that case we would simply not have a download URL because there is no clear mapping possible.

In that case we would need more context to select the proper binary (OS, Python version, etc.). Currently, the data available on all PurlDB records is merged and the common values are used on the package, but no download URL is set. That should not be a major issue though, as most Python packages should have either a source package or a single wheel.
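
The "common values" merge can be pictured with the sketch below. It is a simplified illustration, not DejaCode's implementation: the `common_values` helper, the field names, and the placeholder URLs are assumptions.

```python
def common_values(entries):
    """Keep only the fields whose non-empty value is identical across all entries."""
    if not entries:
        return {}
    first, *rest = entries
    return {
        field: value
        for field, value in first.items()
        if value not in (None, "", [], {})
        and all(entry.get(field) == value for entry in rest)
    }


# Placeholder data: the URLs below are not real download locations.
wheel = {
    "name": "boto3",
    "version": "1.37.26",
    "declared_license": "Apache-2.0",
    "download_url": "https://example.com/boto3-1.37.26-py3-none-any.whl",
}
sdist = {
    "name": "boto3",
    "version": "1.37.26",
    "declared_license": "Apache-2.0",
    "download_url": "https://example.com/boto3-1.37.26.tar.gz",
}

merged = common_values([wheel, sdist])
print(sorted(merged))  # ['declared_license', 'name', 'version']
```

Because the two download URLs differ, that field drops out of the merged result, which matches the behavior described above.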

tdruez avatar May 28 '25 14:05 tdruez

@tdruez In tests with DejaCode 5.4.0 I see issues with mappings from packages to PurlDB entries, that may be related to the changes made in this ticket. I'm not sure if it already happened with the 5.3.0 release and I just missed it.

Among others, there is a package pkg:maven/com.fasterxml.jackson.core/[email protected]?type=jar for which DejaCode is unable to find a matching PurlDB entry, despite an exact match existing alongside the source package. As a consequence, "Improve Packages from PurlDB" does not apply any changes either.

rogu-beta avatar Aug 18 '25 14:08 rogu-beta

I've created a separate issue for this so these can be tracked independently in case the cause is not related

rogu-beta avatar Aug 19 '25 07:08 rogu-beta

The root cause is in the comparison: the qualifier also has to be removed from PurlDB's PURL. Details are in issue #383.

rogu-beta avatar Aug 22 '25 16:08 rogu-beta

@tdruez It looks like multiple commits have been made in PurlDB for the prerequisites you mentioned in https://github.com/aboutcode-org/dejacode/issues/307#issuecomment-2916380214

Are these now met so the selection between multiple matching PurlDB entries to copy a suitable download URL can be made?

rogu-beta avatar Oct 08 '25 14:10 rogu-beta

I noticed that the exact mapping should already be possible. The issue with the Python packages is that after they have been imported from the SBOM, the only hash populated is SHA-256. The function to retrieve entries from PurlDB only considers the PURL, SHA-1, and download URL.

https://github.com/aboutcode-org/dejacode/blob/b6a09b0852ad47a139a3d0fe9a176f99acf0749a/component_catalog/models.py#L2544-L2550

If the other hashes are added as well, it should be working properly. This approach has upsides and downsides. It is 100% correct, as it will also use the download URL to the exact package used in the product. The downside for Python packages in particular is that the binary .whl packages will not have any license information to analyze unlike the source package.

rogu-beta avatar Nov 12 '25 09:11 rogu-beta

The suggested fix unfortunately does not work. The SBOM import leads to the package in DejaCode only having a SHA256 assigned but no download URL. For some reason the PurlDB entry for the package only has an MD5 hash assigned. Given that both have different hash types, they cannot be matched. Additionally, even if they did, the suggested approach would not work once DejaCode has more hashes populated than PurlDB since the query parameters are treated as logical and.
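
To illustrate both failure modes, here is a toy model of an AND filter. It is not PurlDB's API: `matches_all` and the hash strings are placeholders.

```python
def matches_all(entry, query):
    """Toy AND filter: every queried field must equal the entry's value."""
    return all(entry.get(field) == value for field, value in query.items())


# PurlDB side: only MD5 is populated for this entry.
purldb_entry = {"purl": "pkg:pypi/boto3@1.37.26", "md5": "md5-of-file"}

print(matches_all(purldb_entry, {"md5": "md5-of-file"}))        # True
# DejaCode only knows SHA-256 after the SBOM import: no overlap, no match.
print(matches_all(purldb_entry, {"sha256": "sha256-of-file"}))  # False
# Sending every known hash at once also fails, because of the AND semantics.
print(
    matches_all(purldb_entry, {"md5": "md5-of-file", "sha256": "sha256-of-file"})
)  # False
```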

rogu-beta avatar Nov 12 '25 12:11 rogu-beta

The proper way to solve this would be the following:

  1. Patch PurlDB to pull SHA256 as well for PyPI (additionally check whether other package managers can pull more hash types)
    • https://github.com/aboutcode-org/purldb/blob/10081dd502dcfc0953de333fe8afb399db5f2a88/minecode/miners/pypi.py#L134
    • https://github.com/aboutcode-org/purldb/blob/10081dd502dcfc0953de333fe8afb399db5f2a88/minecode/miners/pypi.py#L270
  2. Patch DejaCode's get_purldb_entries to query PurlDB separately per hash type with find_packages, then combine the results
  3. Patch the other areas where DejaCode retrieves data from PurlDB for display to use the same logic (e.g. the data shown in the PurlDB tab)
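
Step 2 could look roughly like the sketch below. `FakePurlDB` is a stand-in written for this example. It only mimics the `find_packages(payload, timeout)` call shape used in DejaCode. The `find_by_any_hash` helper and the de-duplication key are assumptions.

```python
HASH_FIELDS = ("sha256", "sha1", "md5")


class FakePurlDB:
    """In-memory stand-in for a PurlDB client (hypothetical, for illustration)."""

    def __init__(self, entries):
        self.entries = entries

    def find_packages(self, payload, timeout=None):
        # AND semantics: every field in the payload must match.
        return [
            e for e in self.entries
            if all(e.get(k) == v for k, v in payload.items())
        ]


def find_by_any_hash(purldb, package, timeout=10):
    """Query once per populated hash type and combine the results."""
    combined = {}
    for field in HASH_FIELDS:
        value = package.get(field)
        if not value:
            continue
        for entry in purldb.find_packages({field: value}, timeout) or []:
            # De-duplicate entries returned by more than one query.
            key = (entry.get("purl"), entry.get("download_url"))
            combined.setdefault(key, entry)
    return list(combined.values())


purldb = FakePurlDB([
    {"purl": "pkg:pypi/boto3@1.37.26", "download_url": "u1", "md5": "m1"},
    {"purl": "pkg:pypi/other@1.0", "download_url": "u2", "md5": "m2"},
])
package = {"sha256": "s1", "md5": "m1"}
found = find_by_any_hash(purldb, package)
print([e["download_url"] for e in found])  # ['u1']
```

Querying one hash at a time means a hash type missing on the PurlDB side no longer blocks a match on another hash type.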

rogu-beta avatar Nov 12 '25 12:11 rogu-beta

Perhaps not ready for a pull request, but it seems this is working significantly better. Alternatively one could also go just by PURL and filter based on hashes locally.

    def get_purldb_entries(self, user, max_request_call=0, timeout=10):
        """
        Return the PurlDB entries that correspond to this Package instance.

        Matching is attempted on the following fields, in order:
        - Hashes (SHA-256, SHA-1, MD5)
        - Download URL
        - Package URL

        A `max_request_call` integer can be provided to limit the number of
        HTTP requests made to the PurlDB server.
        By default, one request is made per field until a match is found.
        Providing max_request_call=1 will stop after the first request, even
        if nothing was found.
        """
        payloads = []
        purldb_entries = []

        package_url = self.package_url
        if self.sha256:
            payloads.append({"sha256": self.sha256})
        if self.sha1:
            payloads.append({"sha1": self.sha1})
        if self.md5:
            payloads.append({"md5": self.md5})
        if self.download_url:
            payloads.append({"download_url": self.download_url})
        if package_url:
            payloads.append({"purl": package_url})

        purldb = PurlDB(user.dataspace)
        for index, payload in enumerate(payloads):
            if max_request_call and index >= max_request_call:
                # Request budget exhausted: return an empty list, not None.
                return []

            if purldb_entries := purldb.find_packages(payload, timeout):
                break

        if not purldb_entries:
            return []

        # Cleanup the PurlDB entries:
        # Packages with different "plain" PURL are excluded. The qualifiers and
        # subpaths are not involved in this comparison.
        if package_url:
            purldb_entries = [
                entry
                for entry in purldb_entries
                if get_plain_purl(entry.get("purl", "")) == get_plain_purl(package_url)
            ]

        return purldb_entries
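
The alternative mentioned above, querying by PURL only and filtering on hashes locally, might look like this sketch. The `filter_by_hashes` helper and the sample hash values are hypothetical. Requiring at least one hash type present on both sides is a design assumption.

```python
HASH_FIELDS = ("sha256", "sha1", "md5")


def filter_by_hashes(package, entries):
    """Keep entries confirmed by every hash type populated on both sides."""
    confirmed = []
    for entry in entries:
        shared = [f for f in HASH_FIELDS if package.get(f) and entry.get(f)]
        # At least one hash type must be comparable, and none may disagree.
        if shared and all(package[f] == entry[f] for f in shared):
            confirmed.append(entry)
    return confirmed


package = {"sha256": "aaa", "md5": "111"}
entries = [
    {"filename": "boto3-1.37.26.tar.gz", "md5": "111"},
    {"filename": "boto3-1.37.26-py3-none-any.whl", "md5": "222"},
    {"filename": "no-hashes-known"},
]
print([e["filename"] for e in filter_by_hashes(package, entries)])
# ['boto3-1.37.26.tar.gz']
```

Unlike a single AND query, this comparison ignores hash types that only one side knows, so extra hashes on the DejaCode side do not prevent a match.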

rogu-beta avatar Nov 13 '25 18:11 rogu-beta