BUG: Improve Package from PurlDB failures

Open tdruez opened this issue 7 months ago • 6 comments

Extracted from https://github.com/aboutcode-org/dejacode/issues/295#issuecomment-2824782627

Running "Improve Package from PurlDB" fails with:

duplicate key value violates unique constraint "component_catalog_packag_dataspace_id_type_namesp_c6620419_uniq" DETAIL: Key (dataspace_id, type, namespace, name, version, qualifiers, subpath, download_url, filename)=(3, npm, , parse-json, 4.0.0, , , https://registry.npmjs.org/parse-json/-/parse-json-4.0.0.tgz, parse-json-4.0.0.tgz) already exists.

The failure occurs because assigning the download_url would make the package a full duplicate of an existing one.

Attempting to enhance the package in the product with data from PurlDB fails, because assigning the download URL would violate the uniqueness constraint that covers dataspace_id, type, namespace, name, version, qualifiers, subpath, download_url, and filename.

This is a corner case where the data pulled from the PurlDB can trigger a "unique constraint" violation when applied to a package.
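
The corner case can be reproduced in isolation. The sketch below is only an illustration: it uses an in-memory SQLite table with a simplified, assumed schema (the table name `package` and the column defaults are hypothetical, and DejaCode itself runs on PostgreSQL with its own model). It shows how filling in the download URL and filename on the sparse package collides with the complete one.

```python
import sqlite3

URL = "https://registry.npmjs.org/parse-json/-/parse-json-4.0.0.tgz"
FILENAME = "parse-json-4.0.0.tgz"

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the package table; only the columns named in the
# constraint from the error message are modelled (assumed schema).
conn.execute(
    """
    CREATE TABLE package (
        id INTEGER PRIMARY KEY,
        dataspace_id INTEGER NOT NULL,
        type TEXT NOT NULL DEFAULT '',
        namespace TEXT NOT NULL DEFAULT '',
        name TEXT NOT NULL DEFAULT '',
        version TEXT NOT NULL DEFAULT '',
        qualifiers TEXT NOT NULL DEFAULT '',
        subpath TEXT NOT NULL DEFAULT '',
        download_url TEXT NOT NULL DEFAULT '',
        filename TEXT NOT NULL DEFAULT '',
        UNIQUE (dataspace_id, type, namespace, name, version,
                qualifiers, subpath, download_url, filename)
    )
    """
)

# Package 1: already in the catalog with full metadata.
conn.execute(
    "INSERT INTO package (id, dataspace_id, type, name, version, download_url, filename) "
    "VALUES (1, 3, 'npm', 'parse-json', '4.0.0', ?, ?)",
    (URL, FILENAME),
)
# Package 2: created from the SBOM, which carries no download_url.
conn.execute(
    "INSERT INTO package (id, dataspace_id, type, name, version) "
    "VALUES (2, 3, 'npm', 'parse-json', '4.0.0')"
)

# "Improving" package 2 with the PurlDB values makes it identical to package 1.
try:
    conn.execute(
        "UPDATE package SET download_url = ?, filename = ? WHERE id = 2",
        (URL, FILENAME),
    )
    violation = None
except sqlite3.IntegrityError as exc:
    violation = str(exc)

print(violation)
```

Both rows can coexist only while they differ in at least one constrained column, which is why the import succeeds and the later improvement fails.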

Context

PURL from a CycloneDX SBOM: pkg:npm/parse-json@4.0.0. Note that in that data source context, no download_url is provided.

"There are multiple entries in the PurlDB for this Package." -> 4 results in the PurlDB for this PURL.

https://public.purldb.io/api/packages/?purl=pkg:npm/parse-json@4.0.0

[
        {
            "uuid": "d1eb90f6-5115-49b8-a5f8-782e948cbd3d",
            "filename": "parse-json-4.0.0.tgz",
            "purl": "pkg:npm/parse-json@4.0.0",
            "type": "npm",
            "name": "parse-json",
            "version": "4.0.0",
            "download_url": "https://registry.npmjs.org/parse-json/-/parse-json-4.0.0.tgz",
            "sha1": "be35f5425be1f7f6c747184f98a788cb99477ee0",
            "package_content": null,
            "package_sets": []
        },
        {
            "uuid": "bd3c0519-50d6-47a9-b01a-d3241e0ef641",
            "filename": null,
            "purl": "pkg:npm/parse-json@4.0.0",
            "type": "npm",
            "namespace": "",
            "name": "parse-json",
            "version": "4.0.0",
            "download_url": "https://registry.npmjs.com/parse-json/-/parse-json-4.0.0.tgz",
            "sha1": "be35f5425be1f7f6c747184f98a788cb99477ee0",
            "package_sets": []
        },
        {
            "uuid": "0ff3a534-5570-4289-8560-6e46bb3bad4f",
            "filename": "parse-json-4.0.0.tgz",
            "purl": "pkg:npm/%40types/parse-json@4.0.0",
            "type": "npm",
            "namespace": "@types",
            "name": "parse-json",
            "version": "4.0.0",
            "download_url": "https://registry.npmjs.org/@types/parse-json/-/parse-json-4.0.0.tgz",
            "package_content": null,
            "package_sets": [
                {
                    "uuid": "721da624-9d02-401f-82b6-46142449635d",
                    "packages": [
                        "https://public.purldb.io/api/packages/0ff3a534-5570-4289-8560-6e46bb3bad4f/",
                        "https://public.purldb.io/api/packages/ec74c5e3-676b-489a-a059-4176bdfe6ea8/"
                    ]
                }
            ]
        },
        {
            "uuid": "680f4c25-68b3-4e21-8c16-6db2a822e752",
            "filename": null,
            "purl": "pkg:npm/%40types/parse-json@4.0.0",
            "type": "npm",
            "namespace": "@types",
            "name": "parse-json",
            "version": "4.0.0",
            "download_url": "https://registry.npmjs.com/@types/parse-json/-/parse-json-4.0.0.tgz",
            "package_content": null,
            "package_sets": [
                {
                    "uuid": "c4608309-c909-4a86-bf75-c3363d50bf98",
                    "packages": [
                        "https://public.purldb.io/api/packages/680f4c25-68b3-4e21-8c16-6db2a822e752/",
                        "https://public.purldb.io/api/packages/ec74c5e3-676b-489a-a059-4176bdfe6ea8/"
                    ]
                }
            ]
        }
    ]
  1. Since the provided purl has no namespace, i.e. pkg:npm/parse-json@4.0.0, the entries for pkg:npm/%40types/parse-json@4.0.0 should not be returned.
  2. The default ordering of the results should be by each PURL field. From packageurl.contrib.django.models.PackageURLQuerySetMixin.order_by_package_url:
PACKAGE_URL_FIELDS = ("type", "namespace", "name", "version", "qualifiers", "subpath")
def order_by_package_url(self):
    """Order by Package URL fields."""
    return self.order_by(*PACKAGE_URL_FIELDS)
  3. The 2 entries for pkg:npm/parse-json@4.0.0 are the same: the sha1 is equal, and the only difference is the domain in the download_url, npmjs.org vs npmjs.com. The issue in the DejaCode context using the PurlDB is: which one of those packages should we use? An approach could be to take the common field values from the multiple entries. In this case, as the only known value on the DejaCode side is pkg:npm/parse-json@4.0.0, we could ignore the download_url and hash and import everything that is shared across the PurlDB records.
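
A minimal sketch of points 1 and 3, working on plain dictionaries shaped like the PurlDB results above. The helper names `matches_purl_fields` and `common_field_values` are hypothetical and not part of DejaCode or PurlDB, and the sample records are trimmed to a few of the fields shown in the response.

```python
def matches_purl_fields(entry, type_, namespace, name, version):
    """Return True when the entry has exactly these PURL field values.

    A missing or null namespace is treated as the empty string, so entries
    in the "@types" namespace do not match a PURL without namespace.
    """
    return (
        entry.get("type") == type_
        and (entry.get("namespace") or "") == namespace
        and entry.get("name") == name
        and entry.get("version") == version
    )


def common_field_values(entries, ignored=("download_url", "sha1")):
    """Return the fields that are present with the same value in every entry.

    Fields listed in `ignored` are never returned, even when they agree.
    """
    if not entries:
        return {}
    first, *rest = entries
    common = {}
    for key, value in first.items():
        if key in ignored:
            continue
        if all(key in other and other[key] == value for other in rest):
            common[key] = value
    return common


# Trimmed copies of three of the four PurlDB results quoted in this issue.
entries = [
    {
        "uuid": "d1eb90f6-5115-49b8-a5f8-782e948cbd3d",
        "filename": "parse-json-4.0.0.tgz",
        "type": "npm",
        "name": "parse-json",
        "version": "4.0.0",
        "download_url": "https://registry.npmjs.org/parse-json/-/parse-json-4.0.0.tgz",
        "sha1": "be35f5425be1f7f6c747184f98a788cb99477ee0",
    },
    {
        "uuid": "bd3c0519-50d6-47a9-b01a-d3241e0ef641",
        "filename": None,
        "type": "npm",
        "namespace": "",
        "name": "parse-json",
        "version": "4.0.0",
        "download_url": "https://registry.npmjs.com/parse-json/-/parse-json-4.0.0.tgz",
        "sha1": "be35f5425be1f7f6c747184f98a788cb99477ee0",
    },
    {
        "uuid": "0ff3a534-5570-4289-8560-6e46bb3bad4f",
        "filename": "parse-json-4.0.0.tgz",
        "type": "npm",
        "namespace": "@types",
        "name": "parse-json",
        "version": "4.0.0",
        "download_url": "https://registry.npmjs.org/@types/parse-json/-/parse-json-4.0.0.tgz",
    },
]

exact = [e for e in entries if matches_purl_fields(e, "npm", "", "parse-json", "4.0.0")]
shared = common_field_values(exact)
print(len(exact), shared)
```

With these trimmed records, only the values that agree across the two exact matches survive, so nothing that could collide with an existing package (download URL, filename) is imported.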

TODO (Fix) in DejaCode

  1. The task should not break: any issue occurring during "Improve Packages from PurlDB" should be handled and logged within the improve_packages_from_purldb task.

  2. Using data pulled from PurlDB onto a package should not raise a unique constraint violation.
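
The two TODO items could be combined along the following lines. This is a sketch under stated assumptions, not the code merged in DejaCode: packages are modelled as plain dictionaries, and `apply_without_collision`, `improve_packages`, and the `fetch_purldb_values` callable are hypothetical names introduced for illustration.

```python
import logging

logger = logging.getLogger(__name__)

# Fields covered by the uniqueness constraint quoted in the error message.
UNIQUE_FIELDS = (
    "dataspace_id", "type", "namespace", "name", "version",
    "qualifiers", "subpath", "download_url", "filename",
)


def unique_key(package):
    """Return the tuple of values that the uniqueness constraint compares."""
    return tuple(package.get(field) or "" for field in UNIQUE_FIELDS)


def apply_without_collision(package, new_values, catalog):
    """Apply new_values to package unless that would duplicate another package.

    When the update would collide, the constrained fields are dropped and only
    the remaining (descriptive) fields are applied. Returns the updated names.
    """
    candidate = {**package, **new_values}
    other_keys = {unique_key(other) for other in catalog if other is not package}
    if unique_key(candidate) in other_keys:
        new_values = {k: v for k, v in new_values.items() if k not in UNIQUE_FIELDS}
    package.update(new_values)
    return sorted(new_values)


def improve_packages(packages, fetch_purldb_values, catalog):
    """Improve each package in turn; one failure never aborts the whole run."""
    updated, errors = [], []
    for package in packages:
        try:
            fields = apply_without_collision(package, fetch_purldb_values(package), catalog)
            if fields:
                updated.append((package["name"], fields))
        except Exception as exc:  # broad on purpose: log and continue
            logger.exception("Could not improve package %s", package.get("name"))
            errors.append((package.get("name"), str(exc)))
    return updated, errors
```

Returning both the updated packages and the collected errors lets the caller report a partial success instead of a single failure for the whole task.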

tdruez avatar May 05 '25 11:05 tdruez

@ghsa-retrieval The issue initially reported at https://github.com/aboutcode-org/dejacode/issues/295#issuecomment-2824782627 should now be properly handled by the changes merged in main from https://github.com/aboutcode-org/dejacode/pull/304. Give it a try when you have a minute and let me know :)

tdruez avatar May 07 '25 15:05 tdruez

Hi @tdruez I tried out SBOM import on Staging nexB with interesting results. I started by exporting a CycloneDX 1.6 SBOM from SCIO (see attached) and then imported it into an empty product, which went fine. I identified one of the 306 imported packages that was missing a license assignment but had a PurlDB entry with license info:

purl: pkg:maven/biz.aQute.bnd/biz.aQute.bndlib@3.5.0?classifier=sources
filename: biz.aQute.bndlib-3.5.0-sources.jar
download URL: https://repo1.maven.org/maven2/biz/aQute/bnd/biz.aQute.bndlib/3.5.0/biz.aQute.bndlib-3.5.0-sources.jar

I ran Improve Packages from PurlDB and it updated the license information on the Package successfully; however, it did not update the license on the Product Package entry in my test product inventory. Should it? Is that a bug or rather something that we don't do yet? The user problem is that it is not at all obvious from the Product Inventory view that the Package now has license information, nor is there an obvious way to update the Inventory assignment with that license info. If I edit the product package I can see that the package license is now apache-2.0 and I can update that manually and that's good, but editing each product package with an empty license assignment to do that is not a very good solution. Should this be the subject of another DejaCode issue?

scancodeio_paxd2d_results-2025-05-07-22-58-53.cdx.json.zip

DennisClark avatar May 07 '25 23:05 DennisClark

@tdruez The Improve process that I ran as described in my previous comment also resulted in an error, even though it successfully updated at least one package. Error screenshot attached.

Image

DennisClark avatar May 07 '25 23:05 DennisClark

@tdruez Earliest I can test this will be on Monday. Thank you for your work on this fix.

rogu-beta avatar May 08 '25 08:05 rogu-beta

I have tested the patch and it avoids the constraint violation error I have previously seen. The behavior I am seeing is as follows:

  • Initially I have two packages with the same PURL: one with full metadata, the other with no download URL
  • The package that is lacking information was assigned to a product
  • Improve package from PurlDB was run; no errors are reported and the update is reported as successful
  • The package catalogue shows both packages; the notable difference is that the recently improved one remains without a download URL

I believe that this should fix the issue we had, especially in combination with the patch for https://github.com/aboutcode-org/dejacode/issues/297 avoiding duplicates in the first place.

An alternative solution would have been to remove the package that would become a full duplicate if all data from PurlDB, including the download URL, had been added, and then attach the remaining package to the product. However, the case might be so niche that it is not worth the effort, and there are likely edge cases that I have not considered.
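
For reference, that alternative could look roughly like the sketch below. It is a hypothetical illustration on in-memory data, not a proposal for the actual models: `merge_duplicate`, the `catalog` list, and the `product_packages` assignment records are all assumed names, and the edge cases mentioned above are not handled.

```python
def merge_duplicate(sparse, complete, catalog, product_packages):
    """Re-point product assignments from `sparse` to `complete`, then drop `sparse`.

    Objects are compared by identity so that two packages with equal field
    values are never confused with each other.
    """
    for assignment in product_packages:
        if assignment["package"] is sparse:
            assignment["package"] = complete
    catalog[:] = [package for package in catalog if package is not sparse]


complete = {
    "name": "parse-json",
    "version": "4.0.0",
    "download_url": "https://registry.npmjs.org/parse-json/-/parse-json-4.0.0.tgz",
}
sparse = {"name": "parse-json", "version": "4.0.0"}
catalog = [complete, sparse]
product_packages = [{"product": "example-product", "package": sparse}]

merge_duplicate(sparse, complete, catalog, product_packages)
```

After the call, the catalog holds only the complete package and the product assignment points at it.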

So overall this seems to be working as intended for packages and a good fix. I have not tested the case that @DennisClark was running into. Seems like this needs to be addressed in similar fashion for components by the looks of the error message.

rogu-beta avatar May 14 '25 07:05 rogu-beta

I'm not sure if this is related to the patch or some strangeness surrounding Python packages in PurlDB, but it seems I cannot get DejaCode to assign download URLs. Since this may be entirely unrelated, I will file a separate issue tomorrow.

Edit: It seems unrelated to me, but just for reference: https://github.com/aboutcode-org/dejacode/issues/307

rogu-beta avatar May 14 '25 15:05 rogu-beta