BUG: Improve Package from PurlDB failures
Extracted from https://github.com/aboutcode-org/dejacode/issues/295#issuecomment-2824782627
Running "Improve Package from PurlDB" fails with the following error, since assigning the download_url would make the package a full duplicate of an existing one:
duplicate key value violates unique constraint "component_catalog_packag_dataspace_id_type_namesp_c6620419_uniq"
DETAIL: Key (dataspace_id, type, namespace, name, version, qualifiers, subpath, download_url, filename)=(3, npm, , parse-json, 4.0.0, , , https://registry.npmjs.org/parse-json/-/parse-json-4.0.0.tgz, parse-json-4.0.0.tgz) already exists.
Attempting to enhance the package in the product with data from PurlDB fails because assigning the download URL would violate the uniqueness constraint that covers dataspace_id, type, namespace, name, version, qualifiers, subpath, download_url, and filename.
This is a corner case where the data pulled from the PurlDB can trigger a "unique constraint" violation when applied to a package.
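The collision described above can be reproduced in isolation. The following is a minimal sketch, not DejaCode code: it uses an in-memory SQLite table as a hypothetical, simplified stand-in for the package table (the real constraint also covers dataspace_id, namespace, qualifiers and subpath), and the helper name apply_purldb_data is an assumption made for illustration.

```python
import sqlite3

# Hypothetical, simplified stand-in for the package table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE package (
        id INTEGER PRIMARY KEY,
        type TEXT NOT NULL,
        name TEXT NOT NULL,
        version TEXT NOT NULL,
        download_url TEXT NOT NULL DEFAULT '',
        filename TEXT NOT NULL DEFAULT '',
        UNIQUE (type, name, version, download_url, filename)
    )
    """
)

URL = "https://registry.npmjs.org/parse-json/-/parse-json-4.0.0.tgz"
FILENAME = "parse-json-4.0.0.tgz"

# Package already in the catalog with full metadata.
conn.execute(
    "INSERT INTO package (type, name, version, download_url, filename) "
    "VALUES (?, ?, ?, ?, ?)",
    ("npm", "parse-json", "4.0.0", URL, FILENAME),
)
# Package created from the SBOM: same PURL, no download_url or filename.
cursor = conn.execute(
    "INSERT INTO package (type, name, version) VALUES (?, ?, ?)",
    ("npm", "parse-json", "4.0.0"),
)
sbom_package_id = cursor.lastrowid


def apply_purldb_data(package_id, download_url, filename):
    """Return True if the update was applied, False on a uniqueness conflict."""
    try:
        conn.execute(
            "UPDATE package SET download_url = ?, filename = ? WHERE id = ?",
            (download_url, filename, package_id),
        )
    except sqlite3.IntegrityError:
        # The failed statement is not applied; the row keeps its old values.
        return False
    return True


applied = apply_purldb_data(sbom_package_id, URL, FILENAME)
print(applied)  # False: the update would create a full duplicate
```

Both rows can coexist only as long as the SBOM-derived one keeps an empty download_url and filename; filling them in makes the two rows identical on every constrained column.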
Context
PURL from a CycloneDX SBOM: pkg:npm/parse-json@4.0.0
Note that in that data source context, no download_url is provided.
"There are multiple entries in the PurlDB for this Package." -> 4 results in the PurlDB for this PURL.
https://public.purldb.io/api/packages/?purl=pkg:npm/parse-json@4.0.0
[
{
"uuid": "d1eb90f6-5115-49b8-a5f8-782e948cbd3d",
"filename": "parse-json-4.0.0.tgz",
"purl": "pkg:npm/parse-json@4.0.0",
"type": "npm",
"name": "parse-json",
"version": "4.0.0",
"download_url": "https://registry.npmjs.org/parse-json/-/parse-json-4.0.0.tgz",
"sha1": "be35f5425be1f7f6c747184f98a788cb99477ee0",
"package_content": null,
"package_sets": []
},
{
"uuid": "bd3c0519-50d6-47a9-b01a-d3241e0ef641",
"filename": null,
"purl": "pkg:npm/parse-json@4.0.0",
"type": "npm",
"namespace": "",
"name": "parse-json",
"version": "4.0.0",
"download_url": "https://registry.npmjs.com/parse-json/-/parse-json-4.0.0.tgz",
"sha1": "be35f5425be1f7f6c747184f98a788cb99477ee0",
"package_sets": []
},
{
"uuid": "0ff3a534-5570-4289-8560-6e46bb3bad4f",
"filename": "parse-json-4.0.0.tgz",
"purl": "pkg:npm/%40types/parse-json@4.0.0",
"type": "npm",
"namespace": "@types",
"name": "parse-json",
"version": "4.0.0",
"download_url": "https://registry.npmjs.org/@types/parse-json/-/parse-json-4.0.0.tgz",
"package_content": null,
"package_sets": [
{
"uuid": "721da624-9d02-401f-82b6-46142449635d",
"packages": [
"https://public.purldb.io/api/packages/0ff3a534-5570-4289-8560-6e46bb3bad4f/",
"https://public.purldb.io/api/packages/ec74c5e3-676b-489a-a059-4176bdfe6ea8/"
]
}
]
},
{
"uuid": "680f4c25-68b3-4e21-8c16-6db2a822e752",
"filename": null,
"purl": "pkg:npm/%40types/parse-json@4.0.0",
"type": "npm",
"namespace": "@types",
"name": "parse-json",
"version": "4.0.0",
"download_url": "https://registry.npmjs.com/@types/parse-json/-/parse-json-4.0.0.tgz",
"package_content": null,
"package_sets": [
{
"uuid": "c4608309-c909-4a86-bf75-c3363d50bf98",
"packages": [
"https://public.purldb.io/api/packages/680f4c25-68b3-4e21-8c16-6db2a822e752/",
"https://public.purldb.io/api/packages/ec74c5e3-676b-489a-a059-4176bdfe6ea8/"
]
}
]
}
]
- Since the provided purl has no namespace, i.e. pkg:npm/parse-json@4.0.0, the entries for pkg:npm/%40types/parse-json@4.0.0 should not be returned.
- The default ordering of the results should be by each PURL field. From packageurl.contrib.django.models.PackageURLQuerySetMixin.order_by_package_url:
PACKAGE_URL_FIELDS = ("type", "namespace", "name", "version", "qualifiers", "subpath")
def order_by_package_url(self):
    """Order by Package URL fields."""
    return self.order_by(*PACKAGE_URL_FIELDS)
- The 2 entries for pkg:npm/parse-json@4.0.0 are the same: the sha1 is equal, and the only difference is the domain in the download_url, npmjs.org vs npmjs.com. The issue in the DejaCode context using the PurlDB is: which one of those packages should we use? An approach could be to take the common field values from the multiple entries. In this case, as the only known value on the DejaCode side is pkg:npm/parse-json@4.0.0, we could ignore the download_url and hash and import everything that is shared across the PurlDB records.
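The first and last points above can be sketched in plain Python. This is only an illustration under stated assumptions: exact_matches and common_values are hypothetical helper names, not part of DejaCode, PurlDB, or packageurl-python, and the entries are abbreviated copies of the API results quoted earlier.

```python
def exact_matches(entries, **purl_fields):
    """Keep entries whose PURL fields equal the requested values.

    A missing or null field on an entry is treated as an empty string,
    so an entry without a namespace matches namespace="".
    """
    return [
        entry
        for entry in entries
        if all(
            (entry.get(field) or "") == value
            for field, value in purl_fields.items()
        )
    ]


def common_values(entries, ignore=("uuid",)):
    """Return the fields (and values) on which all entries agree."""
    if not entries:
        return {}
    first, *rest = entries
    return {
        field: value
        for field, value in first.items()
        if field not in ignore
        and all(field in other and other[field] == value for other in rest)
    }


SHA1 = "be35f5425be1f7f6c747184f98a788cb99477ee0"
entries = [
    {
        "uuid": "d1eb90f6-5115-49b8-a5f8-782e948cbd3d",
        "filename": "parse-json-4.0.0.tgz",
        "type": "npm", "name": "parse-json", "version": "4.0.0",
        "download_url": "https://registry.npmjs.org/parse-json/-/parse-json-4.0.0.tgz",
        "sha1": SHA1,
    },
    {
        "uuid": "bd3c0519-50d6-47a9-b01a-d3241e0ef641",
        "filename": None,
        "type": "npm", "namespace": "", "name": "parse-json", "version": "4.0.0",
        "download_url": "https://registry.npmjs.com/parse-json/-/parse-json-4.0.0.tgz",
        "sha1": SHA1,
    },
    {
        "uuid": "0ff3a534-5570-4289-8560-6e46bb3bad4f",
        "filename": "parse-json-4.0.0.tgz",
        "type": "npm", "namespace": "@types", "name": "parse-json", "version": "4.0.0",
        "download_url": "https://registry.npmjs.org/@types/parse-json/-/parse-json-4.0.0.tgz",
    },
]

matches = exact_matches(
    entries, type="npm", namespace="", name="parse-json", version="4.0.0"
)
shared = common_values(matches)
print(len(matches))    # 2: the @types entry is filtered out
print(sorted(shared))  # ['name', 'sha1', 'type', 'version']
```

With this rule the download_url and filename drop out on their own because the two records disagree on them. The sha1 is kept here because both records agree on it; to ignore the hash as suggested above, "sha1" could be added to the ignore tuple.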
TODO (Fix) in DejaCode
- The task should not break: any issue happening during "Improve Packages from PurlDB" should be properly handled and logged during the improve_packages_from_purldb task.
- Using data pulled from PurlDB onto a package should not raise a unique constraint violation.
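One way to satisfy both points is to check for a collision before assigning values, and to log and skip the conflicting fields instead of raising. The sketch below works on plain dictionaries and is an assumption-laden illustration, not the change made in DejaCode: improve_package, unique_key and the logger name are hypothetical, and dataspace_id is left out for brevity.

```python
import logging

logger = logging.getLogger("improve_packages_sketch")

PURL_FIELDS = ("type", "namespace", "name", "version", "qualifiers", "subpath")
# Fields covered by the unique constraint quoted in the error message,
# minus dataspace_id, which this sketch leaves out.
UNIQUE_FIELDS = PURL_FIELDS + ("download_url", "filename")


def unique_key(package):
    """Build the tuple compared for uniqueness; missing values become ''."""
    return tuple(package.get(field) or "" for field in UNIQUE_FIELDS)


def improve_package(package, purldb_entry, catalog):
    """Copy PurlDB values into empty package fields; return the updated field names.

    PURL fields are never changed. If the new values would make `package`
    a full duplicate of another catalog entry, download_url and filename
    are skipped and a warning is logged instead of raising.
    """
    updates = {
        field: value
        for field, value in purldb_entry.items()
        if field not in PURL_FIELDS and value and not package.get(field)
    }
    other_keys = {unique_key(p) for p in catalog if p is not package}
    if unique_key({**package, **updates}) in other_keys:
        for field in ("download_url", "filename"):
            if updates.pop(field, None):
                logger.warning(
                    "Skipping %s for %s: would duplicate an existing package",
                    field,
                    package["name"],
                )
    package.update(updates)
    return sorted(updates)


URL = "https://registry.npmjs.org/parse-json/-/parse-json-4.0.0.tgz"
full = {
    "type": "npm", "name": "parse-json", "version": "4.0.0",
    "download_url": URL, "filename": "parse-json-4.0.0.tgz",
}
from_sbom = {"type": "npm", "name": "parse-json", "version": "4.0.0"}
catalog = [full, from_sbom]
entry = {
    "download_url": URL,
    "filename": "parse-json-4.0.0.tgz",
    "sha1": "be35f5425be1f7f6c747184f98a788cb99477ee0",
}

updated = improve_package(from_sbom, entry, catalog)
print(updated)  # ['sha1']
```

In this example the SBOM-derived package receives the sha1 but stays without a download URL, so the two catalog entries remain distinct.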
@ghsa-retrieval The issue initially reported at https://github.com/aboutcode-org/dejacode/issues/295#issuecomment-2824782627 should now be properly handled by the changes merged in main from https://github.com/aboutcode-org/dejacode/pull/304
Give it a try when you have a minute and let me know :)
Hi @tdruez I tried out SBOM import on Staging nexB with interesting results. I started by exporting a CycloneDX 1.6 SBOM from SCIO (see attached) and then imported it into an empty product, which went fine. I identified one of the 306 imported packages that was missing a license assignment but had a PurlDB entry with license info:
purl: pkg:maven/biz.aQute.bnd/biz.aQute.bndlib@3.5.0?classifier=sources
filename: biz.aQute.bndlib-3.5.0-sources.jar
download URL: https://repo1.maven.org/maven2/biz/aQute/bnd/biz.aQute.bndlib/3.5.0/biz.aQute.bndlib-3.5.0-sources.jar
I ran Improve Packages from PurlDB and it updated the license information on the Package successfully; however, it did not update the license on the Product Package entry in my test product inventory. Should it? Is that a bug or rather something that we don't do yet? The user problem is that it is not at all obvious from the Product Inventory view that the Package now has license information, nor is there an obvious way to update the Inventory assignment with that license info. If I edit the product package I can see that the package license is now apache-2.0 and I can update that manually and that's good, but editing each product package with an empty license assignment to do that is not a very good solution. Should this be the subject of another DejaCode issue?
@tdruez The Improve process that I ran as described in my previous comment also resulted in an error, even though it successfully updated at least one package. Error screenshot attached.
@tdruez Earliest I can test this will be on Monday. Thank you for your work on this fix.
I have tested the patch and it avoids the constraint violation error I have previously seen. The behavior I am seeing is as follows:
- Initially I have two packages with the same PURL, one with full metadata while the other has no download URL
- The package that is lacking information was assigned to a product
- Improve package from PurlDB was run; no errors are reported and the update is reported as successful
- The package catalogue shows both packages; the notable difference is that the recently improved one remains without a download URL
I believe that this should fix the issue we had, especially in combination with the patch for https://github.com/aboutcode-org/dejacode/issues/297 avoiding duplicates in the first place.
An alternative solution would have been to remove the package that would turn out to be a full duplicate if all data from PurlDB, including the download URL, had been added, and then attach the remaining package to the product. However, the case might be so niche that it is not worth the effort, and there are likely edge cases that I have not considered.
So overall this seems to be working as intended for packages and a good fix. I have not tested the case that @DennisClark was running into. Seems like this needs to be addressed in similar fashion for components by the looks of the error message.
I'm not sure if this is related to the patch or some strangeness surrounding Python packages in PurlDB, but it seems I cannot get DejaCode to assign download URLs. Since this may be entirely unrelated, I will file a separate issue tomorrow.
Edit: It seems unrelated to me, but just for reference: https://github.com/aboutcode-org/dejacode/issues/307