
BUG: SBOM import does not trigger scan of packages

Open rogu-beta opened this issue 1 year ago • 18 comments

Describe the bug

On a self-hosted instance of DejaCode, it appears that the current main branch of DejaCode does not scan individual packages after loading the SBOM. This feature seems to work on the public demo instance.

Tested with:

  • 1c4fb5e1a0e2c61c26c874fb8afb29cf474f5bf5
  • 4fcbe39b502438a7e7a5faa957c539f1f03f2ade

To Reproduce

Configure dataspace:

  1. In "Application Process Settings" activate "Enable package scanning"
  2. In "Application Process Settings" activate "Update packages automatically from scan"

Steps to reproduce the behavior:

  1. Create a product
  2. Open the product
  3. Click on the "Scan" dropdown and select "Load Packages from SBOMs"
  4. Select an SBOM of your choice (e.g. sbom-1-4.cdx.json)
  5. Enable "Update existing packages with discovered packages data"
  6. Enable "Scan all packages of this product post-import"

Additional information which may or may not be relevant:

  • I renamed and edited the nexB dataspace for this (which also locks me out of creating new dataspace, not sure if that is expected?)
  • "Enable PurlDB access" is deactivated
  • "Enable VulnerableCodeDB access" is deactivated
  • The PurlDB URL is still in the configuration

Expected behavior

After loading the packages through the load_sbom pipeline in ScanCode.io, each individual package should be analyzed with a scan_single_package pipeline and the results added to the respective packages in DejaCode.

Screenshots

No screenshots, as the error is that the expected actions do not happen.

Context (OS, Browser, Device, etc.): Firefox

rogu-beta avatar May 15 '24 16:05 rogu-beta

@ghsa-retrieval Could you confirm that the ScanCode.io integration is properly configured on your DejaCode instance? Click on your username in the top right corner to display the dropdown menu and select "Integration Status" or directly use this URL /integrations_status/ From this view, we can make sure that ScanCode.io is "Configured" and "Available".


I renamed and edited the nexB dataspace for this (which also locks me out of creating new dataspace, not sure if that is expected?)

You need to update the REFERENCE_DATASPACE setting https://dejacode.readthedocs.io/en/latest/application-settings.html#reference-dataspace to match the new name, so that your Dataspace and related users have those permissions.

tdruez avatar May 16 '24 03:05 tdruez

@tdruez Yes, it shows both "Configured" and "Available" with a green checkmark. The load_sbom pipeline works (with limitations) and packages are being added to the project, but they are not scanned individually to get detailed license and copyright information. The scanning for those details also works if I add a single package with "Add Package" and a URL to the package's archive. So some parts of the integration are definitely working.

You need to update the REFERENCE_DATASPACE setting https://dejacode.readthedocs.io/en/latest/application-settings.html#reference-dataspace to match the new name, so that your Dataspace and related users have those permissions.

Makes sense, that was just a bit unexpected when configuring it through the UI.

rogu-beta avatar May 16 '24 06:05 rogu-beta

The same issue seems to happen when using "Scan" > "Scan All Packages". The UI reports that the job has been successfully submitted, but the scans never appear in the scan list, nor does ScanCode.io list any new projects. Hence, this might not be related to the SBOM import itself.

2024-05-16-dejacode-scan-all-packages

rogu-beta avatar May 16 '24 07:05 rogu-beta

@ghsa-retrieval Thanks for the details. My hunch is that the problem may be located in the async task that is responsible for submitting the scan requests. Could you check the worker logs for anything that looks like an error, using: docker compose logs worker

tdruez avatar May 16 '24 11:05 tdruez

@tdruez Unfortunately no errors are being reported. It looks like DejaCode thinks it has successfully submitted a job, but the ScanCode.io log does not indicate that it is receiving anything nor that it runs into errors.

Do you have any other ideas where I should look?

2024-05-17-dejacode-log-censored 2024-05-17-scancode-log-censored

rogu-beta avatar May 17 '24 07:05 rogu-beta

@ghsa-retrieval Thanks for the log, that's helpful. We can see that the task dje.tasks.scancodeio_submit_scan is properly called and executed but no URIs are provided:

INFO Entering scancodeio submit scan task with uris=[] ...

My guess is that none of your packages have a download_url defined. At the moment, a download URL is required to fetch and scan a package from DejaCode.

Some download URLs could be generated from the Package URL using the purl2url library, but only a few package types are supported.

As a side note, the UI should be improved to warn you about the lack of a Download URL instead of displaying a success message.
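To illustrate the behaviour described above, here is a minimal sketch of how a submission step could separate scannable packages from those lacking a download URL and report a warning rather than a success. The `Package` class, `collect_scan_uris`, and `submission_message` are hypothetical names made up for this example; they are not DejaCode's actual code.

```python
from dataclasses import dataclass


@dataclass
class Package:
    # Hypothetical stand-in for a package record; not DejaCode's model.
    name: str
    download_url: str = ""


def collect_scan_uris(packages):
    """Return (uris, skipped): URLs to scan and names lacking a download URL."""
    uris = [p.download_url for p in packages if p.download_url]
    skipped = [p.name for p in packages if not p.download_url]
    return uris, skipped


def submission_message(packages):
    """Build a user-facing message that does not claim success on an empty list."""
    uris, skipped = collect_scan_uris(packages)
    if not uris:
        return "warning: no package has a Download URL, nothing was submitted"
    if skipped:
        return (
            f"submitted {len(uris)} scan(s); "
            f"skipped {len(skipped)} without a Download URL"
        )
    return f"submitted {len(uris)} scan(s)"
```

With an input where every package has an empty `download_url`, `collect_scan_uris` returns an empty URI list, which corresponds to the `uris=[]` seen in the log line above.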

tdruez avatar May 17 '24 07:05 tdruez

It seems that you're right, the imported packages from the SBOM only have the "Package URL" and "Inferred URL" populated, but not "Download URL". The SBOM that was uploaded has a purl and, under properties, a ResolvedUrl. It's the same SBOM as in https://github.com/nexB/scancode.io/issues/1230

[...]
"components": [
        {
            "group": "",
            "name": "bootstrap",
            "version": "5.3.3",
            "hashes": [
                {
                    "alg": "SHA-512",
                    "content": "f072c2756832a0c82e48ef68f9a1fe8ae67e6a1b7e9b35b4bb71c833356eed2aeba6fec4041c539eb165482b24c1d635f843854129bbb8c2613501e474f7268e"
                }
            ],
            "purl": "pkg:npm/bootstrap@5.3.3",
            "type": "library",
            "bom-ref": "pkg:npm/bootstrap@5.3.3",
            "evidence": {
                "identity": {
                    "field": "purl",
                    "confidence": 1,
                    "methods": [
                        {
                            "technique": "manifest-analysis",
                            "confidence": 1,
                            "value": "/builds/beta/dso/tests-and-demos/dejacode-transitive-test/package-lock.json"
                        }
                    ]
                }
            },
            "properties": [
                {
                    "name": "SrcFile",
                    "value": "/builds/beta/dso/tests-and-demos/dejacode-transitive-test/package-lock.json"
                },
                {
                    "name": "ResolvedUrl",
                    "value": "https://registry.npmjs.org/bootstrap/-/bootstrap-5.3.3.tgz"
                },
                {
                    "name": "LocalNodeModulesPath",
                    "value": "node_modules/bootstrap"
                }
            ]
        },
[...]

Shouldn't that be working though? Where does DejaCode expect the URL to come from?

rogu-beta avatar May 17 '24 07:05 rogu-beta

@ghsa-retrieval Unfortunately, the CycloneDX format does not include a clear field to store a download URL for SBOM "components".

In ScanCode.io/DejaCode the download_url field is exported in the CycloneDX SBOM as aboutcode:download_url using custom properties defined at https://github.com/nexB/aboutcode-cyclonedx-taxonomy, see also https://github.com/CycloneDX/cyclonedx-property-taxonomy

cdxgen seems to be using the same properties approach with the ResolvedUrl property. I couldn't find much documentation about it on their repo though.

It would be interesting to have the list of properties generated by cdxgen to implement a mapping for importing those values during the CycloneDX ScanCode.io resolution.
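As a rough sketch of what such a mapping could look like, the snippet below reads a component's custom properties and returns the first usable download URL. The two property names come from this thread (aboutcode:download_url and cdxgen's ResolvedUrl); the function name `download_url_from_properties` and the lookup order are assumptions for illustration, not the ScanCode.io implementation.

```python
# Property names checked in order. Both appear in this thread; the ordering
# (AboutCode taxonomy first, cdxgen second) is an assumption of this sketch.
DOWNLOAD_URL_PROPERTIES = ("aboutcode:download_url", "ResolvedUrl")


def download_url_from_properties(component):
    """Return a download URL found in a CycloneDX component's properties, or ""."""
    properties = {
        prop.get("name"): prop.get("value")
        for prop in component.get("properties", [])
    }
    for name in DOWNLOAD_URL_PROPERTIES:
        value = properties.get(name)
        if value:
            return value
    return ""


# Trimmed version of the component from the SBOM excerpt earlier in the thread.
component = {
    "name": "bootstrap",
    "version": "5.3.3",
    "properties": [
        {"name": "SrcFile", "value": "package-lock.json"},
        {
            "name": "ResolvedUrl",
            "value": "https://registry.npmjs.org/bootstrap/-/bootstrap-5.3.3.tgz",
        },
    ],
}
url = download_url_from_properties(component)
```

Only data already present in the SBOM is used here; a component without either property yields an empty string and would remain unscanned.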

tdruez avatar May 17 '24 08:05 tdruez

@tdruez There does not appear to be any documentation as far as I'm aware. The properties can be found in https://github.com/CycloneDX/cdxgen/blob/4a27933ee55914afecbd465ba4ca9a1da62a9cc1/utils.js#L818 being added through pkg.properties and apkg.properties.

Wouldn't it make more sense to derive the URL from the PURL though? I thought that was already uniquely identifying assuming that the PURL is for a package manager such as maven, npm, pypi and so on. That would be a general solution rather than trying to parse the custom properties of a particular SBOM generation tool.

Any solution is very much appreciated though!

rogu-beta avatar May 17 '24 09:05 rogu-beta

Wouldn't it make more sense to derive the URL from the PURL though?

Maybe, but in the context of loading an SBOM, generating data that is not present in the SBOM may not always be wanted. Some degree of data integrity between the input and the imported data is likely expected. This will require more discussion though.

Any solution is very much appreciated though!

I think in the very short term, we can add support for the ResolvedUrl property.

tdruez avatar May 17 '24 11:05 tdruez

Maybe, but in the context of loading an SBOM, generating data that is not present in the SBOM may not always be wanted. Some degree of data integrity between the input and the imported data is likely expected. This will require more discussion though.

That is a valid point. The suggested approach would ensure that only information already present in the SBOM would be used.

I think in the very short term, we can add support for the ResolvedUrl property.

That would be great!

rogu-beta avatar May 17 '24 11:05 rogu-beta

@ghsa-retrieval Support for ResolvedUrl property added on the ScanCode.io side in https://github.com/nexB/scancode.io/pull/1241

You can update your ScanCode.io instance (no changes on the DejaCode side) and try "Load Packages from SBOMs" + "Scan all packages of this product post-import" again.

Keep in mind that only the packages that end up with a value for the download_url field will be scanned.

tdruez avatar May 17 '24 12:05 tdruez

@tdruez Works like a charm.

rogu-beta avatar May 17 '24 15:05 rogu-beta

@ghsa-retrieval re:

Wouldn't it make more sense to derive the URL from the PURL though? I thought that was already uniquely identifying assuming that the PURL is for a package manager such as maven, npm, pypi and so on. That would be a general solution rather than trying to parse the custom properties of a particular SBOM generation tool.

There is code:

  • in the packageurl Python library to infer URLs from the PURL.
  • in the fetchcode library to also infer and validate URLs
  • in scancode-toolkit to infer URLs for a package https://github.com/nexB/scancode-toolkit/tree/develop/src/packagedcode
  • in PurlDB to mostly do the same given (or collect that from API calls)

So there are many ways, and what we likely need here is an explicit action to call the PurlDB to "enrich" an SBOM with these URLs... or do this in ScanCode.io.... a little design needed. https://github.com/nexB/dejacode/issues/45
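A minimal sketch of what an explicit "enrich" step might look like, restricted to unscoped npm packages and using only the standard library. It is not the purl2url, fetchcode, or PurlDB implementation: the helper names, the `download_url_inferred` flag, and the tarball URL pattern (taken from the ResolvedUrl value in the SBOM excerpt earlier in this thread) are assumptions of this example. Existing download URLs are left untouched so that data from the SBOM is never overwritten.

```python
def npm_download_url(purl):
    """Derive a registry tarball URL for an unscoped, versioned npm PURL, else None."""
    prefix = "pkg:npm/"
    if not purl.startswith(prefix):
        return None
    # Drop optional qualifiers ("?...") and subpath ("#...").
    remainder = purl[len(prefix):].split("?")[0].split("#")[0]
    name, sep, version = remainder.partition("@")
    if not sep or not name or not version or "/" in name:
        return None  # scoped or unversioned packages are out of scope here
    return f"https://registry.npmjs.org/{name}/-/{name}-{version}.tgz"


def enrich(package):
    """Return a copy of the package dict with download_url filled in if missing."""
    if package.get("download_url"):
        return dict(package)  # keep what the SBOM provided
    derived = npm_download_url(package.get("purl", ""))
    if derived is None:
        return dict(package)
    # Flag inferred values so they can be told apart from imported data.
    return {**package, "download_url": derived, "download_url_inferred": True}


enriched = enrich({"purl": "pkg:npm/bootstrap@5.3.3", "download_url": ""})
```

Marking the value as inferred is one way to address the data-integrity concern raised earlier, since imported and generated data stay distinguishable.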

pombredanne avatar May 19 '24 21:05 pombredanne

@pombredanne that is what I suspected. From an outside perspective it would make sense to me if this feature were in ScanCode.io, given that we already analyze the SBOM and try to do the same for underlying packages there.

rogu-beta avatar May 21 '24 06:05 rogu-beta

Note progress on deriving a download URL from a PURL when adding a package: https://github.com/nexB/dejacode/issues/131

DennisClark avatar Jun 12 '24 21:06 DennisClark

@DennisClark We are currently having trouble putting DejaCode into production use, because PURLs do not seem to be resolved for packages created through an SBOM import, unlike packages that have been added manually. Apologies if my requests get annoying, but this is currently quite a serious issue for us.

My current understanding is that ScanCode.io will not resolve PURLs to URLs, unless we encounter the very particular case where the additional property ResolvedUrl is found (https://github.com/aboutcode-org/scancode.io/blob/768c42877dba64032072a3e53ddc49b9df9e327a/scanpipe/pipes/cyclonedx.py#L101). Hence, if DejaCode triggers the analysis of an SBOM to gather packages and dependencies, none of the imported packages will have a download_url assigned unless ResolvedUrl is present. DejaCode does not attempt to resolve the PURL to a URL either; that code appears to be limited to packages added through views, if I'm understanding the code correctly. The function purl2url.get_download_url is only called here:

  • https://github.com/aboutcode-org/dejacode/blob/d4aa38356ab021f79a8b81935f00a686991726ae/component_catalog/views.py#L1827
  • https://github.com/aboutcode-org/dejacode/blob/d4aa38356ab021f79a8b81935f00a686991726ae/component_catalog/models.py#L2394

The consequence is that SBOM imports with subsequent package scans will always fail, as DejaCode cannot supply the download_url values to ScanCode.io that are required for scan_single_package pipelines.

My questions therefore are:

  • Can you add the same resolution logic that you added for manually created packages to packages created through a load_sbom pipeline in ScanCode.io?
  • If this is not possible, would DejaCode's scan of packages work properly if we had a PurlDB instance running?
    • Would both DejaCode and ScanCode.io have to be connected to it?

rogu-beta avatar Mar 10 '25 09:03 rogu-beta

@tdruez If I'm not mistaken this is partially the underlying issue in https://github.com/aboutcode-org/dejacode/issues/258 as well, although purl2url would not resolve this as it currently does not support Maven packages (https://github.com/package-url/packageurl-python/issues/179). However, PurlDB might still be an option then.

rogu-beta avatar Mar 10 '25 10:03 rogu-beta