Merge FileDigests with ones from SBOM cataloger

dsseng opened this issue 6 months ago • 7 comments

What would you like to be added: When the sbom-cataloger finds a new SBOM in the scanned artifacts, append its FileDigests content to the respective section of the SBOM being generated. Perhaps, to avoid collisions, paths should be composed of the source SBOM path and the file path itself.

Why is this needed: If a software unit is composed of multiple components, each tracking its dependencies and files separately, it would be nice to be able to see the files of all components in the final SBOM.

Additional context: An alternative would be to publish SBOMs of the parts for cases when one needs to know those files. Not very practical, though.

Also, I found out it is not possible to implement alternative file catalogers. Am I right that there is currently no better method than adding keys to the sbomObject.Artifacts.FileDigests map after SBOM generation? For context: I have file hashes at hand, but not the files themselves, during creation of this SBOM.

dsseng · Jun 13 '25 17:06

Thanks for the issue @dsseng!

Let's start from the top and work our way down. For those coming to this issue the sbomObject.Artifacts.FileDigests can be found here: https://github.com/anchore/syft/blob/181e180284ea0ed2458e92f06279cd4184ce2053/syft/sbom/sbom.go#L15-L25

The first thing I want to highlight to orient this issue is where we support Decoding this information from the other SBOM formats:

CycloneDX decoding

https://github.com/anchore/syft/blob/0bfda2c514118073afc83aa57bce0c45deb06c4b/syft/format/internal/cyclonedxutil/helpers/decoder.go#L15-L38

Given the above code, when reading a CycloneDX document, it looks like we'd need some updates to write to the .Artifacts.FileDigests field based on the relevant data from the original CycloneDX document.
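As a rough sketch of that decode direction (a sketch only, assuming cyclonedx-go types and the file.NewCoordinates helper that appears later in this thread; this is not the actual decoder):

	package sketch
	
	import (
		"strings"
	
		cyclonedx "github.com/CycloneDX/cyclonedx-go"
		"github.com/anchore/syft/syft/file"
	)
	
	// collect hashes from file-typed components into syft's FileDigests map
	func collectCDXFileDigests(bom *cyclonedx.BOM, out map[file.Coordinates][]file.Digest) {
		if bom.Components == nil {
			return
		}
		for _, c := range *bom.Components {
			if c.Type != cyclonedx.ComponentTypeFile || c.Hashes == nil {
				continue
			}
			// how to namespace these to avoid collisions is the open question below
			coords := file.NewCoordinates(c.Name, "")
			for _, h := range *c.Hashes {
				// CDX algorithm names (e.g. "SHA-256") normalized to syft's style ("sha256")
				alg := strings.ReplaceAll(strings.ToLower(string(h.Algorithm)), "-", "")
				out[coords] = append(out[coords], file.Digest{
					Algorithm: alg,
					Value:     h.Value,
				})
			}
		}
	}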

SPDX decoding

https://github.com/anchore/syft/blob/0bfda2c514118073afc83aa57bce0c45deb06c4b/syft/format/common/spdxhelpers/to_syft_model.go#L32-L56

SPDX seems to be in the same state where we would need to decode from the document in the same way we encode files/digest here: https://github.com/anchore/syft/blob/0bfda2c514118073afc83aa57bce0c45deb06c4b/syft/format/common/spdxhelpers/to_format_model.go#L614-L620
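A symmetric sketch for SPDX (assuming spdx/tools-golang v2.3 types; again, not the actual decoder):

	package sketch
	
	import (
		"strings"
	
		"github.com/anchore/syft/syft/file"
		"github.com/spdx/tools-golang/spdx/v2/v2_3"
	)
	
	// collect per-file checksums from an SPDX document into syft's FileDigests map
	func collectSPDXFileDigests(doc *v2_3.Document, out map[file.Coordinates][]file.Digest) {
		for _, f := range doc.Files {
			coords := file.NewCoordinates(f.FileName, "") // collision handling TBD
			for _, c := range f.Checksums {
				out[coords] = append(out[coords], file.Digest{
					Algorithm: strings.ToLower(string(c.Algorithm)),
					Value:     c.Value,
				})
			}
		}
	}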

These gaps affect the SBOM cataloger's ability to Decode the digest/file information from SPDX and CycloneDX into the internal SBOM object.

The next thing I want to bring up is the representation of these files/digests in the larger document.

Similar issues that have touched on alternative file representations:

  • https://github.com/anchore/syft/issues/2211
  • https://github.com/anchore/syft/issues/3213

This problem squarely falls under the category outlined by @wagoodman in the above issues:

We have multiple issues that want to be able to search within a small space, but reference things outside of that space:

We are scanning a container image, but referencing file paths and information that are no longer within that space. They may be on the build server the detected SBOM came from, or some other environment that syft doesn't have access to beyond the detected document.

Some questions to answer here:

  • How are those unscanned paths (from the detected SBOM in the sbom-cataloger) going to be represented in the larger composite document after we merge the results found from the SBOM cataloger?
  • How are those unscanned paths different from paths that might be on disk, but not within the original scan target?
  • Is there a difference between a digest we computed and validated for some contents on disk vs. one read from metadata?

A final question I have: what if the SBOM documents have different concepts of package vs. file digests?

An example of this is where we surface package DB metadata that describes digests for installed packages for given package distributions. This is DIFFERENT from syft running the digest algorithm itself and cataloging the files.

The first exists on Package.Metadata; the other is a computed value from the Files cataloged.
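To make that distinction concrete, here's a fragment-level sketch (the RPM metadata field names are from memory and only illustrative):

	// digests claimed by package metadata, e.g. an RPM DB entry: recorded by the
	// package manager and surfaced by syft, but not computed or verified by syft
	if m, ok := p.Metadata.(pkg.RpmDBEntry); ok {
		for _, rec := range m.Files {
			_ = rec.Digest
		}
	}
	
	// vs. digests syft computed itself from file contents within the scan target
	computed := s.Artifacts.FileDigests[coords]
	_ = computed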

Which sections of the cdx/spdx SBOM are you most interested in fitting into the .Artifacts.FileDigests section?

Some spare thoughts, once we consider the above questions and orient around what we're trying to represent:

The write to the FileDigests field is currently a result of the work done by this cataloger here (I truncated the catalog method in this issue): https://github.com/anchore/syft/blob/181e180284ea0ed2458e92f06279cd4184ce2053/syft/file/cataloger/filedigest/cataloger.go#L26-L52

There is a relevant v2.0 milestone for syft where we want to expand the different cataloger abilities to write to the final SBOM object. https://github.com/anchore/syft/issues/3263#issuecomment-2555812450

I mention the above because it shows how the signature for the package SBOM cataloger is a little limiting in its current form: https://github.com/anchore/syft/blob/181e180284ea0ed2458e92f06279cd4184ce2053/syft/pkg/cataloger/sbom/cataloger.go#L40

Currently we can only return ([]pkg.Package, []artifact.Relationship, error).

If we do any kind of work on this issue after discussing the previous points I highlighted, the solution space would be within the above code.
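To make the limitation concrete, here's the current interface shape (paraphrased) next to one hypothetical direction; the sink interface below is made up for illustration, not an actual proposal:

	// paraphrased: what a pkg cataloger can return today
	type Cataloger interface {
		Name() string
		Catalog(ctx context.Context, resolver file.Resolver) ([]pkg.Package, []artifact.Relationship, error)
	}
	
	// hypothetical: a sink that catalogers could also write file digests into
	type sbomSink interface {
		AddPackages(pkgs ...pkg.Package)
		AddRelationships(rels ...artifact.Relationship)
		AddFileDigests(digests map[file.Coordinates][]file.Digest)
	}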

Summary
  • we need to know where the data is that we want from the respective formats
  • we need to discuss how to represent the external files/digests (discovered from the sbom cataloger) separately from the ones syft can validate as a part of its scan target
  • we need to discuss how those files also differ from on disk files outside the scan target (https://github.com/anchore/syft/issues/3213)
  • we need to have encoding/decoding figured out so that round-tripping the FileDigests section between documents (from -> to and to -> from) is symmetrical

spiffcs · Jun 16 '25 17:06

How are those unscanned paths (from the detected SBOM in the sbom-cataloger) going to be represented in the larger composite document after we merge the results found from the SBOM cataloger?

Just move those records, potentially adjusting the file locator to ensure no collisions; roughly as sketched below.
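Something like this, reusing the FileSystemID field of the coordinates as the namespace (just one option; the variable names are made up for illustration):

	// namespace merged records by the path of the SBOM document they came from
	coords := file.NewCoordinates(innerFilePath, "sbom:"+detectedSBOMPath)
	s.Artifacts.FileDigests[coords] = digestsFromDetectedSBOM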

Is there a difference between a digest we computed and validated for some contents on disk vs some read from metadata?

There shouldn't be. My scenario: I download files (in the build process) by their sha256 and sha512 hashes, so when generating an SBOM for a package I just add those hashes in. This way I do not need to download and hash all the archives for SBOM-generation tasks, since the build always verifies these hashes anyway.

By the way, is using HTTPS URLs as locators a valid scenario for an SPDX SBOM? I want to keep the information about where we got the file from.

A final question I have is what if the SBOM documents have different concepts of package vs file digest?

In my case I put in the file hashes (mostly for .tar.gz source archives). For packages there are some IDs, but packages are transferred into the main SBOM already, so that's out of scope.

I currently append files manually, since I do not have the source files at hand during the SBOM build (or would rather not have them, as that would be pointless storage access and hashing work):

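	// register known digests for a remote source archive that was never on local disk;
	// the second argument ("bldr sources") acts as the FileSystemID, namespacing the entry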
	s.Artifacts.FileDigests[file.NewCoordinates("https://cdn.net/linux-6.14.0.tar.gz", "bldr sources")] = []file.Digest{
		{
			Algorithm: "sha256",
			Value:     "1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef",
		},
		{
			Algorithm: "sha512",
			Value:     "abcdef1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef1234567890",
		},
	}

This is applied in an example based on create_custom_sbom, and it seems to work well, yet feels odd. A custom file cataloger could help here, since I'd be able to just return a list of "found" files I know of. But that's a rather niche use case, so weigh it accordingly.

dsseng · Jun 17 '25 09:06

Also, I could consider implementing a subset of this issue (get source info from a child SPDX, modify locators, encode into the parent SBOM) downstream by wrapping Syft in some custom Go code, but it would probably be a cool feature for upstream.

A small workflow clarification on how all the sources are considered. Maybe I'm doing SBOM wrong; I would be grateful if anyone pointed that out:

  1. We have Pkgfile, containerd/pkg.yaml, containerd/patches/0001.patch, etc. These are the source files readable on the local filesystem.
  2. Evaluating these with custom build-system logic, we can determine a source (a GitHub archive URL) to be downloaded for the build. Let it be https://github.com/containerd/containerd/archive/refs/tags/v2.1.2.tar.gz. We also know its expected hashes (sha256 and sha512) by this time. When building a package we abort the build if those hashes mismatch, so we can assume they are the same as if we had hashed the file locally.
  3. Create an SBOM document for the containerd package: include a generic/ PURL, a CPE from the package build instructions, and the version. Add sources (by hashing locally accessible data): all the evaluated build-system configs and patches. Append the known URL+hashes of the source archive programmatically.
  4. Final build of the Talos root filesystem: we have an SPDX SBOM for each package added (copied from the previously built images, each with its own SBOM), alongside go.mod for Talos itself. Run Syft with the Go and SBOM catalogers.
  5. We get a full product SBOM, including both the Go dependencies and what was used to compile system packages like containerd and Linux.

Notice: Talos has an immutable root filesystem and no package manager. We combine packages (and potentially extensions) when building a system image, and have no package metadata in the root filesystem.

Already working: packages are appended to the SBOM, and package SBOMs are listed as files of the Talos SBOM. This is already great, as we get a list of non-Go dependencies to scan for vulnerabilities. TODO: make the Talos SBOM also list e.g. the containerd source archive as a file, to make it richer and more useful for covering the whole build process. Workaround: publish the SBOM for each package; one can later verify it got included alongside the go.mod by checking hashes.

dsseng · Jun 17 '25 09:06

@dsseng we talked a bit about this on our livestream just now here.

I think the main sticking point right now is the collisions and how we want to represent files that come from outside of the scan source system.

We want to get the format of the file locator correct (if it even exists in the files section of the syft SBOM) so that it's not mixed in with the other files scanned within the source.

Just move those records, potentially adjusting file locator to ensure no collisions

Here is a PR we discussed where we ran into a similar problem and tried to wrestle with/solve files coming from outside the source: https://github.com/anchore/syft/pull/2948

I'll update this comment when we have a more concrete and accepted proposal on how to represent these new nodes.

spiffcs · Jun 19 '25 19:06

Okay, thanks, I'll watch the replay in a few days when I have time. Yes, this needs careful consideration, since Syft is generally a scanner that finds evidence of a package based on files, while we have data about the packages and want to generate an SBOM from it, with extra complexity because we don't use some distro's packages but rather build from source.

dsseng · Jun 19 '25 19:06

Hi @dsseng, we talked about this issue on the livestream this week. I have a couple of ideas, but before I mention them I wanted to point out that a core philosophy of Syft is that it should generally surface things it finds from scanning, so the files section has files that should correspond to logical files in the media scanned. If we include additional files, with hashes, there is an expectation that these were present on the system, which wouldn't be accurate here; as @spiffcs noted above, accurately representing these requires somewhat more fundamental changes to the representation of locations.

However, I also wanted to acknowledge that there are multiple use cases. One such example is describing something like build configuration files in an SBOM, which could then be included in the final application SBOM but currently would be dropped, since the files from the scanned SBOM are not preserved.

In the future we may be able to find the SBOM and include it as its own SBOM in SPDX 3, in which case the files would be associated with the embedded SBOM rather than the scan target. However, that's pretty far off, since it requires both SPDX 3 and a change in the Syft data model to handle embedded SBOMs, which probably isn't trivial to do.

Perhaps a stop-gap solution is to add a flag to opt-in to SBOM cataloger behavior to include the files it reads as if they were part of the normal set, without further qualification. Would this work for you or are there other issues I've missed?

kzantow · Jul 10 '25 20:07

Perhaps a stop-gap solution is to add a flag to opt-in to SBOM cataloger behavior to include the files it reads as if they were part of the normal set, without further qualification. Would this work for you or are there other issues I've missed?

Yes, it'd be nice if whatever I implement for this purpose could be upstreamed behind a --i-am-really-sure flag. The described stop-gap is how I see it, and I might implement it myself at some point; I'll open a PR once done.

dsseng · Jul 10 '25 20:07