Support SBOM query and exploration

Open wagoodman opened this issue 4 years ago • 4 comments

It would be interesting to add something that would allow you to answer simple questions about your SBOM document:

  • "how many packages does it contain?"
  • "are there any packages that contain 'libc' in the name?"
  • "does the given file hash exist in the SBOM?"
  • "are there any packages with zip files?"

It would also be nice to see basic summary information:

  • "list all of my packages"
  • "list all of my files"
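As a rough illustration of what answering these questions takes today, here is a minimal Python sketch over a syft-style JSON document. The sample data, the field names (`artifacts`, `files`, `digests`, `location.path`), and the helper names are assumptions made for illustration, not a confirmed schema or API. The zip question is simplified to listing matching files, since tying files back to their owning packages would need relationship data.

```python
import fnmatch
import json

# Hypothetical sample shaped like a syft JSON document (field names are assumptions).
SBOM_JSON = """
{
  "artifacts": [
    {"name": "musl", "version": "1.2.4"},
    {"name": "libc-utils", "version": "0.7.2"},
    {"name": "busybox", "version": "1.36.1"}
  ],
  "files": [
    {"location": {"path": "/bin/busybox"},
     "digests": [{"algorithm": "sha256", "value": "abc123"}]},
    {"location": {"path": "/opt/data/archive.zip"},
     "digests": [{"algorithm": "sha256", "value": "def456"}]}
  ]
}
"""


def package_count(sbom):
    """How many packages does the SBOM contain?"""
    return len(sbom.get("artifacts", []))


def packages_named_like(sbom, fragment):
    """Names of packages whose name contains the given fragment."""
    return [p["name"] for p in sbom.get("artifacts", []) if fragment in p["name"]]


def has_file_digest(sbom, value):
    """Does any file entry carry the given digest value?"""
    return any(
        digest.get("value") == value
        for entry in sbom.get("files", [])
        for digest in entry.get("digests", [])
    )


def files_matching(sbom, pattern):
    """Paths of files matching a shell-style glob such as '*.zip'."""
    paths = [entry["location"]["path"] for entry in sbom.get("files", [])]
    return [path for path in paths if fnmatch.fnmatchcase(path, pattern)]


sbom = json.loads(SBOM_JSON)
print(package_count(sbom))
print(packages_named_like(sbom, "libc"))
print(has_file_digest(sbom, "abc123"))
print(files_matching(sbom, "*.zip"))
```

Each helper is only a few lines, but every one of them hard-codes knowledge of the document layout, which is the part a built-in query command could hide.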

Example CLI usage:

syft list packages ./sbom.json   # list all packages
syft list files ./sbom.json      # list all files

syft query 'package where name == "libc"' ./sbom.json
syft query 'package has file.name == "*.zip"' ./sbom.json

syft query ./sbom.json           # interactive prompt if nothing is given
>

Implementation question: Inventing a query language seems complex. Is there an existing one that we could leverage more easily?

wagoodman • Oct 16 '21 19:10

Interesting language candidates that we can embed:

  • https://github.com/cue-lang/cue
  • https://github.com/itchyny/gojq
  • https://github.com/google/go-jsonnet

jonasagx • Jan 27 '22 23:01

That's a good config language list to start with. However, the embedded language feels secondary to the kinds of questions we want to be able to answer when given an SBOM. I feel that the query language may "shake itself out" once we have a better handle on the class of questions. For me, this kind of feature is meant to separate "what an SBOM says" from the format of the SBOM itself (SPDX, CycloneDX, etc.).

Today, when you are given an SBOM and need answers to simple questions, the first step is to write a script. The first step in writing that script is to parse the format (XML, JSON, ...) and traverse the specific ontology it represents. Then you can gather a sufficient subset of facts from the document and arrange them in a useful data structure (e.g. map, set, etc.) to answer a question. We want to get away from a user needing to understand "format" distinctions at all. This is an implementation detail to the end user and should be abstracted away as much as possible (entirely?).
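To make the "parse, traverse, gather" steps concrete, here is a minimal Python sketch that normalizes two JSON formats into one format-agnostic set of packages. It assumes CycloneDX JSON (`bomFormat`, `components`, `licenses[].license.id`) and SPDX 2.x JSON (`spdxVersion`, `packages`, `versionInfo`, `licenseConcluded`). The `Package` type and function names are hypothetical, and license expressions are treated as opaque strings rather than parsed.

```python
from typing import NamedTuple


class Package(NamedTuple):
    """Format-agnostic view of one package (hypothetical type)."""
    name: str
    version: str
    licenses: frozenset


def from_cyclonedx(doc):
    """Gather packages from a CycloneDX JSON document."""
    packages = set()
    for component in doc.get("components", []):
        ids = set()
        for entry in component.get("licenses", []):
            lic = entry.get("license", {})
            ident = lic.get("id") or lic.get("name")
            if ident:
                ids.add(ident)
        packages.add(Package(component["name"], component.get("version", ""), frozenset(ids)))
    return packages


def from_spdx(doc):
    """Gather packages from an SPDX 2.x JSON document."""
    packages = set()
    for pkg in doc.get("packages", []):
        concluded = pkg.get("licenseConcluded", "NOASSERTION")
        # Treat the license expression as one opaque string; no expression parsing here.
        ids = frozenset() if concluded in ("NOASSERTION", "NONE") else frozenset([concluded])
        packages.add(Package(pkg["name"], pkg.get("versionInfo", ""), ids))
    return packages


def load_packages(doc):
    """Detect the format and return the same structure either way."""
    if doc.get("bomFormat") == "CycloneDX":
        return from_cyclonedx(doc)
    if "spdxVersion" in doc:
        return from_spdx(doc)
    raise ValueError("unrecognized SBOM format")


cdx = {
    "bomFormat": "CycloneDX",
    "components": [{"name": "zlib", "version": "1.3", "licenses": [{"license": {"id": "Zlib"}}]}],
}
spdx = {
    "spdxVersion": "SPDX-2.3",
    "packages": [{"name": "zlib", "versionInfo": "1.3", "licenseConcluded": "Zlib"}],
}
print(load_packages(cdx) == load_packages(spdx))
```

Once both documents reduce to the same set, any question asked of that set no longer depends on which format the SBOM arrived in.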

What kinds of questions might I write a script for?

  • What are all of the licenses of my direct dependencies?
  • What are all of the licenses being used in all of my application dependencies (transitively)?
  • For a container SBOM, ignoring all packages not related to name="app-name", what are the licenses used for all transitive dependencies?
  • I know of a potentially bad dependency from the 5 o'clock news. Are we using it? If so, what dependencies are related to the offending dependency? Do we need to get rid of other libraries due to their use of this lib? (think log4j)
  • Do I have any dependencies that bring in an "iceberg" of other dependencies (e.g. bringing in one dependency imports all of the several AWS SDK client libs that exist)?
  • What are my dev-dependencies? Ignore all of my non-dev dependencies.
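Several of the questions above reduce to walking a dependency graph. The following Python sketch assumes the SBOM's relationships have already been loaded into a plain mapping from each package to its direct dependencies. The graph, the license table, and the function names are all made up for illustration.

```python
from collections import deque

# Hypothetical graph: package -> direct dependencies.
DEPENDS_ON = {
    "app": ["web-framework", "logging-facade"],
    "web-framework": ["log4j-core", "json-lib"],
    "logging-facade": ["log4j-core"],
    "log4j-core": [],
    "json-lib": [],
}

# Hypothetical license table.
LICENSES = {
    "web-framework": "MIT",
    "logging-facade": "Apache-2.0",
    "log4j-core": "Apache-2.0",
    "json-lib": "BSD-3-Clause",
}


def transitive_dependencies(graph, root):
    """Everything reachable from root, found by breadth-first search."""
    seen = set()
    queue = deque(graph.get(root, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(graph.get(node, []))
    return seen


def dependents_of(graph, target):
    """Everything that directly or transitively depends on target."""
    reverse = {}
    for package, deps in graph.items():
        for dep in deps:
            reverse.setdefault(dep, []).append(package)
    return transitive_dependencies(reverse, target)


def licenses_for(packages, licenses):
    """Distinct licenses for the given packages."""
    return {licenses[p] for p in packages if p in licenses}


print(licenses_for(DEPENDS_ON["app"], LICENSES))                        # direct dependencies
print(licenses_for(transitive_dependencies(DEPENDS_ON, "app"), LICENSES))  # transitive
print(dependents_of(DEPENDS_ON, "log4j-core"))                          # the log4j question
print(len(transitive_dependencies(DEPENDS_ON, "web-framework")))        # "iceberg" size
```

The dev-dependency question is not covered here because it needs a scope marker on each edge or package, which this toy graph does not carry.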

I'm certain there are several questions like these that are very common. I feel that focusing on these questions would be a good starting point.

wagoodman • Feb 01 '22 19:02

We released a tool today to help answer most of these questions: https://github.com/interlynk-io/sbomgr

riteshnoronha • Mar 21 '23 01:03

I've been thinking a bit about different ways we could help improve various use cases. During review of the documentation site, I see that we just tell users to use jq for a lot of tasks. While this is certainly a common tool, and a perfectly acceptable thing to do, there is also the gojq library (mentioned above), which I have seen work quite well. If we were to integrate a centralized filtering step after SBOM creation and before format output, we could potentially enable a lot of use cases that don't need any external tools. In particular, it might be coupled with the existing templating engine to good effect.

Some handwavy examples:

Only output artifacts with names starting with c

syft alpine:latest --query '.artifacts |= map(select(.name | startswith("c")))'
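For readers less familiar with jq's update operator, this Python sketch shows roughly what that filter does: it replaces the `artifacts` list with the matching entries and leaves the rest of the document untouched. The sample document and the `filter_artifacts` helper are hypothetical, and this is not how syft or gojq would implement it.

```python
import copy


def filter_artifacts(sbom, keep):
    """Return a copy of the document with only the artifacts that keep() accepts."""
    result = copy.deepcopy(sbom)
    result["artifacts"] = [a for a in result.get("artifacts", []) if keep(a)]
    return result


# Hypothetical sample document.
sample = {
    "source": {"name": "alpine"},
    "artifacts": [
        {"name": "ca-certificates-bundle"},
        {"name": "busybox"},
        {"name": "curl"},
    ],
}

filtered = filter_artifacts(sample, lambda a: a.get("name", "").startswith("c"))
print([a["name"] for a in filtered["artifacts"]])
```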

With templates, however, it could dramatically simplify writing the template by using jq for much of the logic, including deduplicating results via unique. For example, getting a list of package names mapped to the file where primary evidence was found could end up being a little ugly as a query, but the Go template becomes possible, or even simple, because it no longer has to do complex things it cannot otherwise do, such as deduplication:

syft alpine:latest --query '[.artifacts[] | .name as $name | .locations[] | select(.annotations.evidence == "primary") | "\($name):\(.accessPath)"] | unique | map(split(":") | {key: .[0], value: .[1]}) | from_entries' -o template --template "Package,Location
{{- range $name, $location := .}}
{{$name}},{{$location}}
{{- end}}"

This, of course, would depend on template output consuming the result of the jq processing, whereas filtering would have to process the SBOM more or less in place, probably with some syft -> json -> filter -> syft round trip that wouldn't necessarily apply to template output.
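As a sketch of the query-then-template flow described above, the following Python mirrors the intent of the example: collect name and location pairs where the evidence is primary, deduplicate them, build a name-to-location mapping, and render the CSV. The field names follow the jq expression in the example; the sample data and function names are made up for illustration.

```python
def primary_locations(sbom):
    """Map each package name to a location where primary evidence was found."""
    pairs = {
        (artifact["name"], location["accessPath"])
        for artifact in sbom.get("artifacts", [])
        for location in artifact.get("locations", [])
        if location.get("annotations", {}).get("evidence") == "primary"
    }
    # Sorting then building a dict keeps one location per name (the last one sorted).
    return dict(sorted(pairs))


def render_csv(mapping):
    """Render the mapping the way the template in the example would."""
    lines = ["Package,Location"]
    for name in sorted(mapping):
        lines.append(f"{name},{mapping[name]}")
    return "\n".join(lines)


# Hypothetical sample document.
sample_sbom = {
    "artifacts": [
        {
            "name": "busybox",
            "locations": [
                {"accessPath": "/lib/apk/db/installed", "annotations": {"evidence": "primary"}},
                {"accessPath": "/bin/busybox", "annotations": {"evidence": "supporting"}},
            ],
        },
        {
            "name": "alpine-baselayout",
            "locations": [
                {"accessPath": "/lib/apk/db/installed", "annotations": {"evidence": "primary"}},
                {"accessPath": "/lib/apk/db/installed", "annotations": {"evidence": "primary"}},
            ],
        },
    ]
}

print(render_csv(primary_locations(sample_sbom)))
```

Keeping the query step and the rendering step separate is what lets the template stay a plain loop.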

kzantow • Dec 03 '25 21:12