syft Encapsulate all artifacts in syft JSON output

Today we output a json structure similar to the following:

{
   artifacts: [
      # list of packages
   ],
   relationships: [
      # list of package relationships
   ],
   distro: {...},
   ...
}

As we move forward and want to capture more kinds of artifacts we could consider moving to something closer to this:

{
   artifacts: {
      packages: [
         # list of packages
      ],
      files: [
         # list of files
      ],
      distro: {...},
      # more artifact types...
   },
   relationships: [
      # list of relationships for ANY artifact
   ],
   ...
}

In this way we can express all artifact kinds agnostically without filling up the root-level object with new elements.

Oct 16 '21 18:10 wagoodman

Adding a little bit from a discussion today: I wonder if this would be better in a DAG style format to enable more/different nodes and potentially make certain things more consistent and less breaking, possibly just having artifacts: [ { id: <id>, type: <some-type> } ] and relationships: [ { from: <id>, to: <id>, type: <type> } ] or so. (or even just artifacts: [ { id, type, relations: [ ... ] } ]) This wouldn't preclude using an internal data model that is more strongly typed. But might make external tools consuming these things a bit more able to handle changes generically. This might be a bad idea.

Oct 19 '21 21:10 kzantow

@kzantow, good ideas. Especially the callout that internally we could stick to "strongly typed bins of nodes" while still outputting JSON with a single bin of all node types mixed. In this way we get the best of both worlds: internally we can keep to typed collections instead of interface{}, and externally a single list makes graph-like operations more natural since there is only one list of nodes.

I think I prefer the approach of one bin for edges and another bin for nodes (as opposed to nodes containing edge descriptions, which could get odd or redundant when describing two-way relationships).
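
To make the "typed bins internally, one node list plus one edge list externally" idea concrete, here is a minimal sketch. All names here (Node, Edge, Document, typedBins, flatten) and the "package", "file", and "contains" type strings are hypothetical illustrations, not existing syft types or values.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Node is one entry in the single, mixed-type artifact list (hypothetical).
type Node struct {
	ID   string `json:"id"`
	Type string `json:"type"`
}

// Edge is a relationship between two nodes, kept in its own bin (hypothetical).
type Edge struct {
	From string `json:"from"`
	To   string `json:"to"`
	Type string `json:"type"`
}

// Document is an assumed external shape: one bin of nodes, one bin of edges.
type Document struct {
	Artifacts     []Node `json:"artifacts"`
	Relationships []Edge `json:"relationships"`
}

// typedBins stands in for the strongly typed internal collections.
type typedBins struct {
	PackageIDs []string
	FileIDs    []string
}

// flatten merges the typed bins into a single node list, packages first.
func flatten(b typedBins) []Node {
	nodes := make([]Node, 0, len(b.PackageIDs)+len(b.FileIDs))
	for _, id := range b.PackageIDs {
		nodes = append(nodes, Node{ID: id, Type: "package"})
	}
	for _, id := range b.FileIDs {
		nodes = append(nodes, Node{ID: id, Type: "file"})
	}
	return nodes
}

func main() {
	bins := typedBins{PackageIDs: []string{"pkg-1"}, FileIDs: []string{"file-1"}}
	doc := Document{
		Artifacts:     flatten(bins),
		Relationships: []Edge{{From: "pkg-1", To: "file-1", Type: "contains"}},
	}
	out, err := json.MarshalIndent(doc, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

With edges in their own list, a two-way relationship is either a single edge with a symmetric type or two explicit edges; neither node has to repeat the description.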

Oct 25 '21 18:10 wagoodman

Regarding the internal representation of an SBOM (not the external JSON representation): (...This entire comment is tangentially related to this issue but can also be implemented with #556 or #554.)

Today we do not have a single data structure that represents all of the things that make up the essential definition of an SBOM. This has led to some sprawl along most of the execution path. Take, for example, the signature for encoding an SBOM:

func Encode(catalog *pkg.Catalog, metadata *source.Metadata, dist *distro.Distro, scope source.Scope, option format.Option) ([]byte, error) { ... }

The necessary elements should really be encapsulated, similar to this:

func Encode(s sbom.SBOM, option format.Option) ([]byte, error) { ... }

I've made a branch that makes these changes in a rough/exploratory capacity: https://github.com/anchore/syft/compare/single-sbom-document

This branch settles on the following definition for the internal representation of an SBOM:

type SBOM struct {
	Artifacts Artifacts
	Source    source.Metadata
}

type Artifacts struct {
	PackageCatalog      *pkg.Catalog
	FileMetadata        map[source.Location]source.FileMetadata
	FileDigests         map[source.Location][]file.Digest
	FileClassifications map[source.Location][]file.Classification
	FileContents        map[source.Location]string
	Secrets             map[source.Location][]file.SearchResult
	Distro              *distro.Distro
}

(note: why aren't there any relationships in this internal representation? They will be added as a field under sbom.SBOM as part of #556.)

The Artifacts section would contain the elements that were discovered during the cataloging phase and can be used downstream. This could also be the data structure that is passed around in a task-based workflow in #554. Artifacts is organized so that catalogers can map directly into this data structure without having to worry about locking or coordination; ideally a single cataloging task would write to a single field on the Artifacts struct.
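
As a rough illustration of the "one task, one field" idea, the sketch below runs several tasks concurrently against a shared struct without a mutex. The Artifacts type here is a simplified stand-in with placeholder field types, and task and runTasks are hypothetical names, not part of syft or of #554.

```go
package main

import (
	"fmt"
	"sync"
)

// Artifacts is a simplified stand-in; the field types are placeholders,
// not the real pkg/source/distro types.
type Artifacts struct {
	Packages    []string
	FileDigests map[string][]string
	Distro      string
}

// task is a hypothetical cataloging task that fills exactly one field.
type task func(a *Artifacts)

// runTasks runs every task concurrently and waits for all of them.
// No mutex is needed as long as each task writes to a distinct field.
func runTasks(a *Artifacts, tasks ...task) {
	var wg sync.WaitGroup
	for _, t := range tasks {
		wg.Add(1)
		go func(t task) {
			defer wg.Done()
			t(a)
		}(t)
	}
	wg.Wait()
}

func main() {
	var a Artifacts
	runTasks(&a,
		func(a *Artifacts) { a.Packages = []string{"openssl"} },
		func(a *Artifacts) { a.FileDigests = map[string][]string{"/bin/sh": {"placeholder-digest"}} },
		func(a *Artifacts) { a.Distro = "debian" },
	)
	fmt.Println(len(a.Packages), len(a.FileDigests), a.Distro)
}
```

This only stays safe while the one-field-per-task rule holds; two tasks that write the same field (or the same map) would need coordination again.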

The remaining fields in the SBOM struct would be for capturing other nuance. If Artifacts answers "what was discovered" the Source should answer "what was scanned". There can be more sibling fields added to this SBOM struct in the future that answer tangential questions, such as "how were these things discovered" (attaching application config agnostically) or what are the explicit "known-unknowns" that occurred during scanning or modified the scope of what was scanned (e.g. permission error when accessing X path, IO error accessing Y device, configured to ignore select paths at input, etc... all of these "known-unknowns" can be further detailed under https://github.com/anchore/syft/issues/518). I think we can get a first pass with this data structure without exhaustively adding all of these things up front, and instead get these added iteratively.

There is one breaking change in this branch: scope has been removed from the minimal SBOM description. Because the application configuration is included in the SBOM, and scope is a configurable option, scope is already captured there. Application configuration is included to aid reproducibility of the SBOM and, as a whole, to answer the question of "how" the SBOM was made. scope more clearly falls into this category (the "hows") than into the minimal SBOM elements (the "whats").

The syftjson.Format encoder can be responsible for taking the proposed internal shape and transforming it for external presentation, as described here: https://github.com/anchore/syft/issues/555#issuecomment-947119666

Open questions I have about this path:

Should we include a normalization of the application configuration as an explicit item on the SBOM, or should it be left out? (note: this could come in a future increment)

Oct 26 '21 15:10 wagoodman
