externalReferences type for "source" packages

Open gernot-h opened this issue 3 years ago • 2 comments

Sorry if I overlooked something obvious, but I am missing a way to specify a source archive URL for a component, as a logical counterpart to the distribution type.

Many ecosystems have the concept of a source and a somehow derived package. In Python's PyPI you have a "wheel" and a "source" package (check https://pypi.org/project/chardet/#files), for Linux packages there are binary and corresponding source packages (check https://packages.debian.org/buster/libgcc1) etc.

Deriving the correct "source" package for a component isn't always straightforward, but it is important for many use cases (for example, license clearing or mapping source-level security advisories to binary components). So it would be very helpful to store them in a CycloneDX BOM in a canonical way. Therefore I suggest adding a source type for externalReferences.

Note that this is in most cases not equal to the "vcs" type (which is often some kind of upstream project) because many repositories provide their own source archive, exactly reflecting what was used when building their "binary" packages.

Example:

      "name": "chardet",
      "version": "4.0.0",
      "externalReferences": [
        {
          "type": "distribution",
          "url": "https://files.pythonhosted.org/packages/19/c7/fa589626997dd07bd87d9269342ccb74b1720384a4d739a1872bd84fbe68/chardet-4.0.0-py2.py3-none-any.whl",
          "comment": "PyPI wheel file"
        },
        {
          "type": "source",
          "url": "https://files.pythonhosted.org/packages/ee/2d/9cdc2b527e127b4c9db64b86647d567985940ac3698eeabc7ffaccb4ea61/chardet-4.0.0.tar.gz",
          "comment": "PyPI source archive"
        },
        {
          "type": "vcs",
          "url": "https://github.com/chardet/chardet",
          "comment": "upstream repository"
        }
      ]

gernot-h avatar Nov 03 '21 21:11 gernot-h

Distribution is intentionally not specific to binary, source, hybrid, or other. Multiple distributions can be specified for a component.

Take Maven for example. A single component may have multiple artifacts that are part of the distribution. https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-common/3.3.1/

In this case, there are artifacts for the:

  • binary
  • javadoc
  • sources
  • tests
  • test sources
  • pom

It's not the intent to describe every possible artifact type for every ecosystem. I think if we start separating out the types of distributions, we'll create confusion as not all ecosystems are black and white (source and binary).

For ecosystems where the component is the source (e.g. Perl), there would be confusion about which type to use, as both distribution and source could be equally relevant. JavaScript (npm) could actually be a hybrid containing both source and binary, depending on the package.

In the Python example provided, it's easy enough to identify which distribution is the wheel and which one is not. In the Maven example, Maven has naming conventions so simple pattern matching against the distributions will tell you what they are. Other ecosystems may not be as predictable.
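The pattern matching mentioned above could look roughly like the following for Maven. This is only a sketch, assuming the usual `<artifactId>-<version>[-<classifier>].<extension>` file naming; the function name `classify_maven_artifact` and the label strings are made up for illustration and are not part of CycloneDX or Maven tooling.

```python
import re

# Classifier suffixes taken from the artifact list above.
# The label strings on the right are illustrative only.
KNOWN_CLASSIFIERS = {
    "sources": "sources",
    "javadoc": "javadoc",
    "tests": "tests",
    "test-sources": "test sources",
}


def classify_maven_artifact(filename: str, artifact_id: str, version: str) -> str:
    """Classify a file named <artifactId>-<version>[-<classifier>].<extension>."""
    pattern = re.compile(
        rf"^{re.escape(artifact_id)}-{re.escape(version)}"
        r"(?:-(?P<classifier>[A-Za-z0-9-]+))?\.(?P<ext>[A-Za-z0-9]+)$"
    )
    match = pattern.match(filename)
    if match is None:
        return "unknown"  # e.g. signature or checksum files
    if match.group("ext") == "pom":
        return "pom"
    classifier = match.group("classifier")
    if classifier is None:
        return "binary"  # main artifact without a classifier
    return KNOWN_CLASSIFIERS.get(classifier, "unknown")
```

For example, `classify_maven_artifact("hadoop-common-3.3.1-sources.jar", "hadoop-common", "3.3.1")` yields `"sources"`. Anything that does not follow the convention falls through to `"unknown"`, which is the situation less predictable ecosystems would be in most of the time.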

@coderpatros, @DarthHater what are your thoughts?

stevespringett avatar Nov 04 '21 03:11 stevespringett

Ah, I see, so for my example above I should just use this today:

      "name": "chardet",
      "version": "4.0.0",
      "externalReferences": [
        {
          "type": "distribution",
          "url": "https://files.pythonhosted.org/packages/19/c7/fa589626997dd07bd87d9269342ccb74b1720384a4d739a1872bd84fbe68/chardet-4.0.0-py2.py3-none-any.whl",
          "comment": "PyPI wheel file"
        },
        {
          "type": "distribution",
          "url": "https://files.pythonhosted.org/packages/ee/2d/9cdc2b527e127b4c9db64b86647d567985940ac3698eeabc7ffaccb4ea61/chardet-4.0.0.tar.gz",
          "comment": "PyPI source archive"
        },
        {
          "type": "vcs",
          "url": "https://github.com/chardet/chardet",
          "comment": "upstream repository"
        }
      ]

And it would be the task of the application to either do pattern matching on the URL to differentiate between package types or use other means, like application-specific comment conventions.
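A consumer-side sketch of that URL pattern matching for the chardet example, assuming that a `.whl` suffix means a wheel and that `.tar.gz` or `.zip` means a source archive. The helper names `pypi_package_kind` and `split_distributions` are hypothetical, and the suffix rule is a heuristic rather than anything the specification defines.

```python
from urllib.parse import urlparse


def pypi_package_kind(url: str) -> str:
    """Guess the PyPI package kind from the file name in a download URL."""
    filename = urlparse(url).path.rsplit("/", 1)[-1]
    if filename.endswith(".whl"):
        return "wheel"
    if filename.endswith((".tar.gz", ".zip")):
        return "source"
    return "unknown"


def split_distributions(component: dict) -> dict:
    """Group the URLs of all "distribution" references by guessed kind."""
    grouped: dict = {}
    for ref in component.get("externalReferences", []):
        if ref.get("type") != "distribution":
            continue  # skip vcs, website, and other reference types
        kind = pypi_package_kind(ref["url"])
        grouped.setdefault(kind, []).append(ref["url"])
    return grouped
```

Applied to the component above, `split_distributions` would return one URL under `"wheel"` and one under `"source"`, and would ignore the `vcs` entry. Every consumer has to carry such ecosystem-specific rules itself, which is the cost of this approach.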

gernot-h avatar Nov 08 '21 09:11 gernot-h

@gernot-h is this still an open issue?

jkowalleck avatar May 30 '23 05:05 jkowalleck

Thanks for asking! Yes, definitely. Within Siemens AG, we created a kind of downstream specification extending and narrowing down CycloneDX (parts of it are public in https://github.com/siemens/cyclonedx-property-taxonomy). As a workaround, we specify defined comment fields:

[Image: defined comment fields]

We would highly appreciate it if there were some interoperable upstream solution for this, so BOM scanners can be extended to provide this information over time.

By the way, we also discussed whether a second purl entry for stating source references might be needed, since source URLs are never unambiguous, but for now we don't think it's a good idea.

gernot-h avatar May 31 '23 10:05 gernot-h

That VCS reference could point to the general VersionControlSystem of the project, while source could point to the actual source used for generating the component, which is not necessarily hosted in a VCS and is not intended to be distributed. But then there already is the idea of source distribution, which is a specific type of distribution, one that is intended to be used downstream.

Why would it be necessary to document the source of a component if it was not distributed from source in the first place? I still do not understand. How I see this: If you had an SBOM for product A which has a component B, and B was built/assembled/compiled/generated/packed from some source, then this B should provide a BOM for itself, describing the build process. Providing these capabilities is the goal of #31. There is no need for readers of A's BOM to know how (from which source) B was built or claimed to be built. All that is needed is to know where B came from (distribution) and to have some hashes for integrity checks on B.

jkowalleck avatar May 31 '23 11:05 jkowalleck

A VCS reference would not be sufficient even in cases where the source code is hosted in a public VCS, because we would want a reference to the sources for the particular version of the component, which is always a deep link. Example:

  • VCS: https://github.com/apache/commons-lang
  • Source reference: https://github.com/apache/commons-lang/archive/refs/tags/rel/commons-lang-3.12.0.zip

Determining this deep link to the correct sources can require specific knowledge of the source ecosystem. For example, it may be necessary to understand how Maven Central handles source archives, or what a Golang Proxy is.
Therefore, it would be great if the tool which has this knowledge (such as a CycloneDX scanner) could also record it in its output SBOM.
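For the GitHub case, the deep link in the example above follows the `archive/refs/tags/<tag>` URL layout, so a scanner that already knows the tag could derive it as sketched below. The function name `github_tag_archive_url` is hypothetical, and the sketch deliberately leaves out the hard part: working out which tag corresponds to a given component version is exactly the ecosystem-specific knowledge described above.

```python
def github_tag_archive_url(vcs_url: str, tag: str, archive_format: str = "zip") -> str:
    """Build the archive download URL for a tag of a GitHub repository.

    The tag name must be supplied by the caller; mapping a component
    version to a tag (here "rel/commons-lang-3.12.0") is project-specific.
    """
    if archive_format not in ("zip", "tar.gz"):
        raise ValueError(f"unsupported archive format: {archive_format}")
    return f"{vcs_url.rstrip('/')}/archive/refs/tags/{tag}.{archive_format}"
```

Calling `github_tag_archive_url("https://github.com/apache/commons-lang", "rel/commons-lang-3.12.0")` reproduces the source reference listed above. Other ecosystems, such as Maven Central or a Go proxy, would each need their own rule.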

Currently, it can do so in an externalReferences section with type distribution:

"externalReferences": [
  {
    "type": "distribution",
    "url": "https://github.com/apache/commons-lang/archive/refs/tags/rel/commons-lang-3.12.0.zip",
    "comment": "source archive (download location)"
  }
]

While such an entry is correct, it is very difficult to consume. There can easily be multiple distribution entries - which one contains the source reference?
We currently work around this problem by using a defined comment string, but that is obviously a fragile construct which doesn't scale to partners and customers.
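To make the fragility concrete, here is a minimal sketch of what a consumer of such a comment convention has to do. It assumes the agreed string is exactly the comment from the example above; the constant `SOURCE_COMMENT` and the function `find_source_reference` are illustrative names, not an existing API.

```python
from typing import Optional

# Assumed convention: the exact comment string from the example above.
SOURCE_COMMENT = "source archive (download location)"


def find_source_reference(external_references: list) -> Optional[str]:
    """Return the URL of the one distribution entry marked as source, if any."""
    matches = [
        ref["url"]
        for ref in external_references
        if ref.get("type") == "distribution"
        and ref.get("comment") == SOURCE_COMMENT
    ]
    # Zero or several matches both mean the source cannot be identified.
    return matches[0] if len(matches) == 1 else None
```

A BOM from a producer that writes a different comment, say "PyPI source archive", yields `None` here even though the information is present, which is why a dedicated type would be easier to consume.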

A type of source (or any other type which is clearly distinguished) would greatly improve our situation here.

tsjensen avatar Jun 27 '23 13:06 tsjensen

Looks like this topic was already picked up as a proposed enhancement, but let me still try to answer the question.

Why would it be necessary to document the source of a component if it was not distributed from source in the first place? I still do not understand. How I see this: If you had an SBOM for product A which has a component B, and B was built/assembled/compiled/generated/packed from some source, then this B should provide a BOM for itself, describing the build process. Providing these capabilities is the goal of #31. There is no need for readers of A's BOM to know how (from which source) B was built or claimed to be built. All that is needed is to know where B came from (distribution) and to have some hashes for integrity checks on B.

For our team, this is a compliance as well as maintenance topic. Think about providing a Linux firmware image with several hundred packages based on a certain Linux distribution. Or think about providing a vendored NPM/Ruby... bundle as part of an application download or product.

Now you need to not only provide a "binary" SBOM for your customer, but you also need to check the licenses of all the contained components internally. And you might also want to mirror a snapshot of the used source packages internally in case you need to patch your product/app 5 years from now. For all these topics, we need our BOMs to describe the sources which were used by a 3rd party to provide the binary packages we used. (For well-designed ecosystems like Python or Debian, the 3rd party provides this information, but each in a different way, and you want to import it all in a common format into a central place.) And we don't want to generate several hundred derived BOMs to describe how each of the integrated components was built.

I'm no security guy, but according to https://github.com/anchore/syft/issues/1700#issuecomment-1491967306, having the source information for a given "binary image BOM" is also valuable in vulnerability matching. That's why they invented their own proprietary extension to include this information by adding custom purl qualifiers, much like we did by specifying Siemens-wide CycloneDX comment strings for source links.

We think this is relevant for many distribution use cases and we should have a common solution to express this information.

gernot-h avatar Jun 28 '23 08:06 gernot-h

Thank you very much for your insights. I thought about the topic a lot lately. Here is what I came up with:

Distributions not only have a URL, but have other attributes, too:

  • Kind: either "source" or "binary", where binary could be anything that is not source.
  • Format (tar, tar.gz, exe, rpm, deb, jar, war, phar, pkg, dmg, apk, wheel, egg, gem, nupkg, ...).
  • constraints
    • OperatingSystem (examples: RedHat, Debian-9, NixOS, OpenBSD, Windows-11, Windows-XP, macOS-13.4, iOS-11.2, Android-9, TempleOS, ...)
      • Maybe even version ranges for the operating system ...
    • ProcessorArchitecture (i86, amd64, arm, M1, M2, ...)
    • Runtime (python2, python3.10, node19-or-later, ruby-*, php8, java-11, DotNet3.1, ...)
      • Maybe even version ranges for the runtime ...
    • ... and more ...

There might be a lot of attributes related to a distribution that would be handy to document. If you are documenting distributions in a BOM, the most important thing for me is to mark the one distribution that you actually used to build your product. I might not care about all the possible dists and sources, but I must know which one was actually used during the build, so that I could reproduce and attest the build. Therefore, I would need a marker. (I would like to see an XML constraint that allows only one of the distributions to have this marker.)

Just some examples:

jkowalleck avatar Jun 28 '23 09:06 jkowalleck

Don't overthink it though. I would only need one extra item in the list of possible types. That list was already extended from 16 values in 1.4 to 39 values in 1.5. Let's make it 40 values in 1.6 by adding:

  • source = The URL of a source archive from which the component can be built

I don't need to know any additional details. (Of course, then I won't be able to actually build the component given only the SBOM, but frankly, that will be a problem no matter how much metadata you encode into the SBOM.)

tsjensen avatar Jun 29 '23 11:06 tsjensen

I'm with @tsjensen on this. The latest spec revision already gives people plenty of options to choose from for specialized types of references. But the one that we are still missing for our needs is the reference to source code.

For us it is critical to not only have the information which specific distribution of a component is in use in an application, but also to reference the source it was generated from. This provenance information allows us to conduct additional analysis. For the scope of this analysis we do not need to have all the information to reproducibly build an artifact from source, a reference to the source itself is sufficient.

To provide a simple example: For a component describing a maven package I would expect a "distribution" reference describing the maven repository layout the artifact came from and a "source" link that points to the GitHub release, VCS commit snapshot or any other deep link to the code the artifact was built from. With the current options for the reference type we have no option to clearly express both without resorting to comments.
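With a dedicated reference type, the consumer side reduces to a plain filter on `type`. The sketch below uses `"source"` as proposed in this thread; that value is an assumption here, since it was not a valid type at the time of writing and the final name may differ. The component data is illustrative, and `references_by_type` is a hypothetical helper.

```python
def references_by_type(component: dict, ref_type: str) -> list:
    """Return the URLs of all external references of the given type."""
    return [
        ref["url"]
        for ref in component.get("externalReferences", [])
        if ref.get("type") == ref_type
    ]


# Illustrative component; "source" is the type proposed in this thread.
component = {
    "name": "commons-lang3",
    "version": "3.12.0",
    "externalReferences": [
        {
            "type": "distribution",
            "url": "https://repo1.maven.org/maven2/org/apache/commons/commons-lang3/3.12.0/",
        },
        {
            "type": "source",
            "url": "https://github.com/apache/commons-lang/archive/refs/tags/rel/commons-lang-3.12.0.zip",
        },
    ],
}

source_urls = references_by_type(component, "source")
distribution_urls = references_by_type(component, "distribution")
```

No comment parsing or URL heuristics are needed: `source_urls` and `distribution_urls` each contain exactly the one URL of their type.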

agschrei avatar Jun 30 '23 12:06 agschrei

We discussed this topic in our last core working group meeting. It is still being considered for 1.6. We might use an alternative wording, something along the lines of "source-distribution". CC @stevespringett @coderpatros @DarthHater @CycloneDX/core-team // https://github.com/CycloneDX/specification/pull/269#issuecomment-1845834248

jkowalleck avatar Dec 09 '23 14:12 jkowalleck

fixed via #269

jkowalleck avatar Jan 12 '24 10:01 jkowalleck