slsa Better define "source"

Problem 1

Right now the term "source" is ambiguous, and it is leading to confusion and disagreements, such as #127. Different people have different ideas in their heads about what "source" means and how controls on source can be implemented. In particular, I believe many readers consider it the primary "code" that is compiled, whereas @TomHennen and I had considered it the "build configuration," since there is no technical means to verify what the actual "code" was. Once we clarify, I think it will be much easier to make decisions such as #127.

My proposal is to model a build as having only the following inputs (rough draft):

- Build Interpreter: The software that interprets the build configuration.
  - Examples: GitHub Actions (GHA), Google Cloud Build (GCB)
- Build Configuration (Steps? Recipe?): The way the build is configured for this particular project, usually as a list of steps.
  - Examples: GHA workflow definition at .github/workflows/build.yml within a particular repo, GCB steps defined in a trigger (not in version control).
- Build Parameters: Additional user-provided inputs that affect the output of the build but that are not part of the configuration.
  - Examples: GHA inputs for workflow_dispatch events, GCB substitutions.
  - Note: If the parameter cannot affect the output, it is not listed. For example, an option that sets scheduling priority would not be considered an input to the build since it doesn't affect the output.
- Top-Level Source: The primary input artifact to the build. This is usually a single source revision that "triggered" a CI build.
  - Note: There is no technical means to ensure that this really was the source. In some cases, the top-level source only contains build configuration and the actual true "code" comes from a dependency and looks like a library. Examples: Chrome LUCI, curl Alpine package.
- Build Environment: Software artifacts that are provided by the build system that the user has no control over. Ideally this is a single VM base image or container image, which may have its own provenance.
- Dependencies: Software artifacts that were fetched during the build as requested by the build configuration, the top-level source, or another dependency.
  - A build MAY further differentiate between libraries and build tools. This may help prioritize between different dependencies. However, there is no technical means to differentiate between these, so consumers SHOULD consider this "best effort."
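To make the proposed model easier to discuss, here is a rough sketch of the inputs as a data structure. This is only an illustration: the class names, field names, and the example URIs and digests are all hypothetical, and none of this is the Provenance schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Artifact:
    """A software artifact identified by URI and digest (hypothetical shape)."""
    uri: str
    sha256: str


@dataclass
class BuildInputs:
    """The build inputs listed above; every field name is illustrative."""
    interpreter: str                      # e.g. "GitHub Actions"
    configuration: Artifact               # e.g. a workflow definition file
    # Only parameters that can affect the output belong here.
    parameters: Dict[str, str] = field(default_factory=dict)
    # Best effort: nothing proves this really was the source.
    top_level_source: Optional[Artifact] = None
    # Ideally a single VM base image or container image.
    environment: Optional[Artifact] = None
    dependencies: List[Artifact] = field(default_factory=list)

    def artifacts(self) -> List[Artifact]:
        """Every artifact input in a stable order, skipping unset ones."""
        singles = [self.configuration, self.top_level_source, self.environment]
        return [a for a in singles if a is not None] + list(self.dependencies)


# Placeholder values only; the URIs and digests are made up.
example = BuildInputs(
    interpreter="GitHub Actions",
    configuration=Artifact(
        "git+https://example.com/repo@abc#.github/workflows/build.yml",
        "00" * 32,
    ),
    parameters={"release": "true"},
    dependencies=[Artifact("https://example.com/lib.tar.gz", "11" * 32)],
)
```

Keeping top_level_source optional reflects the note above that the top-level source is not always known or verifiable.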

Alternatives considered:

SPDX Relationships: I'm not sure they fit cleanly into our model.
@dlorenc had previously suggested a multi-layer model of builds in some GitHub issue, though I can't find it at the moment.

Once we decide this, we'll need to update the Provenance schema to match.

Problem 2

How to handle "source package/archive". See comment.

Aug 10 '21 13:08 MarkLodato

I am indeed one of the confused parties. This issue and #130 will go a long way to clearing things up, I think. What we are effectively saying, then, is that our source requirements apply only to the recipe (Because that is the only source the build system can automatically verify)?

The proposed model of a build maps well to my less cloudy mental model.

Aug 11 '21 09:08 joshuagl

> What we are effectively saying, then, is that our source requirements apply only to the recipe (Because that is the only source the build system can automatically verify)?

That part is still to be decided. Based on the discussion in #127, it seems like there is a general desire to have some requirements on source. My guess at the current moment is that we'll land on stronger requirements for the recipe and some sort of "best effort" requirements on the source. But we'll see where the discussion takes us!

Aug 11 '21 12:08 MarkLodato

With https://github.com/slsa-framework/slsa/pull/141 now further clarifying the difference between "source" and "build instructions", we're now (still) in a spot where there are no requirements about the provenance for the "source" itself for an artifact until L4.

This is a bit of an extreme case, but you could imagine a repo setup like this:

The provenance contains a reference to the build config, which might not contain a real reference to the source code itself.

I think this wouldn't come into play until L4, where the build provenance is required to list all transitive dependencies.

Sep 02 '21 17:09 dlorenc

Right.

What do you think about:

At L3+(?) require that builders list all sources & dependencies they fetch for their users. Point out that for the case you describe source.zip wouldn't be included since the builder doesn't know about it (how could it?).

Once you hit L4 this naturally should list everything since now deps are declared up front and folks can't make network requests inside the build.

Sep 02 '21 17:09 TomHennen

My high-level, hand-wavy idea is that it would be really nice if we could require the provenance to contain a reference to the actual primary, direct source code used at earlier levels, maybe L2?

I realize it's a weak reference: without hermetic builds there's no guarantee that the source didn't just pull some other source, which pulled some other source, etc. But just having the "best-effort" data there would be nice.

Then at L3, I'd like to add a similar "best-effort" requirement around direct dependencies.

Terminology is bound to be unclear here, but I think there's a reasonably consistent understanding of the difference between the source code of a binary itself, the direct dependencies, and the transitive dependencies.

So we could have a progression like:

Build Instructions
Source Code
Direct Dependencies
Transitive Dependencies/Hermetic

I don't know how we'd enforce this really though.

Sep 02 '21 17:09 dlorenc

Something we could do, which would be easy enough, is require that the builder contain a reference to source code from a source control system (meets the "L2 requires source control" requirement).

Many (most?) builders I've seen actually store the config-as-code instructions in the primary source control system anyways, so if you're using config-as-code you'd get 'direct' source code in most cases. If you're not using config-as-code then the builder needs to figure out what it should list. It could list whatever sources it fetches or it could just make the user tell it what source control system the source came from (impossible to verify, but maybe good enough for L2?).

By just requiring that some source be listed we'd have requirements that are easy enough to verify at the back end (the provenance either does or doesn't reference a source control system) without having to worry about splitting hairs on what 'direct' source is.
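A minimal sketch of what that back-end check could look like. It assumes the provenance is a dict with a "materials" list whose entries carry a "uri", and that source control references use a VCS-prefixed URI such as "git+https://...". Both the shape and the prefix list are assumptions made for illustration, not a settled format.

```python
from typing import Any, Dict

# Assumed convention: a VCS reference is a URI starting with one of these.
VCS_PREFIXES = ("git+", "hg+", "svn+")


def references_source_control(provenance: Dict[str, Any]) -> bool:
    """Return True if at least one material points at a source control system."""
    materials = provenance.get("materials", [])
    return any(
        str(material.get("uri", "")).startswith(VCS_PREFIXES)
        for material in materials
    )
```

Note that this only confirms a source control reference is present. It says nothing about whether that reference is the 'direct' source, which is the hair-splitting this approach avoids.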

I think that meets the progression you've outlined. WDYT?

Sep 02 '21 18:09 TomHennen

> Something we could do, which would be easy enough, is require that the builder contain a reference to source code from a source control system (meets the "L2 requires source control" requirement).

Nice!

> I think that meets the progression you've outlined. WDYT?

Agreed. This sounds like a good approach.

Sep 03 '21 11:09 joshuagl

+1 there!

Sep 03 '21 11:09 dlorenc

#149 is an initial swing at this. If people like this language I could include something similar about dependencies.

But perhaps 'dependencies' could be a separate issue. I'm actually a bit worried about the "Transitive Dependencies" requirement. It could be taken to mean "include all the dependencies, and their dependencies, and their dependencies, ..." which would likely make the provenance prohibitively large and should really probably be covered by whatever we decide to do about transitivity. Perhaps "Transitive Dependencies" could be reworded to be a bit more precise?

Sep 03 '21 15:09 TomHennen

Absolutely agree that transitive dependencies should be a separate issue, probably even at higher, as-yet-undefined SLSA levels.

Sep 09 '21 18:09 trishankatdatadog

Another issue is how we handle source packages/archives.

Language distros, such as PyPI, distribute "source tarballs/zips" alongside the package.
Linux distros, such as Debian, distribute "source packages" alongside the binary package.

In most cases, the workflow is version control --> [build source archive] --> source archive --> [build binary package] --> binary package. In the SLSA model, which is "source"? Do you need to trace back all the way to version control, or just to the source archive, which in turn would have its own SLSA level?

We currently say that "version control" is required at SLSA 2, but it's not clear what that means.
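To illustrate the "trace back" question, here is a small sketch that follows a chain of "built from" links from a binary package back to whatever has no recorded predecessor. The built_from mapping, the function name, and the example URIs are hypothetical. In practice each link would come from that artifact's own provenance, which is an assumption here.

```python
from typing import Dict, List


def trace_sources(artifact: str, built_from: Dict[str, str]) -> List[str]:
    """Follow 'built from' links until an artifact has no recorded source."""
    chain = [artifact]
    seen = {artifact}
    while chain[-1] in built_from:
        parent = built_from[chain[-1]]
        if parent in seen:
            # A cycle means the recorded links cannot be trusted as a chain.
            raise ValueError(f"cycle detected at {parent!r}")
        chain.append(parent)
        seen.add(parent)
    return chain


# Hypothetical links: binary package <- source archive <- version control.
links = {
    "pkg_1.0_amd64.deb": "pkg_1.0.orig.tar.gz",
    "pkg_1.0.orig.tar.gz": "git+https://example.com/org/pkg@abc",
}
chain = trace_sources("pkg_1.0_amd64.deb", links)
```

Whether "source" means chain[1] (the source archive) or chain[-1] (version control) is exactly the open question.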

Feb 11 '22 14:02 MarkLodato

I think there are 3 principal classes of materials for a build:

The source code, which is the input that is acted upon
The build script, which is the input that acts upon the source
The dependencies, which are essentially passive but are required for the build to complete

Identifying these as different classes of material in attestations would be useful.

Sep 22 '22 20:09 shaunmlowry

In the above case, intermediate artifacts may be included as the source input to build steps in multi-step pipelines. There should be a mechanism to discover the original source having only the attestation for the final artifact, perhaps by using a bundle attestation to record a chain of custody.

Sep 22 '22 20:09 shaunmlowry

> But perhaps 'dependencies' could be a separate issue. I'm actually a bit worried about the "Transitive Dependencies" requirement. It could be taken to mean "include all the dependencies, and their dependencies, and their dependencies, ..." which would likely make the provenance prohibitively large and should really probably be covered by whatever we decide to do about transitivity. Perhaps "Transitive Dependencies" could be reworded to be a bit more precise?

I think the materials section should include every artifact in the build graph for the subject. Note that the attestation only makes claims about the SLSA level conformity for the subject, not the materials which should have their own attestations. That ensures that SLSA level need not be transitive whilst ensuring you can check the validity of every input that led to the subject, at least by checksum.
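As a sketch of "check the validity of every input, at least by checksum": the function below compares each material's recorded SHA-256 digest against the bytes a verifier actually holds. The material shape (a "uri" plus a "digest" map with a "sha256" key) and the contents mapping are assumptions for illustration.

```python
import hashlib
from typing import Any, Dict, List


def unverified_materials(
    materials: List[Dict[str, Any]], contents: Dict[str, bytes]
) -> List[str]:
    """Return URIs whose content is missing or does not match its digest."""
    failures = []
    for material in materials:
        uri = material["uri"]
        expected = material.get("digest", {}).get("sha256")
        data = contents.get(uri)
        # Fail closed: missing bytes or a missing digest both count as unverified.
        if data is None or expected is None:
            failures.append(uri)
        elif hashlib.sha256(data).hexdigest() != expected:
            failures.append(uri)
    return failures
```

This checks integrity only. It makes no claim about the SLSA level of any material, which would come from that material's own attestation.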

Sep 22 '22 20:09 shaunmlowry

Given the inability to discern between "source" and other materials like "dependencies," this complicates the "dependencies complete" requirement too.

Sep 22 '22 21:09 mlieberman85

> Given the inability to discern between "source" and other materials like "dependencies," this complicates the "dependencies complete" requirement too.

Agreed. To achieve "dependencies complete" you have to list the entire build graph for the subject somewhere, and the logical place is the materials section of the attestation. To understand the action taken as described in the predicate it's helpful to be able to discern between the 3 types of input I listed above. In that way you can understand that the build script (predicate) was applied to the source code (source) in the presence of the dependencies (everything else in materials) to produce the artifact (subject)
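Roughly, that reading could be sketched as below: given a flat materials list and a declared source URI, split the list into the source and "everything else." The function name and the idea of declaring the source by URI are hypothetical, since nothing in the attestation format currently marks which material is the source.

```python
from typing import Any, Dict, List, Optional, Tuple

Material = Dict[str, Any]


def split_materials(
    materials: List[Material], source_uri: str
) -> Tuple[Optional[Material], List[Material]]:
    """Separate the declared source from the remaining materials."""
    source = None
    dependencies = []
    for material in materials:
        # The first material matching the declared URI is treated as the source.
        if source is None and material.get("uri") == source_uri:
            source = material
        else:
            dependencies.append(material)
    return source, dependencies
```

If no material matches, source is None and everything lands in dependencies, which mirrors the situation where source and dependencies cannot be told apart.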

Sep 22 '22 22:09 shaunmlowry

Discussed at the 2022-09-26 specification meeting. In short, I believe the consensus among the participants was basically the same as the top post, with the following additions:

Might also be valuable to differentiate between declared (or direct) dependencies and implicit (or transitive or indirect) dependencies, orthogonal to library vs build tool.

Still TBD on what level would require what.

Sep 26 '22 17:09 MarkLodato

I also didn't want to lose track of this issue, where we were trying to discuss build vs. source requirements: https://github.com/slsa-framework/slsa/issues/463

Oct 31 '22 16:10 melba-lopez
