specs: include source provenance in `spec.json` and package hash
We've included a package hash in Spack since #7193 for CI, and we started using it on the spec in #28504. However, what goes into the package hash is opaque.
We want this information to be available from the concrete spec so that it can be retrieved as provenance without referring to a `package.py` file. Why?
- We want to be able to look up authoritative checksums, versions, etc. and pass them through scanners long after an installation takes place.
- Right now you can look at the version on a `Spec`, then look up the checksum for a version from the `package.py`, but that's not reliable in the long term. We may remove versions from `package.py` (in fact, we might want to as they become less relevant), and checksums for particular versions/releases may change over time.
- Only the version is stored explicitly on the spec; that's not sufficient to know exactly what hashes a spec was built with.
Adding hashes explicitly gets around this and gets us closer to having all the information we need for a complete SBOM.
Here's what `spec.json` looked like before:
{
  "spec": {
    "_meta": {
      "version": 3
    },
    "nodes": [
      {
        "name": "zlib",
        "version": "1.2.12",
        ...
        "patches": [
          "0d38234384870bfd34dfcb738a9083952656f0c766a0f5990b1893076b084b76"
        ],
        "package_hash": "pthf7iophdyonixxeed7gyqiksopxeklzzjbxtjrw7nzlkcqleba====",
        "hash": "ke4alug7ypoxp37jb6namwlxssmws4kp"
      }
    ]
  }
}
The `package_hash` there is a hash of the concatenation of:
- A canonical hash of the `package.py` recipe, as implemented in #28156;
- `sha256` sums of patches applied to the spec; and
- `sha256` sums of archives, or commits/revisions of repos, used to build the spec.
There are some issues with this: patches are counted twice in this spec (in `patches` and in the `package_hash`), the hashes of sources used to build are conflated with the `package.py` hash, and we don't actually include resources anywhere.
With this PR, I've expanded the package hash out in the `spec.json` body. Here is the "same" spec with the new fields:
{
  "spec": {
    "_meta": {
      "version": 3
    },
    "nodes": [
      {
        "name": "zlib",
        "version": "1.2.12",
        ...
        "package_hash": "6kkliqdv67ucuvfpfdwaacy5bz6s6en4",
        "sources": [
          {
            "type": "archive",
            "sha256": "91844808532e5ce316b3c010929493c0244f3d37593afd6de04f71821d5136d9"
          }
        ],
        "patches": [
          "0d38234384870bfd34dfcb738a9083952656f0c766a0f5990b1893076b084b76"
        ],
        "hash": "ts3gkpltbgzr5y6nrfy6rzwbjmkscein"
      }
    ]
  }
}
Now:
- Patches and archive hashes are no longer included in the `package_hash`;
- Artifacts used in the build go in `sources`, and we tell you their checksum in the `spec.json`;
- `sources` will include resources for packages that have them;
- Patches are the same as before -- but only represented once; and
- The `package_hash` is a base32-encoded `sha1`, like other hashes in Spack, and it only tells you that the `package.py` changed.
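The difference in length between the old and new `package_hash` values is consistent with a base32-encoded `sha256` digest versus a base32-encoded `sha1` digest. The following is a minimal sketch of that encoding using only the Python standard library; the helper name `b32_hash` and the `recipe` bytes are placeholders for illustration, not Spack's actual hashing code or canonicalization.

```python
import base64
import hashlib


def b32_hash(data: bytes, algorithm: str = "sha1") -> str:
    """Return the lowercase base32 encoding of a digest of `data`.

    Hypothetical helper for illustration; not Spack's implementation.
    """
    digest = hashlib.new(algorithm, data).digest()
    return base64.b32encode(digest).decode("ascii").lower()


# Stand-in for the canonicalized package.py source (an assumption).
recipe = b"class Zlib(Package): ..."

new_style = b32_hash(recipe, "sha1")    # 20-byte digest
old_style = b32_hash(recipe, "sha256")  # 32-byte digest

print(len(new_style))  # 32 characters, no padding
print(len(old_style))  # 56 characters, the last four are "====" padding
```

A 20-byte digest divides evenly into 5-byte base32 groups, so the new-style value needs no padding; a 32-byte digest does not, which is where the trailing `====` in the old value comes from.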
The behavior of the DAG hash (which includes the `package_hash`) is basically the same as before, except now resources are included, and we can see differences in archives and resources directly in the `spec.json`.
Note that we do not need to bump the spec meta version on this, as past versions of Spack can still read the new specs; they just will not notice the new fields (which is fine, since we currently do not do anything with them).
Among other things, this will more easily allow us to convert Spack specs to SBOMs and track relevant security information (like `sha256` sums of archives). For example, we could do continuous scanning of a Spack installation based on these hashes, and if the `sha256` sums become associated with CVEs, we'll know we're affected. (@vsoch FYI)
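As a rough illustration of the scanning idea, the sketch below reads a spec in the new format and checks the recorded checksums against a set of flagged hashes. The function names and the `flagged` set are hypothetical; a real scanner would query an actual vulnerability feed, and nothing here is part of Spack.

```python
import json

# Trimmed copy of the zlib node from the description (elided fields omitted).
SPEC_JSON = """
{
  "spec": {
    "_meta": {"version": 3},
    "nodes": [
      {
        "name": "zlib",
        "version": "1.2.12",
        "package_hash": "6kkliqdv67ucuvfpfdwaacy5bz6s6en4",
        "sources": [
          {"type": "archive",
           "sha256": "91844808532e5ce316b3c010929493c0244f3d37593afd6de04f71821d5136d9"}
        ],
        "patches": [
          "0d38234384870bfd34dfcb738a9083952656f0c766a0f5990b1893076b084b76"
        ],
        "hash": "ts3gkpltbgzr5y6nrfy6rzwbjmkscein"
      }
    ]
  }
}
"""


def artifact_checksums(spec_dict):
    """Yield (package, version, kind, sha256) for every recorded artifact."""
    for node in spec_dict["spec"]["nodes"]:
        # Specs written before this change simply have no "sources" key.
        for source in node.get("sources", []):
            if "sha256" in source:
                kind = source.get("type", "source")
                yield node["name"], node["version"], kind, source["sha256"]
        for patch_sha in node.get("patches", []):
            yield node["name"], node["version"], "patch", patch_sha


def affected(spec_dict, flagged_checksums):
    """Return the artifacts whose checksum appears in `flagged_checksums`."""
    return [e for e in artifact_checksums(spec_dict) if e[3] in flagged_checksums]


spec = json.loads(SPEC_JSON)

# Hypothetical feed of checksums associated with advisories.
flagged = {"91844808532e5ce316b3c010929493c0244f3d37593afd6de04f71821d5136d9"}

hits = affected(spec, flagged)
for name, version, kind, sha in hits:
    print(f"{name}@{version}: flagged {kind} {sha}")
```

Because the lookup needs only the `spec.json`, it keeps working even if the version or checksum is later removed from `package.py`.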
- [x] Add a method, `spec_attrs()`, to `FetchStrategy` that can be used to describe a fetcher for a `spec.json`.
- [x] Simplify the way `package_hash()` is handled in Spack. Previously, it was handled as a special-case spec hash in `hash_types.py`, but it really doesn't belong there. Now, it's handled as part of `Spec._finalize_concretization()` and `hash_types.py` is much simpler.
- [x] Change `PackageBase.content_hash()` to `PackageBase.artifact_hashes()`, and include more information about artifacts in it.
- [x] Update package hash tests and make them check for artifact and resource hashes.
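To make the `spec_attrs()` item concrete, here is a minimal sketch of how per-fetcher descriptions could be collected into a `sources` list. The classes below are simplified stand-ins, not Spack's `FetchStrategy` hierarchy. Only the `{"type": "archive", "sha256": ...}` shape comes from the example above; the `"git"` type and the `"commit"` key are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ArchiveFetcher:
    """Stand-in for a fetcher that downloads a checksummed tarball."""
    sha256: str

    def spec_attrs(self):
        # Matches the shape shown in the example spec.json above.
        return {"type": "archive", "sha256": self.sha256}


@dataclass
class GitFetcher:
    """Stand-in for a version-control fetcher (field names are assumed)."""
    commit: str

    def spec_attrs(self):
        # Assumed shape: the description only shows the archive case.
        return {"type": "git", "commit": self.commit}


def sources_for(fetchers):
    """Describe every fetcher (main source plus resources) for spec.json."""
    return [fetcher.spec_attrs() for fetcher in fetchers]


sources = sources_for([
    ArchiveFetcher("91844808532e5ce316b3c010929493c0244f3d37593afd6de04f71821d5136d9"),
    GitFetcher("0123456789abcdef0123456789abcdef01234567"),  # made-up commit
])
print(sources)
```

Having each fetcher describe itself keeps the spec-writing code independent of how any particular source is fetched.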
@tgamblin since you will be working on package hashes, can you also try fixing #30720?
Quick question: will this fix the following problem? If Spack has a recipe for a package whose sources are taken from Git (or any other version control, I think), and I have a local repository with the same package but a different Git URL, Spack will blindly use the sources from the Spack mirror.
@spackbot run pipeline
I've started that pipeline for you!
Just pinging here to see what the status/blockers are for this MR to be merged. Having the capability to produce an SBOM is becoming highly desirable at LANL (perhaps there's another MR that does this?).
I did https://github.com/spack/spack-sbom and a follow-up PR here to add a command in 2021 (since closed), but you probably want the changes that @tgamblin thinks are important merged first.
@scheibelp: RE:
This is trending toward listing out more information individually in the node dict:
Yes.
what is the point of that? I don’t perceive a benefit other than that it becomes more-readable; IMO there are already mechanisms for getting a better interface to this information, in particular reading it in as a Spec object.
There are no reliable interfaces for retrieving this from an old installation. `package.py` files may drift; we need an authoritative place for this stuff. The Spec is where authoritative provenance goes.
I put some motivation in the description, but here it is for posterity:
We want this information to be available from the concrete spec so that it can be retrieved as provenance without referring to a `package.py` file. Why?
- We want to be able to look up authoritative checksums, versions, etc. and pass them through scanners long after an installation takes place.
- Right now you can look at the version on a `Spec`, then look up the checksum for a version from the `package.py`, but that's not reliable in the long term. We may remove versions from `package.py` (in fact, we might want to as they become less relevant), and checksums for particular versions/releases may change over time.
- Only the version is stored explicitly on the spec; that's not sufficient to know exactly what hashes a spec was built with.
Adding hashes explicitly gets around this and gets us closer to having all the information we need for a complete SBOM.
Does the launch of Protobom affect this PR in any way? Alternatively, should there be read-in converters for `spec.yaml` and `spack.lock` files in Protobom itself?
oh man, protobom! I love protocol buffers so this is :pinched_fingers: