specs: include source provenance in `spec.json` and package hash

Open tgamblin opened this issue 2 years ago • 9 comments

We've included a package hash in Spack since #7193 for CI, and we started using it on the spec in #28504. However, what goes into the package hash is opaque.

We want this information to be available from the concrete spec so that it can be retrieved as provenance without referring to a package.py file. Why?

  1. We want to be able to look up authoritative checksums, versions, etc. and pass them through scanners long after an installation takes place.
  2. Right now you can look at the version on a Spec, then look up the checksum for a version from the package.py, but that's not reliable in the long term. We may remove versions from package.py (in fact, we might want to as they become less relevant), and checksums for particular versions/releases may change over time.
  3. Only the version is stored explicitly on the spec; that's not sufficient to know exactly what hashes a spec was built with.

Adding hashes explicitly gets around this and gets us closer to having all the information we need for a complete SBOM.

Here's what spec.json looked like before:

{
  "spec": {
    "_meta": {
      "version": 3
    },
    "nodes": [
      {
        "name": "zlib",
        "version": "1.2.12",
        ...
        "patches": [
          "0d38234384870bfd34dfcb738a9083952656f0c766a0f5990b1893076b084b76"
        ],
        "package_hash": "pthf7iophdyonixxeed7gyqiksopxeklzzjbxtjrw7nzlkcqleba====",
        "hash": "ke4alug7ypoxp37jb6namwlxssmws4kp"
      }
    ]
  }
}

The package_hash there is a hash of the concatenation of:

  • A canonical hash of the package.py recipe, as implemented in #28156;
  • sha256's of patches applied to the spec; and
  • Archive sha256 sums of archives or commits/revisions of repos used to build the spec.
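To make the opacity concrete, here is a minimal sketch of a hash built this way. The function name, the concatenation order, and the string encoding are assumptions for illustration, not Spack's actual implementation. The choice of sha256 with lowercase base32 is only inferred from the shape of the package_hash value in the example above (52 characters followed by `====` padding).

```python
import base64
import hashlib


def legacy_style_package_hash(recipe_hash, patch_sha256s, source_ids):
    """Hash the concatenation of all inputs into one opaque digest.

    Hypothetical helper: the argument order and encoding are assumptions.
    """
    digest = hashlib.sha256()
    for part in [recipe_hash, *patch_sha256s, *source_ids]:
        digest.update(part.encode("utf-8"))
    # Lowercase base32 of a 32-byte digest: 52 characters plus "====" padding.
    return base64.b32encode(digest.digest()).decode("ascii").lower()


combined = legacy_style_package_hash(
    "recipe-hash-placeholder",  # stands in for the canonical package.py hash
    ["0d38234384870bfd34dfcb738a9083952656f0c766a0f5990b1893076b084b76"],
    ["91844808532e5ce316b3c010929493c0244f3d37593afd6de04f71821d5136d9"],
)
print(len(combined), combined.endswith("===="))
```

Given only `combined`, there is no way to tell whether the recipe, a patch, or an archive changed, and none of the inputs can be recovered from it.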

There are some issues with this: patches are counted twice in this spec (in patches and in the package_hash), the hashes of sources used to build are conflated with the package.py hash, and we don't actually include resources anywhere.

With this PR, I've expanded the package hash out in the spec.json body. Here is the "same" spec with the new fields:

{
  "spec": {
    "_meta": {
      "version": 3
    },
    "nodes": [
      {
        "name": "zlib",
        "version": "1.2.12",
        ...
        "package_hash": "6kkliqdv67ucuvfpfdwaacy5bz6s6en4",
        "sources": [
          {
            "type": "archive",
            "sha256": "91844808532e5ce316b3c010929493c0244f3d37593afd6de04f71821d5136d9"
          }
        ],
        "patches": [
          "0d38234384870bfd34dfcb738a9083952656f0c766a0f5990b1893076b084b76"
        ],
        "hash": "ts3gkpltbgzr5y6nrfy6rzwbjmkscein"
      }
    ]
  }
}

Now:

  • Patches and archive hashes are no longer included in the package_hash;
  • Artifacts used in the build go in sources, and we tell you their checksum in the spec.json;
  • sources will include resources for packages that have them;
  • Patches are the same as before -- but only represented once; and
  • The package_hash is a base32-encoded sha1, like other hashes in Spack, and it only tells you that the package.py changed.
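As a rough illustration of what the expanded fields allow, the sketch below reads provenance straight from a spec.json document without consulting any package.py. The helper name `node_provenance` is hypothetical, and the embedded JSON is a trimmed copy of the example above with the elided fields left out.

```python
import json

# Trimmed copy of the example node above; elided fields are simply left out.
SPEC_JSON = """
{
  "spec": {
    "_meta": {"version": 3},
    "nodes": [
      {
        "name": "zlib",
        "version": "1.2.12",
        "package_hash": "6kkliqdv67ucuvfpfdwaacy5bz6s6en4",
        "sources": [
          {
            "type": "archive",
            "sha256": "91844808532e5ce316b3c010929493c0244f3d37593afd6de04f71821d5136d9"
          }
        ],
        "patches": [
          "0d38234384870bfd34dfcb738a9083952656f0c766a0f5990b1893076b084b76"
        ],
        "hash": "ts3gkpltbgzr5y6nrfy6rzwbjmkscein"
      }
    ]
  }
}
"""


def node_provenance(spec_doc):
    """Return one provenance record per node (hypothetical helper)."""
    records = []
    for node in spec_doc["spec"]["nodes"]:
        records.append(
            {
                "name": node["name"],
                "version": node["version"],
                "package_hash": node.get("package_hash"),
                # Older spec.json files have no "sources" key, so default to [].
                "sources": node.get("sources", []),
                "patches": node.get("patches", []),
            }
        )
    return records


records = node_provenance(json.loads(SPEC_JSON))
for record in records:
    for source in record["sources"]:
        print(record["name"], record["version"], source["type"], source["sha256"])
```

Because the archive checksum is stored on the node itself, this keeps working even if the version is later removed from package.py.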

The behavior of the DAG hash (which includes the package_hash) is basically the same as before, except that resources are now included and we can see differences in archives and resources directly in the spec.json.

Note that we do not need to bump the spec meta version on this, as past versions of Spack can still read the new specs; they just will not notice the new fields (which is fine, since we currently do not do anything with them).
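The compatibility argument can be pictured with a small sketch. It assumes, purely for illustration, that an older reader picks out only the keys it knows about; the `KNOWN_FIELDS` tuple and the function name are hypothetical and do not reflect how Spack actually parses nodes.

```python
# Fields an older reader knows about (an assumed, illustrative subset).
KNOWN_FIELDS = ("name", "version", "patches", "package_hash", "hash")


def read_node_like_older_reader(node):
    """Keep only known keys; unknown keys such as "sources" are skipped."""
    return {key: node[key] for key in KNOWN_FIELDS if key in node}


new_node = {
    "name": "zlib",
    "version": "1.2.12",
    "package_hash": "6kkliqdv67ucuvfpfdwaacy5bz6s6en4",
    "sources": [
        {
            "type": "archive",
            "sha256": "91844808532e5ce316b3c010929493c0244f3d37593afd6de04f71821d5136d9",
        }
    ],
    "patches": ["0d38234384870bfd34dfcb738a9083952656f0c766a0f5990b1893076b084b76"],
    "hash": "ts3gkpltbgzr5y6nrfy6rzwbjmkscein",
}

old_view = read_node_like_older_reader(new_node)
print(sorted(old_view))
```

A reader written this way loses nothing it previously relied on; it simply never sees the extra provenance.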

Among other things, this will make it easier to convert Spack specs to SBOMs and to track relevant security information (like sha256s of archives). For example, we could continuously scan a Spack installation based on these hashes, and if the sha256s become associated with CVEs, we'll know we're affected. (@vsoch FYI)
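A scan of that kind could look roughly like the sketch below. It assumes a set of flagged sha256 digests is available from some external advisory source; that set, the function name `affected_nodes`, and the decision to check patches as well as sources are all assumptions for illustration.

```python
ARCHIVE_SHA = "91844808532e5ce316b3c010929493c0244f3d37593afd6de04f71821d5136d9"
PATCH_SHA = "0d38234384870bfd34dfcb738a9083952656f0c766a0f5990b1893076b084b76"

# Minimal spec.json-shaped document, based on the example above.
spec_doc = {
    "spec": {
        "nodes": [
            {
                "name": "zlib",
                "version": "1.2.12",
                "sources": [{"type": "archive", "sha256": ARCHIVE_SHA}],
                "patches": [PATCH_SHA],
            }
        ]
    }
}


def affected_nodes(doc, flagged_sha256s):
    """Return (name, version, matched digests) for nodes using flagged artifacts."""
    hits = []
    for node in doc["spec"]["nodes"]:
        digests = {
            source["sha256"]
            for source in node.get("sources", [])
            if "sha256" in source
        }
        digests.update(node.get("patches", []))
        matched = sorted(digests & set(flagged_sha256s))
        if matched:
            hits.append((node["name"], node["version"], matched))
    return hits


# Hypothetical advisory feed that happens to flag the zlib archive.
print(affected_nodes(spec_doc, {ARCHIVE_SHA}))
```

The lookup needs only the stored spec.json, so it can run long after the installation and independently of the current state of the package repository.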

  • [x] Add a method, spec_attrs() to FetchStrategy that can be used to describe a fetcher for a spec.json.
  • [x] Simplify the way package_hash() is handled in Spack. Previously, it was handled as a special-case spec hash in hash_types.py, but it really doesn't belong there. Now, it's handled as part of Spec._finalize_concretization() and hash_types.py is much simpler.
  • [x] Change PackageBase.content_hash() to PackageBase.artifact_hashes(), and include more information about artifacts in it.
  • [x] Update package hash tests and make them check for artifact and resource hashes.
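The first checklist item can be sketched as follows. These classes are stand-ins written for this example, not Spack's FetchStrategy hierarchy. The archive entry mirrors the `sources` entry shown in the spec.json above, while the `git`/`commit` keys, the class names, the `sources_for_node` helper, and the example.com URL are guesses or placeholders.

```python
from dataclasses import dataclass


class FetcherSketch:
    """Stand-in for a fetch strategy; not Spack's FetchStrategy class."""

    def spec_attrs(self):
        raise NotImplementedError


@dataclass
class ArchiveFetcherSketch(FetcherSketch):
    url: str
    sha256: str

    def spec_attrs(self):
        # Mirrors the {"type": "archive", "sha256": ...} entry shown above.
        return {"type": "archive", "sha256": self.sha256}


@dataclass
class GitFetcherSketch(FetcherSketch):
    url: str
    commit: str

    def spec_attrs(self):
        # The "git"/"commit" keys are a guess; only an archive entry is shown above.
        return {"type": "git", "commit": self.commit}


def sources_for_node(fetchers):
    """Build the "sources" list for one node from its fetchers (hypothetical)."""
    return [fetcher.spec_attrs() for fetcher in fetchers]


sources = sources_for_node(
    [
        ArchiveFetcherSketch(
            url="https://example.com/zlib-1.2.12.tar.gz",  # placeholder URL
            sha256="91844808532e5ce316b3c010929493c0244f3d37593afd6de04f71821d5136d9",
        )
    ]
)
print(sources)
```

Having each fetcher describe itself keeps the spec.json writer free of per-fetcher special cases, and resources would contribute entries the same way.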

tgamblin avatar Aug 22 '22 17:08 tgamblin

@tgamblin since you will be working on package hashes, can you also try fixing #30720 ?

iarspider avatar Aug 23 '22 20:08 iarspider

Quick question: will this fix the following problem? If Spack has a recipe for a package whose sources are taken from Git (or any other version control system, I think), and I have a local repository with the same package but a different Git URL, Spack will blindly use the sources from the Spack mirror.

iarspider avatar Oct 05 '22 16:10 iarspider

@spackbot run pipeline

tgamblin avatar Dec 04 '22 21:12 tgamblin

I've started that pipeline for you!

spackbot-app[bot] avatar Dec 04 '22 21:12 spackbot-app[bot]

Just pinging here to see what the status/blockers are for this MR to be merged. Having the capability to produce an SBOM is becoming highly desirable at LANL (perhaps there's another MR that does this?).

DarylGrunau avatar May 04 '23 19:05 DarylGrunau

I did https://github.com/spack/spack-sbom and a follow-up PR here in 2021 to add a command (since closed), but you probably want the changes that @tgamblin thinks are important to go in first.

vsoch avatar May 04 '23 19:05 vsoch

@scheibelp: RE:

> This is trending toward listing out more information individually in the node dict:

Yes.

> what is the point of that? I don’t perceive a benefit other than that it becomes more-readable; IMO there are already mechanisms for getting a better interface to this information, in particular reading it in as a Spec object.

There are not reliable interfaces for retrieving this from an old installation. package.py files may drift; we need an authoritative place for this stuff. The Spec is where authoritative provenance goes.

I put some motivation in the description, but here it is for posterity:

We want this information to be available from the concrete spec so that it can be retrieved as provenance without referring to a package.py file. Why?

  1. We want to be able to look up authoritative checksums, versions, etc. and pass them through scanners long after an installation takes place.
  2. Right now you can look at the version on a Spec, then look up the checksum for a version from the package.py, but that's not reliable in the long term. We may remove versions from package.py (in fact, we might want to as they become less relevant), and checksums for particular versions/releases may change over time.
  3. Only the version is stored explicitly on the spec; that's not sufficient to know exactly what hashes a spec was built with.

Adding hashes explicitly gets around this and gets us closer to having all the information we need for a complete SBOM.

tgamblin avatar May 22 '23 09:05 tgamblin

Does the launch of Protobom affect this PR in any way? Alternatively, should there be read-in converters for spec.yaml and spack.lock files in Protobom itself?

greenc-FNAL avatar Apr 19 '24 22:04 greenc-FNAL

oh man, protobom! I love protocol buffers so this is :pinched_fingers:

vsoch avatar Apr 19 '24 23:04 vsoch