spdx-spec
Add SHA1GIT as optional file checksum
Transferred from bugzilla: https://bugs.linuxfoundation.org/show_bug.cgi?id=1356
Kate Stewart 2016-05-19 14:06:22 UTC see: http://lists.spdx.org/pipermail/spdx-tech/2016-May/003101.html
and from farther down the thread.
I see how making the SHA1 algorithm non-mandatory would be a breaking change, and that we'd like to avoid that. But maybe we could at least allow SHA1GIT as an additional algorithm and add it to the spec.
WRT the use-case you're asking for: It's all about performance. In our case scanners actually do scan Git checkouts most of the time, as dependencies (be they build-time or runtime) are usually included as Git submodules. When scanning these files, it does not make much sense to force the scanner to calculate the SHA1 of each file (in order to create valid SPDX) if the SHA1GIT is already known. However, I have to admit that getting the blob SHA1 for a given file name is a rather slow operation in Git, and for single small files (which are not uncommon among source code files) it might actually be faster to calculate the SHA1 instead of looking up the known SHA1GIT.
Finally, there's also the "reverse" use-case: Suppose you have an SPDX file with a bunch of File Checksums given, and you'd like to know which are the candidate Git commits these files can originate from. If only the SHA1s are given, you'd have to iterate over all eligible commits in your Git repository, check out the files, and calculate the SHA1 on them to see whether there's a match. With the SHA1GIT, on the other hand, you could directly search Git's object database to find the trees / commits that contain the given blobs.
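To illustrate that reverse use-case, it can be sketched with stock Git commands (this is not from the thread; it assumes a reasonably recent Git with `--find-object` support, and the file path is a placeholder):

```shell
# Reverse lookup: given a file's blob hash, find the commits that
# introduced or changed that exact blob, without re-hashing checkouts.
blob=$(git hash-object some/file.c)            # or take the SHA1GIT from the SPDX document
git log --all --oneline --find-object="$blob"  # commits whose diff touches this blob
```

This searches the object database directly instead of iterating over checkouts and re-hashing, which is exactly the speed-up argued for above.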
I agree it probably is an edge-case, but maybe still enough reason to at least allow SHA1GIT as a File Checksum algorithm.
Regards, Sebastian
Bill Schineller 2016-05-24 17:19:48 UTC Decided not to change Spec version 2.1 with respect to mandatory SHA1.
Also at present not adding sha1git as a checksum type in Spec version 2.1
Changing the whole story around checksums is the type of thing we would consider discussing for an SPDX version 3.0.
(For now we hope to encourage consistent re-usable SPDX documents by sticking with our current approach of uniquely identifying each file by a SHA1)
Also note that Git is starting to put the pieces in place to transition away from SHA-1. So I'm in favor of registering SHA1GIT as an optional hash, with the understanding that in a year or so we're likely to also need a SHA256GIT or similar. And I'd like to keep the current SHA1 requirement for the backwards-compat reasons quoted above, although we'll probably want to also start requiring a stronger hash in parallel in the near future (possibly after Git settles on one?).
there are many SHA1s used by Git... so SHA1GIT means nothing on its own unless you say which one: I guess we are talking about a Git blob checksum here, right?
Yes, as none of the other SHA1s makes sense to use here :wink: But we easily could (and probably should) clarify this by calling it SHA1GITBLOB. But I'm actually not so sure anymore how useful this is. For once, asking Git for the SHA1 of the blob for a given file is less performant than I thought. Secondly, at the latest if you need the package verification code you'd need to calculate the regular SHA1 anyway.
@sschuberth the main use case for a git blob sha1 is IMHO not so much verification but rather lookup and matching, e.g. useful if you build a large index of these. So it has a use, but then many other checksums have a use too. Having a simpler convention such as `checksum:value` (like in Docker `sha256:abdc1234`) would be a better and more future-proof way IMHO than to pile on more named attributes.
Secondly, at the latest if you need the package verification code you'd need to calculate the regular SHA1 anyway.
I think [package verification codes] should be optional or just go away 🗡 They are a major PITA for implementors and a serious obstacle for adoption in general. Having checksums in the first place is plenty enough, and even these should be optional too!
the main use case for a git blob sha1 is IMHO not so much verification but rather lookup
Exactly. Anyway I just did some rough performance comparison again. Running
git hash-object $file
(to calculate the "SHA1GITBLOB" for a file) is indeed slightly slower than running
git ls-tree HEAD $file | cut -f1 | cut -d' ' -f3
(to ask Git for the already computed "SHA1GITBLOB" of a file) as the file size grows. Also the latter is probably more efficient if you need to do this for many (small) files in the repository, as you could list blob hashes for whole trees at once then.
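To make the "whole trees at once" point concrete, a single recursive tree walk lists every tracked file's precomputed blob hash (a sketch, not from the thread; it assumes you are inside a Git work tree):

```shell
# List the already-computed blob hashes for all tracked files in one
# tree walk, instead of one ls-tree/hash-object invocation per file.
# Output format: <mode> blob <hash>\t<path>
git ls-tree -r HEAD
# Or just the hashes (note: awk field-splitting breaks on paths with spaces):
git ls-tree -r HEAD | awk '{print $3}'
```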
Having a simpler convention
Wording nit: That convention is more generic, not necessarily simpler (both from a human and a parser perspective). In any case, I agree such a convention would be better as it's more future-proof.
I think should be optional or just go away
For quite some time I was thinking similarly about the package verification code. But just recently a good use-case occurred to me:
In the Open Source Review Toolkit we're caching scan results for known packages. Take the example of a locally developed Maven package: its pom.xml might contain a version like 1.0.0-SNAPSHOT. The -SNAPSHOT will not go away until the final release, and meanwhile new local commits may be made (and pushed to the Git server). Meaning, there are likely many commits for which the Maven version is 1.0.0-SNAPSHOT. As a result, using the package version (in addition to the package name, namespace, etc.) as the key to the cache is not sufficient.
So what about using the current Git commit hash as part of the cache key then, either instead of or in addition to using the package version? That's slightly better, but still not good enough because:
- It does not account for locally modified files (unless you also take output from `git status` or `git describe` into account to detect dirty working trees).
- It does not account for files that are untracked in Git but are picked up by the scanner.
- It does not account for uninitialized Git submodules (your working tree might be clean, but the current commit's hash is the same whether submodules are checked out or not).
So what can we do instead? We need "something" that uniquely identifies the set of files that have been scanned, in order to use that "something" as the key to the cache of scan results. And that's just where the SPDX package verification code comes in handy.
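As a sketch of why the verification code fits this role: in SPDX 2.x it is defined as the SHA1 of the sorted, concatenated SHA1s of all files in the package (excluding any SPDX documents themselves), so it depends only on the set of file contents and not on Git state. A rough shell equivalent for the current directory:

```shell
# SPDX-style package verification code for the files under the current
# directory: SHA1 over the concatenation of the sorted per-file SHA1s.
# (A real implementation would exclude the SPDX document itself.)
find . -type f -print0 \
  | xargs -0 sha1sum \
  | awk '{print $1}' \
  | sort \
  | tr -d '\n' \
  | sha1sum \
  | awk '{print $1}'
```

Because the per-file hashes are sorted before the final hash, the result is independent of file discovery order, which is what makes it usable as a cache key.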
re:
And that's just where the SPDX package verification code comes in handy.
Good point... I tend to prefer a Merkle tree top level hash for this, but the verif code serves the same purpose in a simpler way. And does not make any assumption about the files listed being a tree.
When I said this "should go away": I actually rather meant that it should not be mandatory in fact. I wrote/spoke too quickly there! 😊
On Fri, Feb 09, 2018 at 07:41:07AM +0000, Sebastian Schuberth wrote:
Yes, as none of the other SHA1s makes sense to use here :wink: But we easily could (and probably should) clarify this by calling it SHA1GITBLOB.
I don't think we need to clarify. Knowing SHA1GIT is all you need to use ‘git cat-file -p ${HASH}’. Git stores the object type internally.
On Fri, Feb 09, 2018 at 07:58:06AM +0000, Philippe Ombredanne wrote:
@sschuberth the main use case for a git blob sha1 is IMHO not so much verification but rather lookup…
I think this is the key. If you have access to the Git repo, and the file you're looking at doesn't match the recorded hash, you can use a SHA1GIT hash to recover the version that was originally reviewed. That makes it easier to see what's changed since the last review, which may make concluding the license expression for the new version simpler.
Having a simpler convention such as
`checksum:value` (like in Docker `sha256:abdc1234`) would be a better and future-proof way IMHO than to pile on more named attributes.
Isn't that what we have, just with a space after the colon [1]?
Secondly, at the latest if you need the package verification code you'd need to calculate the regular SHA1 anyway.
I'm fine with optional package-verification hashes. And for folks who want to use it, I don't think calculation cost should be a major concern. Are folks really CPU or I/O limited in their SPDX generation? I expect folks are much more likely to be reviewer-limited. Or license-matching limited (since matching licenses seems much more computationally intensive than hashing the file).
I don't think we need to clarify. Knowing SHA1GIT is all you need to use ‘git cat-file -p ${HASH}’. Git stores the object type internally.
I believe @pombredanne's point was that, in order to calculate SHA1GIT without using Git, you need to know it's for a "blob". See e.g. this implementation.
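For reference, the "blob" framing under discussion is just a header prepended to the content before hashing, so SHA1GIT can be computed without Git at all (a sketch; the file path is a placeholder):

```shell
# Git blob hash = SHA-1 over "blob <size-in-bytes>\0<file contents>".
# This is what `git hash-object <file>` computes; no repository needed.
file=some/file
size=$(wc -c < "$file")
{ printf 'blob %d\0' "$size"; cat "$file"; } | sha1sum | awk '{print $1}'
```

For an empty file this yields `e69de29bb2d1d6434b8b29ae775ad8c2e48c5391`, the well-known hash of Git's empty blob.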
Isn't that what we have, just with a space after the colon [1]?
True, except that, well, `SHA1GIT` is not (yet) a valid choice here, as what may come after `FileChecksum:` is strictly defined. I'd prefer a specification that says something like "After `FileChecksum:` comes a string denoting the algorithm, and the SPDX parser implementation may decide whether it recognizes that string or not". Modulo some examples for common hashes.
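A lenient reading like the one proposed could be sketched as follows (the tag-value line is a made-up example; the algorithm token is passed through unvalidated for the consumer to accept or reject):

```shell
# Split "FileChecksum: <ALGO>: <value>" without hard-coding the set of
# recognized algorithms; the consumer decides which <ALGO>s it supports.
line='FileChecksum: SHA1GIT: 3b18e512dba79e4c8300dd08aeb37f8e728b8dad'
algo=${line#FileChecksum: }; algo=${algo%%:*}
value=${line##*: }
printf '%s %s\n' "$algo" "$value"
# prints: SHA1GIT 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```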
Are folks really CPU or I/O limited in their SPDX generation?
I expect to be, yes. That is, with a massive amount of SPDX documents we're looking at creating, every saving in computation helps.
On Fri, Feb 09, 2018 at 06:31:55PM +0000, Sebastian Schuberth wrote:
I don't think we need to clarify. Knowing SHA1GIT is all you need to use ‘git cat-file -p ${HASH}’. Git stores the object type internally.
I believe @pombredanne's point was that, in order to calculate SHA1GIT without using Git, you need to know it's for a "blob".
I'm fine following git-hash-object(1) and defaulting the type to blob, without having to spell it out with SHA1GITBLOB. As you point out in [1], no other type makes sense in a FileChecksum value.
Isn't that what we have, just with a space after the colon [1]?
True, except that, well,
`SHA1GIT` is not (yet) a valid choice here, as what may come after `FileChecksum:` is strictly defined. I'd rather prefer a specification that says something like "After `FileChecksum:` comes a string denoting the algorithm, and the SPDX parser implementation may decide whether it recognizes that string or not". Modulo some examples for common hashes.
Right. But I think the spec also needs to provide a registry for hash algorithms. For an example in a Docker/image context, see [2]. As you suggest, authors and clients would be free to write and support other algorithms, but without a central registry (and required tool support for some subset of algorithms), interop becomes less predictable.
Are folks really CPU or I/O limited in their SPDX generation?
I expect to be, yes. That is, with a massive amount of SPDX documents we're looking at creating, every saving in computation helps.
If you write enough SPDX that you need the performance optimization of using the same hash for your files and your package verification code, then you can use the vanilla SHA1 algorithm. If you aren't hardware-limited and want the retrieve-by-hash benefit of SHA1GIT, you can use that. I think we should register SHA1GIT for author use and make it an author decision. With SHA1GIT optional for both authors and tools, there would be no cost for authors or tools that didn't explicitly opt into the new hash.
I think we should register SHA1GIT for author use and make it an author decision. With SHA1GIT optional for both authors and tools, there would be no cost for authors or tools that didn't explicitly opt into the new hash.
Agreed!