model-transparency
Investigate different serializations of directories
We implemented a simple tree-like serialization routine in https://github.com/google/model-transparency/blob/main/model_signing/serialize.py. Other possibilities:
- [ ] go.sum uses dirhash https://github.com/golang/mod/blob/master/sumdb/dirhash/hash.go, which hashes a sorted listing of per-file hashes instead, roughly: `sha256sum $(find . -type f | sort) | sha256sum`
- [ ] Sharding https://github.com/google/model-transparency/pull/24 to enable parallelization. This is similar to dirhash, but also uses a sharding size to parallelize computation. (It also reports empty directories, which can be removed if we decide to.)
Here is a fully specified version of go's dirhash: https://github.com/in-toto/attestation/blob/main/spec/v1/digest_set.md#dirhash
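The listing-based construction can be sketched in Python. This is a minimal sketch of dirhash's "h1:" idea under simplifying assumptions (the real Go scheme prefixes each path with a module name, which is omitted here):

```python
import base64
import hashlib
import os

def dirhash_h1(directory: str) -> str:
    """Minimal sketch of a dirhash-style "h1:" digest: hash each file,
    then hash the sorted listing of "<hex digest>  <path>" lines."""
    lines = []
    for root, _, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            rel = os.path.relpath(path, directory).replace(os.sep, "/")
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            lines.append(f"{digest}  {rel}\n")
    lines.sort()  # the listing order must be deterministic
    summary = hashlib.sha256("".join(lines).encode()).digest()
    return "h1:" + base64.b64encode(summary).decode()
```

Note that, like dirhash, this records only files, so empty directories leave no trace in the digest.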
Another alternative is the git tree hash: https://github.com/in-toto/attestation/blob/main/spec/v1/digest_set.md#gitcommit-gittree-gitblob-gittag. This is similar to the go dirHash except it includes Unix mode bits, which may or may not be desirable.
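For illustration, git hashes objects by prepending a typed header to the content. This sketch covers only the blob case; real tree objects additionally encode the mode bits, file names, and child digests:

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    """Sketch of git blob hashing: SHA-1 over "blob <size>\\0" + content.
    Tree objects similarly hash mode bits, names, and child digests."""
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

# Matches `git hash-object --stdin` for the same content:
print(git_blob_hash(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```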
dirhash is slow; it cannot be parallelized. Currently we have a "sharded" implementation that allows some hashing to happen in parallel; see the current PoC implementation at https://github.com/google/model-transparency/blob/main/model_signing/serialize.py#L289. It's similar to dirhash but adds sharding. We also account for empty folders, which dirhash does not record.
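The sharded idea can be sketched as follows. This is a hypothetical sketch, not the PoC itself; the shard size and the digest-listing format here are made up for illustration:

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

SHARD_SIZE = 1 << 20  # hypothetical 1 MiB shards

def _hash_shard(task):
    # Hash one fixed-size slice of a file.
    path, start, end = task
    h = hashlib.sha256()
    with open(path, "rb") as f:
        f.seek(start)
        h.update(f.read(end - start))
    return f"{path}:{start}:{end}:{h.hexdigest()}"

def sharded_dir_hash(paths):
    """Hash fixed-size shards of each file in parallel, then hash the
    sorted listing of shard digests into one summary digest."""
    tasks = []
    for path in sorted(paths):
        size = os.path.getsize(path)
        for start in (range(0, size, SHARD_SIZE) or [0]):
            tasks.append((path, start, min(start + SHARD_SIZE, size)))
    # hashlib releases the GIL on large buffers, so threads parallelize.
    with ThreadPoolExecutor() as pool:
        lines = sorted(pool.map(_hash_shard, tasks))
    return hashlib.sha256("\n".join(lines).encode()).hexdigest()
```

Unlike plain dirhash, the unit of work is a shard rather than a whole file, so even a single huge weights file can occupy many workers.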
https://github.com/google/model-transparency/blob/main/model_signing/serialize.py#L333 is similar to the git tree hash, I suppose.
I think the construction of multiple files is independent of how you hash each individual file. The go and git ones specify SHA256 but you could replace them with a different file hash function that is parallel.
One parallel algorithm to consider is fs-verity, which is supported in the Linux kernel and also has userspace implementations.
> I think the construction of multiple files is independent of how you hash each individual file. The go and git ones specify SHA256 but you could replace them with a different file hash function that is parallel.

+1
> One parallel algorithm to consider is fs-verity, which is supported in the Linux kernel and also has userspace implementations.

Found an implementation: https://android.googlesource.com/platform/external/fsverity-utils/+/refs/heads/main/lib/compute_digest.c
It's a bit more complicated than what we have implemented, but should work. The Merkle tree approach is very useful for cases where we want to be able to make changes to a small part of a file fast (assuming the rest of the tree is already available). Maybe not a requirement for us, but it may become one later when doing some fine-tuning (?)
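The Merkle-tree property mentioned above can be illustrated with a toy root computation. This is only the idea, not fs-verity's actual on-disk format (which also hashes a descriptor carrying the tree parameters):

```python
import hashlib

BLOCK_SIZE = 4096  # fs-verity's default block size

def merkle_root(data: bytes) -> bytes:
    """Toy Merkle root over fixed-size blocks: changing one block only
    requires rehashing the O(log n) nodes on its path to the root."""
    level = [hashlib.sha256(data[i:i + BLOCK_SIZE]).digest()
             for i in range(0, len(data), BLOCK_SIZE)]
    if not level:  # empty input still gets a well-defined root
        level = [hashlib.sha256(b"").digest()]
    while len(level) > 1:
        # Pair up digests and hash each pair into the next level.
        level = [hashlib.sha256(b"".join(level[i:i + 2])).digest()
                 for i in range(0, len(level), 2)]
    return level[0]
```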
If folks can use a kernel implementation to speed things up without the need to context-switch into userspace (and without re-implementing a new hash), that's great too.
Note that the hash algorithm can be set in the signature file, so this gives us some flexibility to update the hash over time if necessary. We use the Sigstore bundle https://github.com/sigstore/protobuf-specs/blob/main/protos/sigstore_bundle.proto as the wire format (as per the sigstore-python library). The sigstore-python library hard-codes the hash algorithm here: https://github.com/sigstore/sigstore-python/blob/main/sigstore/sign.py#L390
The API only supports bytes/content inputs. I created https://github.com/sigstore/sigstore-python/issues/666 a while back. Having an API that takes a hash + hash algorithm as input may be helpful.
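What such an API could look like, as a hypothetical sketch (none of these names exist in sigstore-python; the point is just that the caller computes the digest once and passes it along with its algorithm, instead of streaming the raw bytes through the library):

```python
import hashlib

def digest_for_signing(data: bytes, algorithm: str = "sha256") -> dict:
    """Hypothetical helper: compute the digest once and carry the
    algorithm name alongside it, so the signer never needs the raw bytes."""
    return {"algorithm": algorithm,
            "digest": hashlib.new(algorithm, data).hexdigest()}

# e.g. a hypothetical signer.sign_prehashed(**digest_for_signing(model_bytes))
```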
/cc @woodruffw
There is also the NAR hash that Nix uses; see page 93 of https://edolstra.github.io/pubs/phd-thesis.pdf. Though it doesn't look like it would bring improvements.
/cc @haydentherapper @di
Closing this, as with the manifest we no longer serialize a model directory to a single hash. The library supports converting a manifest to a single hash and serializing a directory directly to a hash, a la dirhash.