slsa-verifier

Support directory hashes

Open laurentsimon opened this issue 1 year ago • 12 comments

As part of the effort to bring SLSA to ML (https://github.com/google/model-transparency), we need to be able to sign directories. This requires defining a new "hash", i.e. a way to serialize a directory. We have a PoC for this in the repo linked above, and need to implement it in slsa-verifier.

laurentsimon avatar Jan 08 '24 17:01 laurentsimon

/cc @mihaimaruseac @ramonpetgrave64

laurentsimon avatar Jan 08 '24 17:01 laurentsimon

@smeiklej

laurentsimon avatar Jan 08 '24 23:01 laurentsimon

jsonnet-bundler has a small utility method to generate the hash of a directory which might be useful here as well: https://github.com/jsonnet-bundler/jsonnet-bundler/blob/master/pkg/packages.go#L351

netomi avatar Jan 11 '24 19:01 netomi

This code is not safe from a cryptographic hash point of view, e.g. you can rename files to change their meaning without changing the hash. The hash we have in the model repo also handles parallel hashing using a tree. See the comments in https://github.com/google/model-transparency/issues/49

laurentsimon avatar Jan 11 '24 19:01 laurentsimon

An even greater problem with the hash is that it lacks delimiters between files. So the directory

F1: "hello" F2: "world"

will produce the same hash as the directory

F1: "hell" F2: "oworld"

laurentsimon avatar Jan 11 '24 19:01 laurentsimon
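
To make the two problems above concrete, here is a minimal Python sketch. It is not the jsonnet-bundler code or the model-transparency PoC; `naive_dir_hash` and `framed_dir_hash` are hypothetical names, and directories are modeled as in-memory dicts of name to bytes. The naive variant concatenates file contents in name order, so it collides both on the "hello"/"world" example and on a rename. One common remedy, assumed here for illustration, is to length-prefix every name and every content before feeding it to the hash.

```python
import hashlib


def naive_dir_hash(files):
    """Hash file contents back to back: no names, no delimiters (unsafe)."""
    h = hashlib.sha256()
    for name in sorted(files):
        h.update(files[name])
    return h.hexdigest()


def framed_dir_hash(files):
    """Hash each name and content with an 8-byte length prefix.

    The length prefix makes field boundaries unambiguous, and including
    the name means a rename changes the digest.
    """
    h = hashlib.sha256()
    for name in sorted(files):
        for field in (name.encode("utf-8"), files[name]):
            h.update(len(field).to_bytes(8, "big"))
            h.update(field)
    return h.hexdigest()


dir_a = {"F1": b"hello", "F2": b"world"}
dir_b = {"F1": b"hell", "F2": b"oworld"}    # bytes shifted across files
dir_c = {"F1": b"hello", "F3": b"world"}    # second file renamed

# The naive hash cannot tell these directories apart; the framed one can.
print(naive_dir_hash(dir_a) == naive_dir_hash(dir_b))    # True
print(naive_dir_hash(dir_a) == naive_dir_hash(dir_c))    # True
print(framed_dir_hash(dir_a) == framed_dir_hash(dir_b))  # False
print(framed_dir_hash(dir_a) == framed_dir_hash(dir_c))  # False
```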

OK, I did not realize that the directory hash should also take that into account.

Maybe tree hashes as calculated by git would be useful. Here is a test I performed: I created files with the same content in different directories, some with the same filename and one with a different filename, and looked at the hashes git calculates.

If the filenames are equal, the tree hashes are the same; if the filenames differ, the tree hashes differ as well, even though the blob hash is identical in all three cases.

tn@proteus:~/workspace/eclipse/EclipseFdn/tmp$ git ls-tree HEAD
040000 tree 1e6dbf97adb05c42dcb537cd717e368812dc23b5	test
040000 tree 844053933521d6c52f2f96e288dc9175a2e6aea0	test2
040000 tree 1e6dbf97adb05c42dcb537cd717e368812dc23b5	test3

tn@proteus:~/workspace/eclipse/EclipseFdn/tmp$ git ls-tree -r HEAD
100644 blob 557db03de997c86a4a028e1ebd3a1ceb225be238	test/test.txt
100644 blob 557db03de997c86a4a028e1ebd3a1ceb225be238	test2/test2.txt
100644 blob 557db03de997c86a4a028e1ebd3a1ceb225be238	test3/test.txt

netomi avatar Jan 11 '24 20:01 netomi
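
For reference, git's tree hashing can be reproduced without a .git directory. The sketch below is a simplified Python illustration, not something proposed in the thread: it assumes SHA-1 object IDs, treats every file as a regular non-executable file (mode 100644), ignores symlinks and submodules, and models a directory as a nested dict (name to bytes for files, name to dict for subdirectories). The helper names are hypothetical.

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> str:
    """Git hashes '<kind> <size>\\0' followed by the object body."""
    header = f"{kind} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


def blob_id(content: bytes) -> str:
    return git_object_id("blob", content)


def tree_id(entries: dict) -> str:
    """Compute a git tree ID for a nested dict of files and subdirectories."""
    rows = []
    for name, value in entries.items():
        raw_name = name.encode("utf-8")
        if isinstance(value, dict):
            # Subtrees use mode 40000 and sort as if the name ended in '/'.
            mode, oid, sort_key = b"40000", tree_id(value), raw_name + b"/"
        else:
            mode, oid, sort_key = b"100644", blob_id(value), raw_name
        # Each entry is '<mode> <name>\0' plus the 20-byte binary object ID.
        rows.append((sort_key, mode + b" " + raw_name + b"\0" + bytes.fromhex(oid)))
    body = b"".join(row for _, row in sorted(rows))
    return git_object_id("tree", body)


# Same content and filename give the same tree ID; a different filename
# gives a different tree ID even though the blob ID is unchanged.
content = b"example\n"
print(tree_id({"test.txt": content}) == tree_id({"test.txt": content}))   # True
print(tree_id({"test.txt": content}) == tree_id({"test2.txt": content}))  # False
```

Because the name is part of each tree entry and the binary object ID has a fixed width, this scheme does not suffer from the rename or delimiter problems discussed earlier, although it inherits SHA-1 unless the hash function is swapped out.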

This could work, but it forces the existence of a .git directory and ties us to git's hashing algorithm.

mihaimaruseac avatar Jan 12 '24 00:01 mihaimaruseac

Sorry for the misunderstanding, I did not intend to suggest using git itself, but rather its mechanism for generating tree hashes.

netomi avatar Jan 12 '24 08:01 netomi

Oh, fair point. Thanks for the clarification.

mihaimaruseac avatar Jan 12 '24 16:01 mihaimaruseac

Just adding to the conversation:

Merkle trees seem like they could be a good way to hash directories, and someone has tried this in Go.

re: your comments, I think we could add an optional CLI switch to slsa-verifier, something like --enforce-subject-name-and-path, and then, if slsa-github-generator doesn't do so already, it could put the relative paths in the subject.name.

ramonpetgrave64 avatar Jan 16 '24 16:01 ramonpetgrave64
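
As a rough illustration of the Merkle tree idea, here is a minimal Python sketch. It is not the Go implementation mentioned above nor the model-transparency design; the leaf encoding, the 0x00/0x01 domain-separation prefixes, the handling of an odd node (carried up unchanged), and the value returned for an empty directory are all assumptions made for this example. Leaves depend only on a file's relative path and its content digest, so they could be computed in parallel.

```python
import hashlib

# Distinct prefixes keep a leaf from ever being confused with an inner node.
LEAF, NODE = b"\x00", b"\x01"


def leaf_hash(path: str, content: bytes) -> bytes:
    """Leaf = H(0x00 || len(path) || path || H(content))."""
    name = path.encode("utf-8")
    record = len(name).to_bytes(8, "big") + name + hashlib.sha256(content).digest()
    return hashlib.sha256(LEAF + record).digest()


def merkle_root(files: dict) -> str:
    """Merkle root over a dict of relative path -> content, in sorted path order."""
    level = [leaf_hash(path, files[path]) for path in sorted(files)]
    if not level:
        # Assumption for this sketch: an empty directory hashes the bare leaf prefix.
        return hashlib.sha256(LEAF).hexdigest()
    while len(level) > 1:
        paired = [
            hashlib.sha256(NODE + level[i] + level[i + 1]).digest()
            for i in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2:
            paired.append(level[-1])  # odd node is carried up unchanged
        level = paired
    return level[0].hex()


model = {"config.json": b"{}", "weights/part0": b"\x01\x02", "weights/part1": b"\x03"}
print(merkle_root(model))
```

Changing any file's content or path changes the corresponding leaf and therefore the root, while the result does not depend on the order in which files are discovered.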

Thank you! We're now also experimenting with a manifest file instead of a hash of everything, but probably this won't work for SLSA (https://github.com/google/model-transparency/issues/111). Let's continue experimenting

mihaimaruseac avatar Jan 18 '24 17:01 mihaimaruseac

SLSA will replace the manifest format with a provenance format; the rest can probably remain the same.

laurentsimon avatar Jan 18 '24 17:01 laurentsimon
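
One way to picture the manifest-style approach carried over to provenance is a list of in-toto style subjects, one per file, with the relative path in `name` and a `sha256` entry in `digest`. The Python sketch below is only an assumption about how that could look; `directory_subjects` and `verify_directory` are hypothetical helpers, not part of slsa-verifier or slsa-github-generator, and signature and predicate checks are out of scope.

```python
import hashlib
from pathlib import Path


def directory_subjects(root: str) -> list:
    """Build one {name, digest} subject per regular file under root."""
    base = Path(root)
    subjects = []
    for path in base.rglob("*"):
        if not path.is_file():
            continue
        # Reads the whole file into memory; large models would need streaming.
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        subjects.append({
            "name": path.relative_to(base).as_posix(),
            "digest": {"sha256": digest},
        })
    subjects.sort(key=lambda s: s["name"])
    return subjects


def verify_directory(root: str, expected_subjects: list) -> bool:
    """True only if names and digests match exactly, with no extra or missing files."""
    expected = sorted(expected_subjects, key=lambda s: s["name"])
    return directory_subjects(root) == expected
```

Comparing both the name and the digest of every subject is what an enforcement switch along the lines of the one suggested earlier in the thread would need to do; a renamed, added, or removed file then fails verification.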