bazel
bazel copied to clipboard
Check that digesting consumes the expected number of bytes.
In Bazel, the time a file is stated to find its size can be very far from the time the file's digest is computed, which creates a window for shenanigans. Allow passing the expected size into Path.getDigest and raise an error there if the number of digested bytes doesn't match the previously seen size.
If file size differs, the digest must be different. Can you share more details about the motivation/background for this PR?
The potential problem is Bazel doesn't compute the file size by counting how many bytes are fed into the hash function. Rather, the file is stated.. Thus, there's a race between computing the size of the digest and computing the digest itself.
In other words, I'm trying to add a best-effort consistency check.
Internally, we embed the size into the digest (i.e. the file size can be derived from the digest) so I can understand the change. But I am not sure whether missing the file size could be a problem in practice.
Considering lots of effect will be needed to import this PR, I will let @meisterT make the decision.
If we decide to add this, we should consider merging it with --experimental_guard_against_concurrent_changes (which only applies when uploading action results to a remote/disk cache). In other words, I don't think we should have two separate mechanisms to detect unexpected changes to build outputs.
Steps can likely to taken to make this easier to merge if needed. Note I added a Path.getDigest overload taking a size; the no-argument version is preserved meaning no callers of Path.getDigest() will break. I did modify FileSystem.getDigest on the assumption that it is a less prominent interface than Path. If Google having many FileSystem implementations is a difficulty, I could similarly add an override for FileSystem.getDigest.
With regards to --experimental_guard_against_concurrent_changes (https://github.com/bazelbuild/bazel/issues/3360#issuecomment-1541332142), the problem is no one sets that flag. Unlike that flag, this change does not add any more I/O to Bazel; it's a simple internal comparison of two integers.
Internally, we embed the size into the digest (i.e. the file size can be derived from the digest) so I can understand the change.
Alternatively, would there be a way to align the hash functions available in Bazel with this construction?