bazel-remote
disk cache: store a data integrity header for non-CAS blobs
The header is made up of three fields:
- Little-endian int32 (4 bytes) representing the REAPIv2 DigestFunction.
- Little-endian int64 (8 bytes) representing the number of bytes in the blob.
- The hash bytes from the digest, length determined by the particular DigestFunction (32 for SHA256, 20 for SHA1, 16 for MD5).
Note that we currently only support SHA256, however.
This header is simple to parse, and does not require buffering the entire blob in memory if you just want the data.
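For illustration, here is a minimal Go sketch of reading such a header from the start of a file. The function name and the example path are made up for the sketch; the enum values follow the REAPIv2 DigestFunction definition.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"os"
)

// hashSizes maps REAPIv2 DigestFunction enum values to hash lengths in bytes.
// Only SHA256 is currently supported.
var hashSizes = map[int32]int{
	1: 32, // SHA256
	2: 20, // SHA1
	3: 16, // MD5
}

// readHeader parses the 4-byte digest function, the 8-byte blob size and the
// digest-function-specific hash bytes, leaving r positioned at the blob data.
func readHeader(r io.Reader) (digestFunc int32, blobSize int64, hash []byte, err error) {
	if err = binary.Read(r, binary.LittleEndian, &digestFunc); err != nil {
		return
	}
	if err = binary.Read(r, binary.LittleEndian, &blobSize); err != nil {
		return
	}
	n, ok := hashSizes[digestFunc]
	if !ok {
		return 0, 0, nil, fmt.Errorf("unsupported digest function: %d", digestFunc)
	}
	hash = make([]byte, n)
	_, err = io.ReadFull(r, hash)
	return
}

func main() {
	f, err := os.Open("ac.v2/someblob") // hypothetical blob path
	if err != nil {
		panic(err)
	}
	defer f.Close()

	df, size, hash, err := readHeader(f)
	if err != nil {
		panic(err)
	}
	fmt.Printf("digest function %d, %d bytes, hash %x\n", df, size, hash)
	// The rest of f is the blob itself and can be streamed (e.g. with
	// io.Copy) without buffering the whole file in memory.
}
```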
To distinguish blobs with and without this header, we use new directories for the affected blobs: ac.v2/ instead of ac/ and similarly for raw/.
We do not use this header to actually verify data yet, and we still call os.File.Sync() after file writes (#67).
This also includes a slightly refactored version of PR #123 (load the items from disk concurrently) by @bdittmer.
@buchgr: any thoughts on this? In particular:
- The file format (simple header, manually parsed, room to implement other digests).
- The new directory structure (version number in the directory names, to make later changes easier).
The file format (simple header, manually parsed, room to implement other digests).
Is there a good reason to not use protobuf for the header? Did you consider reusing the Digest message from the REAPI?
The new directory structure (version number in the directory names, to make later changes easier).
I think it's a great idea. Thanks!
The file format (simple header, manually parsed, room to implement other digests).
Is there a good reason to not use protobuf for the header?
I didn't want to use a single protobuf for the entire file (header + blob) because it requires reading the entire file into memory in order to unmarshal, and that might be undesirable for large files.
Using a protobuf just for the header with the blob concatenated seems a little complex (but maybe I just know too little about protobuf). We would need to read an unknown (but bounded) number of bytes of the file into a buffer in order to unmarshal the header; I'm unsure if it is safe to unmarshal protobuf messages with arbitrary bytes appended.
Did you consider reusing the Digest message from the REAPI?
The Digest message alone is insufficient if we want to allow other hash functions in the future. But this would make sense if we figure out a good way to use protobuf for the header.
Using a protobuf just for the header with the blob concatenated seems a little complex (but maybe I just know too little about protobuf). We would need to read an unknown (but bounded) number of bytes of the file into a buffer in order to unmarshal the header; I'm unsure if it is safe to unmarshal protobuf messages with arbitrary bytes appended.
There is the delimited protobuf format that's recommended in the official docs [1] [2]. It's implemented in the Java protobuf library [3]. It simply prepends the serialized proto with the number of bytes to read. Bazel uses it for the BEP.
[1] https://developers.google.com/protocol-buffers/docs/techniques#streaming [2] https://github.com/golang/protobuf/issues/779 [3] https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/Parser.html#parseDelimitedFrom-java.io.InputStream-
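For the sake of discussion, a rough sketch of what the write side could look like with such a length prefix, reusing the REAPI Digest message as the header for illustration. The import paths, helper name and file path are assumptions, not code from this PR.

```go
package main

import (
	"bufio"
	"encoding/binary"
	"io"
	"os"

	pb "github.com/bazelbuild/remote-apis/build/bazel/remote/execution/v2"
	"google.golang.org/protobuf/proto"
)

// writeBlobWithHeader writes a varint length prefix, then the serialized
// header message, then streams the raw blob bytes.
func writeBlobWithHeader(w io.Writer, hdr *pb.Digest, blob io.Reader) error {
	data, err := proto.Marshal(hdr)
	if err != nil {
		return err
	}
	var lenBuf [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(lenBuf[:], uint64(len(data)))
	if _, err := w.Write(lenBuf[:n]); err != nil {
		return err
	}
	if _, err := w.Write(data); err != nil {
		return err
	}
	_, err = io.Copy(w, blob) // the blob itself is never buffered in full
	return err
}

func main() {
	out, err := os.Create("ac.v2/someblob") // hypothetical path
	if err != nil {
		panic(err)
	}
	defer out.Close()

	bw := bufio.NewWriter(out)
	hdr := &pb.Digest{Hash: "placeholder", SizeBytes: 42} // placeholder values
	if err := writeBlobWithHeader(bw, hdr, os.Stdin); err != nil {
		panic(err)
	}
	if err := bw.Flush(); err != nil {
		panic(err)
	}
}
```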
There is the delimited protobuf format that's supported at least by the Java implementation in protobuf and recommended in the official docs [1] [2]. It simply prepends the serialized proto with the number of bytes to read. Bazel uses it for the BEP.
Interesting, I'll try this out- thanks!
As an alternative to storing data as delimited protobufs there is also riegeli [1], which allows storing metadata in the same file and doesn't require reading the whole file into memory. The Kythe project seems to have a Go implementation available [2]. The nice thing about riegeli is that it can also automatically compress the data and save some disk space (although I'm not sure whether the overhead is worth it for non-CAS blobs, even with very fast compression modes like brotli or snappy).
[1] https://github.com/google/riegeli [2] https://pkg.go.dev/kythe.io/kythe/go/util/riegeli?tab=doc
Thanks @Yannic. The overhead of the header is negligible and there isn't a need for compression. I'd prefer to not pull in another dependency for the header.
I would also prefer not to pull in another dependency for what I hope can be a simple header.
But before ruling it out, I wonder if riegeli could also give us the ability to compress the data blobs, re #12 / https://github.com/bazelbuild/bazel/issues/4575?
Writing the header proto size, then the message, then the content sounds pretty reasonable to me.
For inspiration, similar code is in: https://github.com/kythe/kythe/blob/master/kythe/go/platform/delimited/delimited.go
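And the matching read side might look roughly like this, mirroring what that delimited package does; again, the import paths, names and file path are just assumptions for the sketch.

```go
package main

import (
	"bufio"
	"encoding/binary"
	"fmt"
	"io"
	"os"

	pb "github.com/bazelbuild/remote-apis/build/bazel/remote/execution/v2"
	"google.golang.org/protobuf/proto"
)

// readHeader reads the varint size prefix, then exactly that many header
// bytes, leaving br positioned at the start of the blob data.
func readHeader(br *bufio.Reader) (*pb.Digest, error) {
	size, err := binary.ReadUvarint(br)
	if err != nil {
		return nil, err
	}
	buf := make([]byte, size)
	if _, err := io.ReadFull(br, buf); err != nil {
		return nil, err
	}
	hdr := &pb.Digest{}
	if err := proto.Unmarshal(buf, hdr); err != nil {
		return nil, err
	}
	return hdr, nil
}

func main() {
	f, err := os.Open("ac.v2/someblob") // hypothetical path
	if err != nil {
		panic(err)
	}
	defer f.Close()

	br := bufio.NewReader(f)
	hdr, err := readHeader(br)
	if err != nil {
		panic(err)
	}
	fmt.Printf("header: %v\n", hdr)
	// br now yields only the blob bytes, which can be streamed to the client.
	if _, err := io.Copy(os.Stdout, br); err != nil {
		panic(err)
	}
}
```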
But before ruling it out I wonder if riegeli could also give us the ability to compress the data blobs, re #12 / bazelbuild/bazel#4575 ?
Issue #12 talks about making compression part of the caching protocol. It was always my understanding that compression is most interesting for CAS blobs, and we do not store protobufs in the CAS. For that, zlib sounds more interesting for compression, and I believe it ships as part of the Go standard library.
Do any of you feel differently?
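For reference, a tiny sketch of compressing a blob with the standard library's compress/zlib; the file names are placeholders.

```go
package main

import (
	"compress/zlib"
	"io"
	"os"
)

func main() {
	src, err := os.Open("blob.bin") // placeholder input file
	if err != nil {
		panic(err)
	}
	defer src.Close()

	dst, err := os.Create("blob.bin.z") // placeholder output file
	if err != nil {
		panic(err)
	}
	defer dst.Close()

	zw := zlib.NewWriter(dst) // everything written to zw is compressed into dst
	if _, err := io.Copy(zw, src); err != nil {
		panic(err)
	}
	if err := zw.Close(); err != nil { // Close flushes the compressed stream
		panic(err)
	}
}
```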
Which cache operations should do the data verification? Put? Get? Contains?
Have you benchmarked how performance is affected? Especially for HTTP/1.1-without-SSL use cases, where Go's io.Copy can use the Linux kernel's sendfile(2) to transfer data directly from file to socket, instead of copying it to and from user space. I think that is what currently keeps bazel-remote on par with heavily optimized HTTP servers like NGINX.
I haven't performed any benchmarks yet, and I haven't figured out when/where to perform the validation. But I suspect that we will validate data on the first Get or Contains call, and cache the result until the file is rewritten (or the server restarts).
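One possible shape for that kind of validation, purely as a sketch (the package and helper names are hypothetical): hash the bytes while streaming them with io.TeeReader and crypto/sha256, then compare the result against the hash stored in the header.

```go
package cache // hypothetical package, not bazel-remote's actual code

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"io"
)

// copyAndVerify streams r to w and checks the SHA256 of the streamed bytes
// against the expected hash from the blob's header, without buffering the
// whole blob in memory.
func copyAndVerify(w io.Writer, r io.Reader, expected []byte) error {
	h := sha256.New()
	tee := io.TeeReader(r, h) // every byte read from r is also fed to the hasher
	if _, err := io.Copy(w, tee); err != nil {
		return err
	}
	if !bytes.Equal(h.Sum(nil), expected) {
		return fmt.Errorf("digest mismatch: got %x, want %x", h.Sum(nil), expected)
	}
	return nil
}
```

Note that hashing in user space like this means io.Copy no longer sees an *os.File as its source, so the sendfile(2) fast path mentioned above would not apply on verified reads.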