bazel-remote
disk cache: store a data integrity header for non-CAS blobs
The header is made up of three fields:
- Little-endian int32 (4 bytes) representing the REAPIv2 DigestFunction.
- Little-endian int64 (8 bytes) representing the number of bytes in the blob.
- The hash bytes from the digest, length determined by the particular DigestFunction (32 for SHA256, 20 for SHA1, 16 for MD5).
Note that we currently only support SHA256, however.
This header is simple to parse, and does not require buffering the entire blob in memory if you just want the data.
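For illustration, here is a minimal Go sketch of reading such a header from the start of a file. The function name and the example path are made up for the sketch; the enum values follow the REAPIv2 DigestFunction definition.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"os"
)

// hashSizes maps REAPIv2 DigestFunction enum values to hash lengths in bytes.
// Only SHA256 is currently supported.
var hashSizes = map[int32]int{
	1: 32, // SHA256
	2: 20, // SHA1
	3: 16, // MD5
}

// readHeader parses the 4-byte digest function, the 8-byte blob size and the
// digest-function-specific hash bytes, leaving r positioned at the blob data.
func readHeader(r io.Reader) (digestFunc int32, blobSize int64, hash []byte, err error) {
	if err = binary.Read(r, binary.LittleEndian, &digestFunc); err != nil {
		return
	}
	if err = binary.Read(r, binary.LittleEndian, &blobSize); err != nil {
		return
	}
	n, ok := hashSizes[digestFunc]
	if !ok {
		return 0, 0, nil, fmt.Errorf("unsupported digest function: %d", digestFunc)
	}
	hash = make([]byte, n)
	_, err = io.ReadFull(r, hash)
	return
}

func main() {
	f, err := os.Open("ac.v2/someblob") // hypothetical blob path
	if err != nil {
		panic(err)
	}
	defer f.Close()

	df, size, hash, err := readHeader(f)
	if err != nil {
		panic(err)
	}
	fmt.Printf("digest function %d, %d bytes, hash %x\n", df, size, hash)
	// The rest of f is the blob itself and can be streamed (e.g. with
	// io.Copy) without buffering the whole file in memory.
}
```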
To distinguish blobs with and without this header, we use new directories for the affected blobs: ac.v2/ instead of ac/ and similarly for raw/.
We do not use this header to actually verify data yet, and we still call os.File.Sync() after file writes (#67).
This also includes a slightly refactored version of PR #123 (load the items from disk concurrently) by @bdittmer.
@buchgr: any thoughts on this? In particular:
- The file format (simple header, manually parsed, room to implement other digests).
- The new directory structure (version number in the directory names, to make later changes easier).
The file format (simple header, manually parsed, room to implement other digests).
Is there a good reason to not use protobuf for the header? Did you consider reusing the Digest message from the REAPI?
The new directory structure (version number in the directory names, to make later changes easier).
I think it's a great idea. Thanks!
The file format (simple header, manually parsed, room to implement other digests).
Is there a good reason to not use protobuf for the header?
I didn't want to use a single protobuf for the entire file (header + blob) because it requires reading the entire file into memory in order to unmarshal, and that might be undesirable for large files.
Using a protobuf just for the header with the blob concatenated seems a little complex (but maybe I just know too little about protobuf). We would need to read an unknown (but bounded) number of bytes of the file into a buffer in order to unmarshal the header; I'm unsure if it is safe to unmarshal protobuf messages with arbitrary bytes appended.
Did you consider reusing the Digest message from the REAPI?
The Digest message alone is insufficient if we want to allow other hash functions in the future. But this would make sense if we figure out a good way to use protobuf for the header.
Using a protobuf just for the header with the blob concatenated seems a little complex (but maybe I just know too little about protobuf). We would need to read an unknown (but bounded) number of bytes of the file into a buffer in order to unmarshal the header; I'm unsure if it is safe to unmarshal protobuf messages with arbitrary bytes appended.
There is the delimited protobuf format that's recommended in the official docs [1] [2]. It's implemented in the Java protobuf library [3]. It simply prepends the serialized proto with the number of bytes to read. Bazel uses it for the BEP.
[1] https://developers.google.com/protocol-buffers/docs/techniques#streaming [2] https://github.com/golang/protobuf/issues/779 [3] https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/Parser.html#parseDelimitedFrom-java.io.InputStream-
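For the sake of discussion, a rough sketch of what the write side could look like with such a length prefix, reusing the REAPI Digest message as the header for illustration. The import paths, helper name and file path are assumptions, not code from this PR.

```go
package main

import (
	"bufio"
	"encoding/binary"
	"io"
	"os"

	pb "github.com/bazelbuild/remote-apis/build/bazel/remote/execution/v2"
	"google.golang.org/protobuf/proto"
)

// writeBlobWithHeader writes a varint length prefix, then the serialized
// header message, then streams the raw blob bytes.
func writeBlobWithHeader(w io.Writer, hdr *pb.Digest, blob io.Reader) error {
	data, err := proto.Marshal(hdr)
	if err != nil {
		return err
	}
	var lenBuf [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(lenBuf[:], uint64(len(data)))
	if _, err := w.Write(lenBuf[:n]); err != nil {
		return err
	}
	if _, err := w.Write(data); err != nil {
		return err
	}
	_, err = io.Copy(w, blob) // the blob itself is never buffered in full
	return err
}

func main() {
	out, err := os.Create("ac.v2/someblob") // hypothetical path
	if err != nil {
		panic(err)
	}
	defer out.Close()

	bw := bufio.NewWriter(out)
	hdr := &pb.Digest{Hash: "placeholder", SizeBytes: 42} // placeholder values
	if err := writeBlobWithHeader(bw, hdr, os.Stdin); err != nil {
		panic(err)
	}
	if err := bw.Flush(); err != nil {
		panic(err)
	}
}
```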
There is the delimited protobuf format that's supported at least by the Java implementation in protobuf and recommended in the official docs [1] [2]. It simply prepends the serialized proto with the number of bytes to read. Bazel uses it for the BEP.
Interesting, I'll try this out- thanks!
As an alternative to storing data as delimited protobufs there is also riegeli [1], which allows storing metadata in the same file and doesn't require reading the whole file into memory. The Kythe project seems to have a Go implementation available [2]. The nice thing about riegeli is that it can also automatically compress the data and save some disk space (although I'm not sure whether the overhead is worth it for non-CAS blobs, even with very fast compression modes like brotli or snappy).
[1] https://github.com/google/riegeli [2] https://pkg.go.dev/kythe.io/kythe/go/util/riegeli?tab=doc
Thanks @Yannic. The overhead of the header is negligible and there isn't a need for compression. I'd prefer to not pull in another dependency for the header.
I would also prefer not to pull in another dependency for what I hope can be a simple header.
But before ruling it out, I wonder if riegeli could also give us the ability to compress the data blobs, re #12 / https://github.com/bazelbuild/bazel/issues/4575?
Writing the header proto size, then the message, then the content sounds pretty reasonable to me.
For inspiration, similar code is in: https://github.com/kythe/kythe/blob/master/kythe/go/platform/delimited/delimited.go
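And the matching read side might look roughly like this, mirroring what that delimited package does; again, the import paths, names and file path are just assumptions for the sketch.

```go
package main

import (
	"bufio"
	"encoding/binary"
	"fmt"
	"io"
	"os"

	pb "github.com/bazelbuild/remote-apis/build/bazel/remote/execution/v2"
	"google.golang.org/protobuf/proto"
)

// readHeader reads the varint size prefix, then exactly that many header
// bytes, leaving br positioned at the start of the blob data.
func readHeader(br *bufio.Reader) (*pb.Digest, error) {
	size, err := binary.ReadUvarint(br)
	if err != nil {
		return nil, err
	}
	buf := make([]byte, size)
	if _, err := io.ReadFull(br, buf); err != nil {
		return nil, err
	}
	hdr := &pb.Digest{}
	if err := proto.Unmarshal(buf, hdr); err != nil {
		return nil, err
	}
	return hdr, nil
}

func main() {
	f, err := os.Open("ac.v2/someblob") // hypothetical path
	if err != nil {
		panic(err)
	}
	defer f.Close()

	br := bufio.NewReader(f)
	hdr, err := readHeader(br)
	if err != nil {
		panic(err)
	}
	fmt.Printf("header: %v\n", hdr)
	// br now yields only the blob bytes, which can be streamed to the client.
	if _, err := io.Copy(os.Stdout, br); err != nil {
		panic(err)
	}
}
```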
But before ruling it out I wonder if riegeli could also give us the ability to compress the data blobs, re #12 / bazelbuild/bazel#4575 ?
Issue #12 talks about making compression part of the caching protocol. It was always my understanding that compression is most interesting for CAS blobs, and we do not store protobufs in the CAS. For that, zlib sounds more interesting for compression, and I believe it ships as part of the Go standard library.
Do any of you feel differently?
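For reference, a tiny sketch of compressing a blob with the standard library's compress/zlib; the file names are placeholders.

```go
package main

import (
	"compress/zlib"
	"io"
	"os"
)

func main() {
	src, err := os.Open("blob.bin") // placeholder input file
	if err != nil {
		panic(err)
	}
	defer src.Close()

	dst, err := os.Create("blob.bin.z") // placeholder output file
	if err != nil {
		panic(err)
	}
	defer dst.Close()

	zw := zlib.NewWriter(dst) // everything written to zw is compressed into dst
	if _, err := io.Copy(zw, src); err != nil {
		panic(err)
	}
	if err := zw.Close(); err != nil { // Close flushes the compressed stream
		panic(err)
	}
}
```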
Which cache operations should do the data verification? Put? Get? Contains?
Have you benchmarked how performance is affected? Especially for HTTP/1.1-without-SSL use cases, where Go's io.Copy can use the Linux kernel's sendfile(2) to transfer data directly from file to socket, instead of copying it to and from user space. I think that is what currently keeps bazel-remote on par with heavily optimized HTTP servers like NGINX.
I haven't performed any benchmarks yet, and I haven't figured out when/where to perform the validation. But I suspect that we will validate data on the first Get or Contains call, and cache the result until the file is rewritten (or the server restarts).
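One possible shape for that kind of validation, purely as a sketch (the package and helper names are hypothetical): hash the bytes while streaming them with io.TeeReader and crypto/sha256, then compare the result against the hash stored in the header.

```go
package cache // hypothetical package, not bazel-remote's actual code

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"io"
)

// copyAndVerify streams r to w and checks the SHA256 of the streamed bytes
// against the expected hash from the blob's header, without buffering the
// whole blob in memory.
func copyAndVerify(w io.Writer, r io.Reader, expected []byte) error {
	h := sha256.New()
	tee := io.TeeReader(r, h) // every byte read from r is also fed to the hasher
	if _, err := io.Copy(w, tee); err != nil {
		return err
	}
	if !bytes.Equal(h.Sum(nil), expected) {
		return fmt.Errorf("digest mismatch: got %x, want %x", h.Sum(nil), expected)
	}
	return nil
}
```

Note that hashing in user space like this means io.Copy no longer sees an *os.File as its source, so the sendfile(2) fast path mentioned above would not apply on verified reads.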