remote-apis
Compression support for inlined data (v2-compatible)
We added support for compressed blobs via the bytestream API a while back (#147), and this has been working well so far. Given that we don't seem to be gearing up for a non-backwards compatible REAPIv3 release, I think we should start investigating adding backwards-compatible support for compression of inlined data, aiming to minimize roundtrips while still using compressed data.
One idea that I have been thinking about is to create a new Blob message which has the same fields as Digest, but with two additional optional fields: bytes data to hold the inlined data and Compressor.Value compressor to describe the encoding. Then, we would replace each usage of Digest in a message that currently also has a separate field for inlined (uncompressed) data with a Blob field, and deprecate the previous data field.
For example:
message Blob {
// The first two field types + numbers must match those in Digest
string hash = 1;
int64 size_bytes = 2; // This refers to the uncompressed size
reserved 3, 4; // Leave some room in case we want to extend Digest later(?)
bytes data = 5; // Possibly compressed data, if set
Compressor.Value compressor = 6; // Encoding used, if the data field is set
}
For "request" messages that can be responded to with inlined data, we would add a repeated field that specifies the compression types that the client accepts. The server would pick one of those encodings (or identity/uncompressed) for each Blob that it chooses to inline data for. We would also add a new field in the Capabilities API for servers to advertise support for this.
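For illustration, the request and capabilities changes might look something like this (field names and numbers here are hypothetical, not part of the current API):

```protobuf
message BatchReadBlobsRequest {
  string instance_name = 1;
  repeated Digest digests = 2;
  // New (hypothetical): compressors the client can decode for inlined
  // data, in order of preference. Empty means identity (uncompressed) only.
  repeated Compressor.Value acceptable_compressors = 3;
}

message CacheCapabilities {
  // ...existing fields elided...
  // New (hypothetical): compressors the server may apply to inlined data.
  repeated Compressor.Value supported_inlined_compressors = 10;
}
```

The server would pick at most one of the advertised encodings per inlined Blob, falling back to identity when there is no overlap.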
I believe this would be backwards-compatible: old clients would not request compressed inlined data, and would receive Blobs that are decodable as Digests in response; servers would have a clear signal for when this feature can be used in responses.
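The backwards-compatibility claim rests on proto3 parsers skipping unknown fields: a serialized Blob (fields 1, 2, 5, 6) still decodes as a valid Digest (fields 1, 2). A minimal Python sketch of the wire format (hand-rolled varints, no protoc; all values are made up for illustration) shows the mechanism:

```python
# Sketch: proto3 wire-format round trip, assuming a Digest-only parser
# that skips fields it does not know (here, Blob's data=5, compressor=6).

def varint(n):
    """Encode a non-negative int as a proto varint."""
    out = b""
    while True:
        low = n & 0x7F
        n >>= 7
        out += bytes([low | (0x80 if n else 0)])
        if not n:
            return out

def read_varint(buf, i):
    """Decode a varint starting at index i; return (value, next index)."""
    shift = value = 0
    while True:
        b = buf[i]; i += 1
        value |= (b & 0x7F) << shift
        if not b & 0x80:
            return value, i
        shift += 7

def field(num, wire, payload):
    """Prefix a payload with its (field number, wire type) tag."""
    return varint((num << 3) | wire) + payload

# A serialized "Blob": hash=1, size_bytes=2, data=5, compressor=6.
blob = (
    field(1, 2, varint(4) + b"abcd")   # hash (length-delimited)
    + field(2, 0, varint(1000))        # uncompressed size (varint)
    + field(5, 2, varint(3) + b"zzz")  # inlined (compressed) data
    + field(6, 0, varint(2))           # compressor enum value
)

def parse_digest(buf):
    """A Digest-only parser: keeps fields 1 and 2, skips everything else."""
    digest, i = {}, 0
    while i < len(buf):
        tag, i = read_varint(buf, i)
        num, wire = tag >> 3, tag & 7
        if wire == 0:                    # varint
            val, i = read_varint(buf, i)
        elif wire == 2:                  # length-delimited
            length, i = read_varint(buf, i)
            val, i = buf[i:i + length], i + length
        else:
            raise ValueError("wire type not handled in this sketch")
        if num == 1:
            digest["hash"] = val.decode()
        elif num == 2:
            digest["size_bytes"] = val
    return digest

print(parse_digest(blob))  # {'hash': 'abcd', 'size_bytes': 1000}
```

The extra Blob fields are consumed and discarded, so an old client sees exactly the Digest it expects.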
Is anyone else interested in this feature?
Heya Mostyn,
We'd be interested in compressed inlined blobs too, yep! We've encountered builds ourselves where it's relevant, with the batch APIs moving a non-trivial percentage of the total bytes (builds particularly heavy in small files).
For implementation details, what's the benefit of introducing a Blob message to wrap these fields vs just adding compressor where necessary? Specifically, I could see three main variants:
- (1) Just stick a compressor field beside bytes where relevant (BatchUpdateBlobsRequest, BatchReadBlobsResponse, OutputFile, ActionResult x2 for std*_raw).
- (2) Use Blob to wrap bytes and compressor, but not the Digest fields.
- (3) Encapsulate Blob as a replacement for Digest, as you've described.
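As a rough sketch of variant (1), using BatchReadBlobsResponse as an example (the new field number is hypothetical):

```protobuf
message BatchReadBlobsResponse {
  message Response {
    Digest digest = 1;
    bytes data = 2;
    google.rpc.Status status = 3;
    // New (hypothetical): encoding of `data`. Leaving it unset means
    // IDENTITY, which preserves the existing behaviour for old peers.
    Compressor.Value compressor = 4;
  }
  repeated Response responses = 1;
}
```

The same one-field addition would apply at each of the other inlined-bytes sites listed above.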
Looking at the API as it currently stands, I think I'd be in favour of (1) for V2 - that way we don't introduce any deprecated fields or branch points, and "clean" support simply looks like supporting the compressor value for each bytes field, even if the opposite party never sets it to anything non-passthrough. The use of Capabilities + request fields should, I think, make it easy to handle backwards compatibility safely (so no bytes fields risk being misinterpreted).
(2) or (3) seems nicest for V3, so the API can clearly express that wherever inlined bytes are present, they might be compressed. I guess (3) has the benefit of expressing an "optionally inlined blob" in a single field, rather than needing two for Digest + optional Blob? I could get behind that, though I think then I'd want to see Digest nested inside Blob instead of hash + size_bytes, so there are no places in the API where a raw hash field may show up except the Digest message itself.
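For illustration, that nested-Digest shape might look like the following (a hypothetical V3 sketch, not a concrete proposal):

```protobuf
// Sketch: Blob composes Digest rather than duplicating its fields, so
// the raw hash appears nowhere in the API except the Digest message.
message Blob {
  Digest digest = 1;                // identity of the (uncompressed) blob
  bytes data = 2;                   // inlined payload, possibly compressed
  Compressor.Value compressor = 3;  // encoding of `data`, if data is set
}
```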
I have mostly been considering this from the "cleaner v3 design, retrofitted to v2" angle (3), rather than the "minimal required changes to v2" point of view (1). (2) doesn't sound like a great compromise to me, but I'd need to read up more on proto3 to understand the implications well.
There are benefits to each of (1) and (3), of course, and I'd be happy to sketch out solutions to both, or either one if there's a clear preference in the group.
For the path to adding it to V2, I think (1) stands to be cleaner and not lean on reinterpreting messages as other messages, so I'd be interested in getting your take on it.
I think that can be independent of what exact pattern we'd pick for V3, so I'd suggest treating them as independent decisions. I don't feel terribly strongly about what pattern V3 *should* use, beyond being skeptical of inlining Digest fields into anything :).
No need to evaluate (2) if nothing points to it being best for either.
Thanks Mostyn!
We'd be interested in this too; we definitely observe a lot of builds with plenty of small but compressible artifacts (Go tends to be pretty heavy on this).
Agreed that both (1) and (3) have benefits; (2) is worth mentioning for completeness, but I agree it's unlikely to be appealing as a solution. I think (1) is my favourite; (3) is a little weird from the perspective of the generated code, given you'd have a thing that is like a Digest but is not actually one (so it'd be fine in languages like Python, but not in, say, Go or Java).
I like the idea in general of encapsulating all blobs/digests with something like this Blob message as a V3 change; that seems more elegant than lots of individual sites re-inventing the same thing. But that doesn't seem practical for V2 and the embedded Request/Response messages seem like reasonable things to extend with the compressor field.
Created a couple of draft PRs to gather opinions:
- (1) https://github.com/bazelbuild/remote-apis/pull/202
- (3) https://github.com/bazelbuild/remote-apis/pull/203