Compression support for inlined data (v2-compatible)

Open mostynb opened this issue 4 years ago • 5 comments

We added support for compressed blobs via the bytestream API a while back (#147), and this has been working well so far. Given that we don't seem to be gearing up for a non-backwards-compatible REAPIv3 release, I think we should start investigating adding backwards-compatible support for compression of inlined data, aiming to minimize roundtrips while still using compressed data.

One idea that I have been thinking about is to create a new Blob message which has the same fields as Digest, plus two additional optional fields: bytes data to hold the inlined data and Compressor.Value compressor to describe the encoding. Then, wherever a message currently has a Digest alongside a separate field for inlined (uncompressed) data, we would replace the Digest with a Blob field and deprecate the old data field.

For example:

message Blob {
  // The first two field types + numbers must match those in Digest
  string hash = 1;
  int64 size_bytes = 2; // This refers to the uncompressed size

  reserved 3, 4; // Leave some room in case we want to extend Digest later(?)

  bytes data = 5; // Possibly compressed data, if set
  Compressor.Value compressor = 6; // Encoding used, if the data field is set
}

For "request" messages that can be responded to with inlined data, we would add a repeated field that specifies the compression types that the client accepts. The server would pick one of those encodings (or identity/uncompressed) for each Blob that it chooses to inline data for. We would also add a new field in the Capabilities API for servers to advertise support for this.

I believe this would be backwards compatible: old clients would not request compressed inlined data and would receive Blobs that are decodable as Digests in response, and servers would have a clear signal for when this feature can be used in responses.

Is anyone else interested in this feature?

mostynb avatar Jul 12 '21 19:07 mostynb

Heya Mostyn,

We'd be interested in compressed inlined blobs too, yep! We've encountered some builds ourselves where it's relevant, with the batch APIs moving a non-trivial percentage of the total bytes (builds particularly heavy in small files).

For implementation details, what's the benefit of introducing a Blob message to wrap these fields vs just adding compressor where necessary? Specifically, I could see three main variants:

  1. Just stick a compressor field beside bytes where relevant. (BatchUpdateBlobsRequest, BatchReadBlobsResponse, OutputFile, ActionResult x2 for std*_raw).
  2. Use Blob to wrap bytes and compressor, but not the Digest fields.
  3. Encapsulate Blob as a replacement for Digest, as you've described.

Looking at the API as it currently stands, I think I'd be in favour of (1) for V2 - that way we don't introduce any deprecated fields or branch points, and "clean" support simply looks like supporting the compressor value for each bytes field, even if the other party will never set it to anything non-passthrough. The use of Capabilities + request fields should, I think, make it easy to handle backwards compatibility safely (so no bytes fields risk being misinterpreted).
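To make (1) concrete, the batch read case would look roughly like the following sketch (existing fields elided, field numbers illustrative only):

message BatchReadBlobsResponse {
  message Response {
    // ...existing digest, data and status fields unchanged...

    // New: how `data` is encoded. Unset/identity means uncompressed,
    // which is all an old server will ever send.
    Compressor.Value compressor = 4;
  }
  repeated Response responses = 1;
}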

(2) or (3) seems nicest for V3, so the API can clearly express that wherever inlined bytes are present, they might be compressed. I guess (3) has the benefit of expressing an "optionally inlined blob" in a single field, rather than needing two fields for Digest + optional Blob? I could get behind that, though I think then I'd want to see Digest nested inside Blob instead of hash + size_bytes, so there are no places in the API where a raw hash field may show up except the Digest message itself.
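For V3, that nested shape could be as simple as the following (a sketch only, with no attempt at wire compatibility with Digest, since a V3 message wouldn't need it):

message Blob {
  // Digest of the (uncompressed) content.
  Digest digest = 1;

  // Possibly compressed inlined data, if set.
  bytes data = 2;

  // Encoding used for `data`, if the data field is set.
  Compressor.Value compressor = 3;
}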

EricBurnett avatar Jul 12 '21 21:07 EricBurnett

I have mostly been considering this from the "cleaner v3 design, retrofitted to v2" angle (3), rather than the "minimal required changes to v2" point of view (1). (2) doesn't sound like a great compromise to me, but I'd need to read up more on protobufv3 to understand the implications well.

There are benefits to each of (1) and (3) of course, and I'd be happy to sketch out solutions to both, or either one if there's a clear preference in the group.

mostynb avatar Jul 12 '21 22:07 mostynb

For the path to adding it to V2, I think (1) stands to be cleaner and not lean on reinterpreting messages as other messages, so I'd be interested in getting your take on it.

I think that can be independent of what exact pattern we'd pick for V3, so I'd suggest treating them as independent decisions. I don't feel terribly strongly about what pattern V3 *should* use, beyond being skeptical of inlining Digest fields into anything :).

No need to evaluate (2) if nothing points to it being best for either.

Thanks Mostyn!

EricBurnett avatar Jul 12 '21 22:07 EricBurnett

We'd be interested in this too; we definitely observe a lot of builds with plenty of small but compressible artifacts (Go tends to be pretty heavy on this).

Agreed that both (1) and (3) have benefits; (2) is worth mentioning for completeness, but I agree it's unlikely to be appealing as a solution. I think (1) is my favourite; (3) is a little weird from the perspective of the generated code, given that you have a thing that is like a Digest but is not actually one (so it'd be fine in languages like Python, but not in, say, Go or Java).

I like the idea in general of encapsulating all blobs/digests with something like this Blob message as a V3 change; that seems more elegant than lots of individual sites re-inventing the same thing. But that doesn't seem practical for V2 and the embedded Request/Response messages seem like reasonable things to extend with the compressor field.

peterebden avatar Jul 13 '21 09:07 peterebden

Created a couple of draft PRs to gather opinions:

  • (1) https://github.com/bazelbuild/remote-apis/pull/202
  • (3) https://github.com/bazelbuild/remote-apis/pull/203

mostynb avatar Jul 13 '21 12:07 mostynb