distribution-spec icon indicating copy to clipboard operation
distribution-spec copied to clipboard

Proposal: Deprecate single-request monolithic blob uploads before 1.0

Open jonjohnsonjr opened this issue 3 years ago • 11 comments

background

It is possible to upload a blob in a single POST, per the spec, called a monolithic upload.

It is also possible to upload a blob via POST/PUT, which is also called a monolithic upload.

I'm talking about the first one. I call this "single request monolithic upload", which may be abbreviated as "SRMU" in the rest of this proposal.

I've put specific, actionable proposals in bold.

I can send a PR that implements my proposal, but I'd like to solicit some feedback first to see if anyone feels strongly opposed.

proposal

  1. We should remove SRMUs from the spec. As with catalog, the lack of broad support and high cost make it unfit for the spec.
  2. We should clarify the ambiguity between 201 and 202 response codes, whether we deprecate it or not.

(Optional aside (3): We should have a deprecation document. At one point we deprecated catalog, but I can't seem to find reference to it anywhere after https://github.com/opencontainers/distribution-spec/pull/178. It can be helpful to clearly document the differences, as in https://github.com/opencontainers/image-spec/pull/817)

rationale

ambiguity

This API is not well defined, so it is unclear how clients should interpret the registry's response.

For cross-repo mounting, if the mount failed for any reason, the registry will return a 202 that indicates you should proceed with the upload as usual. If the mount succeeds, the registry will return a 201, and clients can skip the upload.

For SRMUs, we don't have such a distinction... from the current spec at HEAD:

Successful completion of the request MUST return either a 201 Created or a 202 Accepted

See https://github.com/opencontainers/distribution-spec/pull/230#discussion_r568762650, where a test enforcing the spec (202 is allowed) would pass for a registry even though it doesn't support it.

At the very least, we should update the spec to say that clients should interpret 201 as "upload succeeded" and 202 as "proceed with upload at Location"

This has the nice failure mode for registries that don't support SRMUs. If a registry doesn't bother to check the request body or ?digest querystring parameter of POST requests, they will likely just return a 202 with a Location, as per a normal chunked upload. Clients that see this should probably assume that the registry does not support SRMU in this case, and proceed with a chunked upload or POST/PUT monolithic upload.

lack of support

From https://github.com/opencontainers/distribution-spec/issues/211#issuecomment-726312919

The POST/PATCH/PUT chunked uploading strategy is supported by every registry I've encountered (probably because that's what docker did), so we've modified our clients to only use this flow. Monolithic support may have caught up in the meantime, but as of a couple years ago, there were several smaller registries that did not support monolithic blob uploads.

I believe containerd uses the POST/PUT monolithic upload strategy, so that is probably also widely supported (though not as widely as chunked).

"A single POST request" should possibly be removed

In the original spec, this method of blob upload was only barely mentioned as an optional thing for the blob upload initiation:

Optionally, if the digest parameter is present, the request body will be used to complete the upload in a single request.

I worry that not a lot of registries support this.

Deprecation should not imply that registries that currently support this stop, so any clients that rely on a specific registry implementing this should still just work. I know of a single client that actually uses this upload mechanism, so there is very low cost to deprecating it. It's also a very easy fix, client-side, if a registry decided to stop supporting it.

I've just tried to perform this against Docker Hub, and it doesn't work. The upload POST response has a 202 status with a Location header, and the subsequent manifest PUT fails with MANIFEST_BLOB_UNKNOWN, presumably because the SRMU failed.

GCR does support this, but I'd rather it not.

(I have not tried other registries, feel free to pile on with support or lack of support on your favorite registry.)

cost

Also from https://github.com/opencontainers/distribution-spec/issues/211#issuecomment-726312919

This is going to be the most expensive blob upload method to implement for registries because there is no opportunity for the registry to offload the byte ingestion to a separate service (via the Location header, as in the other methods).

I'd argue for removing it entirely, but would be happy if there were some caveats around its use. Derek mentioned maybe we could add something about "optimizations" like this, maybe in the FAQ from vsoch. There's no discussion in the spec about why you would choose one of these methods over another, which could go in that FAQ. No matter what we decide, this should definitely not be the first upload method that readers encounter.

For some more discussion of why the SRMU is bad, see this great post from backblaze.

Specifically relevant:

The interface to upload data into Amazon S3 is actually a bit simpler than Backblaze B2’s API. But it comes at a literal cost. It requires Amazon to have a massive and expensive choke point in their network: load balancers. When a customer tries to upload to S3, she is given a single upload URL to use. For instance, http://s3.amazonaws.com/. This is great for the customer as she can just start pushing data to the URL. But that requires Amazon to be able to take that data and then, in a second step behind the scenes, find available storage space and then push that data to that available location. The second step creates a choke point as it requires having high bandwidth load balancers. That, in turn, carries a significant customer implication; load balancers cost significant money.

This is analogous to what we have in the registry spec. The registry is responsible for handling the blob ingress and cannot easily hand off that traffic to another service via the Location header.

jonjohnsonjr avatar Feb 04 '21 20:02 jonjohnsonjr

@jdolitsky relevant to the PR

@dmcgowan may have more historical context for this

client authors: please let me know if you disagree with my assessment of SRMU use in the wild

registry operators: please let me know if you disagree with this, if you do or do not support SRMU, and if you would support deprecation of SRMU

jonjohnsonjr avatar Feb 04 '21 20:02 jonjohnsonjr

If majority of clients do not use it, I have no issue with removing

jdolitsky avatar Feb 04 '21 21:02 jdolitsky

Please keep in mind that for a on-premise use case it is not unusual to have (for example) a private Github instance, a private Jenkins instance, a private registry and worker nodes to roll out deployments, all "internal" to a k8s environment.

Thinking about this some more, if the use cases and tradeoffs are so different between cloud operators and on-premise, maybe it needs to be captured/reflected so in the dist-spec. Just a thought!

rchincha avatar Feb 08 '21 17:02 rchincha

@rchincha I'm not sure I follow -- can you explain how on-prem use cases relate to monolithic uploads?

jonjohnsonjr avatar Feb 08 '21 22:02 jonjohnsonjr

@rchincha I'm not sure I follow -- can you explain how on-prem use cases relate to monolithic uploads?

With regards to your earlier Backblaze comment why SRMU is bad. The main concern appears to be that without any way to indicate to the client as to where its (possibly large) upload can be routed, hurts the cloud operators to support the spec in its current form. How many customers actually do SRMU and why?

But in use cases where it could be supported, perhaps it provides a better UX (response times, single cURL request, etc) - building/pushing images frequently, which happens when you have a growing team of developers and projects, and just for this situation UX trumps infra costs (on-premise)

+1 for this version of the proposal

At the very least, we should update the spec to say that clients should interpret 201 as "upload succeeded" and 202 as "proceed with upload at Location"

rchincha avatar Feb 08 '21 23:02 rchincha

How many customers actually do SRMU and why?

I am only aware of one client that does this (by default). Not sure why, but I would guess because sending one HTTP request is easier than sending two 😄

perhaps it provides a better UX (response times, single cURL request, etc) - building/pushing images frequently, which happens when you have a growing team of developers and projects, and just for this situation UX trumps infra costs (on-premise)

Yeah, I think the curl example is the most compelling. Latency is a valid concern, but I think the redirect is usually a pretty small portion of the overall upload, and the trade-off for scalability seems pretty obvious most of the time. That's one reason I want to just move it to a "deprecated" place to discourage its use, rather than just delete it.

My hope is that by documenting that clients need to handle a 202, clients will be able to fall back to non-SRMU for registries that don't support it (many). For registries that do support it, the client should be happy, but for registries that don't support it, they can just continue through the normal path without any (additional) overhead.

jonjohnsonjr avatar Feb 08 '21 23:02 jonjohnsonjr

Oh and I forgot to respond to the original point re: on-prem...

The multi-request upload path allows for the registry to hand off ingress to a separate service, but doesn't require it. As a response to the POST that initiates a blob upload, the Location returned by the registry can just point back to itself, and you can easily handle both the uploads and the other registry APIs with a single service, so I don't think there's really an issue here beyond the ux/latency concerns from your second comment.

jonjohnsonjr avatar Feb 09 '21 18:02 jonjohnsonjr

Relavent conformance test: https://github.com/opencontainers/distribution-spec/blob/master/conformance/02_push_test.go#L63

jdolitsky avatar Feb 10 '21 22:02 jdolitsky

At the very least, we should update the spec to say that clients should interpret 201 as "upload succeeded" and 202 as "proceed with upload at Location"

This sounds like the best behavior to me

dmcgowan avatar Feb 10 '21 22:02 dmcgowan

dropping from milestone since #239 merged

jdolitsky avatar Mar 05 '21 23:03 jdolitsky

Two points, if you keep it in the spec I would strongly warn about using it for too large entities and also mandate that Expect: Continue should be sent/supported, especially if the entities are larger than a few mb.

ecki avatar Jun 05 '22 20:06 ecki