Clarify PUT location with chunked upload
When uploading a blob in chunks, the spec says:
> To upload a chunk, issue a PATCH request to a URL path in the following format, and with the following headers and body:
>
> URL path: `<location>` [...]
>
> The `<location>` refers to the URL obtained from the preceding POST request.
This implies to me that `<location>` can change with each successive request, so there are potentially n + 1 locations in play for n chunks (one for the initial POST and one for each PATCH).
The final PUT request is documented as follows:
> To close the session, issue a PUT request to a url in the following format, and with the following headers (and optional body, depending on whether or not the final chunk was uploaded already via a PATCH request):
>
> `<location>?digest=<digest>`
This doesn't make it clear which `<location>` should be used. Should it be the location returned by the most recent PATCH request, or the `<location>` returned by the original POST?
My understanding is that the last `<location>` value is always the one to use for the next request, and registries will reject the usage of older values. It's certainly worth clarifying.
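To illustrate that rule, here is a minimal client-side sketch that always carries the most recent Location forward through the POST → PATCH → PUT sequence. This is an assumption-laden illustration, not a normative implementation: the helper names are hypothetical, absolute Location headers are assumed, and error handling is omitted.

```python
import urllib.request

def chunk_ranges(total, chunk_size):
    """Yield inclusive (start, end) byte offsets for Content-Range headers."""
    for start in range(0, total, chunk_size):
        yield start, min(start + chunk_size, total) - 1

def push_blob_chunked(base_url, repo, blob, digest, chunk_size=1 << 20):
    # Hypothetical sketch: registry URL layout per the distribution spec,
    # but helper structure and defaults are this example's own choices.
    def send(method, url, body=None, headers=()):
        req = urllib.request.Request(url, data=body, method=method,
                                     headers=dict(headers))
        with urllib.request.urlopen(req) as resp:
            return resp.headers

    # 1. POST opens the upload session; keep only the returned Location.
    headers = send("POST", f"{base_url}/v2/{repo}/blobs/uploads/")
    location = headers["Location"]

    # 2. PATCH each chunk to the *latest* Location, replacing it with the
    #    Location from each PATCH response (older values may be rejected).
    for start, end in chunk_ranges(len(blob), chunk_size):
        headers = send("PATCH", location, body=blob[start:end + 1], headers={
            "Content-Type": "application/octet-stream",
            "Content-Range": f"{start}-{end}",
        })
        location = headers["Location"]

    # 3. PUT to the last Location (the n+1-th in play) to close the session.
    sep = "&" if "?" in location else "?"
    send("PUT", f"{location}{sep}digest={digest}")
```

In this reading, the client never needs to remember more than one Location at a time; each response's Location supersedes the previous one.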
That certainly seems to be the case in the registries I've tried. The other language that I believe could do with clarifying is this:
> To get the current status after a 416 error, issue a GET request to a URL
>
> The `<location>` refers to the URL obtained from any preceding POST or PATCH request.
That "any" sounds like it's OK to use the location from any of the sequence of previous POST or PATCH requests, but that appears not to be the case: when experimenting with the docker registry, it requires the most recent location.
There's actually a potential problem with that AFAICS: if a client is in the middle of a large upload and a network outage prevents a response from reaching it, the client might not be able to resume, because it won't have the latest location value as expected by the server. I guess it's too late to change that now.
I wouldn't say things are too late to change. See #366 that was adding to this recently. Chunked uploads aren't well supported: the docker engine uses them with a single large chunk when pushing images, and I'm not aware of any client tooling that defaults to chunked uploads. So this is one of the safer areas to clarify without risking breaking existing use cases.
@sudo-bmitch One other possible change to make in that area would be to specify that the 416 response itself could contain information sufficient to make another correct PATCH request; for example, it could contain Location and Range headers like the GET response. That would avoid the need for the extra round trip in most cases, AFAICS.
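To sketch what that could look like on the client side, here is a hypothetical helper that takes the headers of an upload-status response, whether from a follow-up GET or, as suggested, from the 416 response itself, and computes where to resume. The `0-<last>` Range shape follows the spec's status-response description; the helper name and everything else here is an assumption:

```python
def resume_point(status_headers):
    """Return (latest_location, next_offset) from an upload-status response.

    Assumes a Range header of the form "0-<last>", where <last> is the
    final byte offset the registry has received (inclusive), and a
    Location header carrying the latest session URL.
    """
    location = status_headers["Location"]
    _, last = status_headers["Range"].split("-")
    return location, int(last) + 1  # the next PATCH starts at this offset
```

A client could then issue its next PATCH against the returned location starting at the returned offset, skipping the extra GET round trip whenever the 416 already carries these headers.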
Somewhat related, the Location set for chunked uploads could be some remote store, eg: S3 (according to the text below):

> Here, `<location>` is a pullable blob URL. This location does not necessarily have to be served by your registry, for example, in the case of a signed URL from some cloud storage provider that your registry generates.
But I'm wondering, how do we know where to send the final PUT to, to close the upload session with the Registry (so that it knows the upload is complete and can do a digest check)?
The spec also mentions:
> A chunked blob upload is accomplished in three phases:
>
> 1. Obtain a session ID (upload URL) (`POST`)
> 2. Upload the chunks (`PATCH`)
> 3. Close the session (`PUT`)
However, 1. does not appear to give a "Session ID", but rather a Location to continue the upload with. So when using pre-signed URLs for the chunked uploads, there is no "session" to close.
Illustration:

```mermaid
sequenceDiagram
    participant Client
    participant Registry
    participant S3 as Object Store (S3)
    Client ->> Registry: POST /v2/.../blobs/upload
    Registry ->> S3: Request a pre-signed url
    Registry ->> Client: 202 Accepted, Location: https://objects.example.com/foo?secret=bar
    Client ->> S3: POST /foo?secret=bar
    Client ->> Registry: PUT /v2/.../blobs/upload/SOMETHING?digest=baz
```
I'm happy to open a new issue if this is too unrelated to this issue.
@NickLarsenNZ are there existing registries that push (not pull) directly to an external S3 store today? If so, how does the registry verify the digest of the pushed blob? And does the S3 server implement the registry spec for required responses, chunked requests, etc?
> are there existing registries that push (not pull) directly to an external S3 store today?
I have no idea, other than what I am implementing (and wanting to support pre-signed URLs).
> If so, how does the registry verify the digest of the pushed blob?
I would assume the registry has access to the object store to verify digests. I am unsure yet whether object stores like S3 can provide blob digests; if not, the registry would need to pull the blob back to check its digest.
> And does the S3 server implement the registry spec for required responses, chunked requests, etc?
I would expect the object store (or whatever is handling the uploads) not to have to conform to this spec; that would be on the Client and Server.

However, I do see where that could become a problem: for example, if the object store requires POST instead of PATCH, the client (conforming to the spec) would still have to make a PATCH request.
So I think one of these options needs to be chosen:

- Update the spec to mention that the URL doesn't have to be for the registry, but it does need to conform to this spec to handle the `PATCH` requests.
  - There still needs to be clarity around what the final PUT should be. IMO, the spec should be providing a Session-ID, not just a `Location` for a session ID to be extracted from.
- Remove the statement about the Location possibly being something that isn't the registry.
- Update the spec so that any kind of upload can happen in between the `POST` and `PUT`.
  - But that would require further specification so the client can know what is supported (eg: if the registry uses S3 signed URLs, the client needs to know what to do next). Maybe a hint could be given by the registry along with the Location header.
  - Still, clarity around the final `PUT` and where the Session-ID would come from needs to be provided.
I have updated my diagram to show the registry is requesting the pre-signed URL.
Redefining the spec to require new functionality for clients to work would break all existing clients and is something we try very hard to avoid. The spec is typically a trailing definition, documenting behaviors that have been shown to work by existing clients and servers. My focus in this issue is to understand how those existing clients and servers work so we can improve the spec language for new implementations to better work within the ecosystem.
Understood.
I'm interested to see if there is such an implementation in the wild that can use external storage.
On that note, is there any interest in using the learnings from the trailing-spec to begin working on a leading-spec (eg: v2)? I couldn't see any related issues, but would be interested in following along or taking part.