distribution-spec Add registry proxying section

Define repository namespace query parameter for proxying.

Closes #12

Giving time for registry operators to weigh in

Maintainer approval

[x] @dmcgowan
[ ] @jdolitsky
[x] @jzelinskie
[ ] @mikebrow
[x] @stevvooe
[ ] @vbatts

Jul 26 '19 00:07 dmcgowan

Why do clients need to know anything about pull-through caching if it's implemented server-side?

Aug 09 '19 19:08 jzelinskie

Why do clients need to know anything about pull-through caching if it's implemented server-side?

The clients should know how the registry host was resolved from a given image reference. The clients don't care how the server is implemented, but they SHOULD provide information to the server which indicates what the reference being asked for is. Just as an HTTP client connecting to a proxy server must communicate what the upstream server is, the same is true here. Today the protocol doesn't define any way to communicate what the upstream is, and proxies end up being hardcoded to a single upstream. In a few cases you can see proxies use custom domains per upstream and require users to change the name of their images in order to use them.

Aug 09 '19 20:08 dmcgowan

Today the protocol doesn't define any way to communicate what the upstream is, and proxies end up being hardcoded to a single upstream.

Right... isn't that the point? If I encode that "myregistry/mynamespace/myrepo" goes to "upstream/foo/bar", that's a detail for the maintainer of myrepo and one that the client, ideally, doesn't need to know; the whole point is the puller thinks they are getting "myregistry/mynamespace/myrepo".

If the goal is to allow the client to specify "upstream/foo/bar", then I'd say that the target is not really a repository anymore, but simply a working proxy, and thus a different protocol parameter might be useful, but registries should therefore have the option to not support said parameter.

Aug 12 '19 13:08 josephschorr

Right... isn't that the point?

That is one use case that will still work. In the example you mentioned, when a repository is proxied in that fashion, the puller often does know of this detail as they must explicitly provide myregistry with the intent of getting some upstream content. The use case where myregistry is some sort of blessed version of upstream is reasonable, but not the intent of the namespace parameter here.

If the goal is to allow the client to specify "upstream/foo/bar"

This is the use case here, and "proxy" may be better terminology, but that is really a detail of the registry. The registry may act as a proxy, proxy-cache, or active mirror; that is out of scope for definition here. This parameter just enables all of those features to work across multiple namespaces. For example if you want public images from both docker.io/* and quay.io/* to be cached in the same registry proxy today, you would need the server to have two hostnames (something like docker.io.myproxy and quay.io.myproxy) then have clients configured to do that mapping for each namespace. This simple query parameter provides a much simpler option to clients and servers. If a server does not support it, it ignores the parameter. If a client is configured to send all requests to a server which does not support it, that is no different from any other misconfiguration by clients today.
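To make the comparison above concrete, here is a minimal Python sketch of how a client might build a manifest request against a single proxy for two upstream namespaces. It assumes the parameter is spelled `ns` (as it is referred to later in this thread) and uses the spec's `/v2/<name>/manifests/<reference>` pull path; the host `myproxy.example`, the repository names, and the helper `manifest_url` are hypothetical and not taken from the PR.

```python
from typing import Optional
from urllib.parse import urlencode


def manifest_url(proxy_host: str, name: str, reference: str,
                 ns: Optional[str] = None) -> str:
    """Build a manifest URL on proxy_host, optionally naming the upstream via ns."""
    url = f"https://{proxy_host}/v2/{name}/manifests/{reference}"
    if ns:
        # The namespace travels as a query parameter, so the path is unchanged.
        url += "?" + urlencode({"ns": ns})
    return url


# One proxy hostname serves both upstream namespaces (hypothetical names).
docker_url = manifest_url("myproxy.example", "library/ubuntu", "latest", ns="docker.io")
quay_url = manifest_url("myproxy.example", "coreos/etcd", "latest", ns="quay.io")
```

Without the parameter, the same helper yields a plain request, which is what a server that does not understand `ns` would effectively see anyway.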

Aug 13 '19 18:08 dmcgowan

For example if you want public images from both docker.io/* and quay.io/* to be cached in the same registry proxy today, you would need the server to have two hostnames (something like docker.io.myproxy and quay.io.myproxy) then have clients configured to do that mapping for each namespace

Or configure two repositories, one for each? (especially since combining them could lead to merge conflicts).

I'm concerned we're adding quite a bit of complexity to address a use case that has simpler solutions when configured on the registry side.

Aug 13 '19 19:08 josephschorr

I'm concerned we're adding quite a bit of complexity to address a use case that has simpler solutions when configured on the registry side.

Can you elaborate here? Configuring a repository for each mirror is non-trivial. Configuring a domain for each upstream and routing to the upstream based on the domain is not easier; that would still require the same routing on the server side that an implementation of this would require. The client-side implementation to support per-registry configuration is not simple and inherently requires catch-all conditions when trying to enforce proxying through a gateway.

I did do a client-side implementation of this to demonstrate the feature and to give server-side implementations a client to test against. On the client side, it is not complex at all since clients should already know how to handle 404s when multiple registry endpoints are configured. On the server side, the complexity to support this isn't much more than existing proxy-cache support.
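The 404 handling mentioned above could look roughly like the following Python sketch. It is not the client implementation referred to in this comment: the function `pull_with_fallback`, the injected `fetch` callable, and the endpoint names are all hypothetical, and treating any status other than 200 or 404 as a hard error is an assumption made for brevity.

```python
from typing import Callable, Iterable, Optional, Tuple

Response = Tuple[int, bytes]  # (HTTP status, body)


def pull_with_fallback(endpoints: Iterable[str],
                       fetch: Callable[[str], Response]) -> Optional[Tuple[str, bytes]]:
    """Try each configured endpoint in order; a 404 means "try the next one"."""
    for endpoint in endpoints:
        status, body = fetch(endpoint)
        if status == 200:
            return endpoint, body
        if status == 404:
            continue  # this endpoint does not have the content
        raise RuntimeError(f"{endpoint} returned unexpected status {status}")
    return None  # no endpoint had the content


def fake_fetch(endpoint: str) -> Response:
    """Stand-in for an HTTP request: only the second endpoint has the manifest."""
    known = {"registry-1.example": (200, b"manifest-bytes")}
    return known.get(endpoint, (404, b""))


result = pull_with_fallback(["myproxy.example", "registry-1.example"], fake_fetch)
```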

Aug 13 '19 20:08 dmcgowan

Configuring a repository for each mirror is non-trivial.

It's non-trivial, but it's not that difficult either :)

My concern remains around complexity: the document as outlined, for example, says that the ns should not be sent to non-mirroring registries... but how does the client know that? Is it the registry's job to report back an error if that argument is found but unsupported? How will clients know to be able to check for this capability?

If we feel that pass-through proxying of other registries is, in and of itself, a feature of the protocol (rather than something configured on the registry side), then I suspect we need to give significantly more thought to the end-to-end user experience. For example, I could imagine some paths supporting proxying and others not.

Aug 13 '19 20:08 josephschorr

how does the client know that?

The clients have the most context, and this really does not need to be defined here, only that a client SHOULD make that distinction to avoid sending unnecessary redundant information. The clients themselves have both the configuration and the endpoint resolution logic, so they have multiple options for determining this. In the implementation I sent, I simply did this by checking whether the endpoint was configured without push support, as this could indicate that the registry being communicated with may not be the upstream source. However, I will probably add a check there for ns != host, since in some setups (such as with a Kubernetes runtime) there are never push configurations. Either way, this is trivial and not required.
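One possible reading of the two checks described above, as a small Python sketch. The function name `should_send_ns` and its parameters are hypothetical, and combining both checks with a logical "and" is an assumption; the implementation mentioned in this comment may differ.

```python
def should_send_ns(endpoint_host: str, endpoint_can_push: bool, ns: str) -> bool:
    """Decide whether to attach the ns parameter to requests for this endpoint."""
    if endpoint_can_push:
        # An endpoint configured for push is presumably the upstream source itself.
        return False
    # With pull-only endpoints, only send ns when it adds information,
    # i.e. when the endpoint is not the namespace host.
    return ns != endpoint_host


# A pull-only proxy in front of docker.io gets the parameter ...
send_to_proxy = should_send_ns("myproxy.example", False, "docker.io")
# ... while a pull-only connection straight to the namespace host does not.
send_to_origin = should_send_ns("docker.io", False, "docker.io")
```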

Is it the registry's job to report back an error if that argument is found but unsupported?

No, the registry can simply ignore it. This is like asking a registry today which was configured to mirror docker.io to return an error if the client actually meant quay.io; the registry just isn't expected to have the same amount of context as a client in regards to the intent of the entire pull process. If the registry chooses to handle the ns parameter and not support it, it is as easy as returning a 404 for unconfigured upstreams.
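The two server behaviors described above (ignore the parameter, or answer 404 for an unconfigured upstream) can be sketched in Python as follows. The function `select_upstream`, the `honor_ns` switch, and the upstream table are hypothetical illustrations, not part of the proposed spec text; a return value of `None` stands for "respond with 404".

```python
from typing import Dict, Optional


def select_upstream(ns: Optional[str], upstreams: Dict[str, str],
                    default: str, honor_ns: bool = True) -> Optional[str]:
    """Pick the upstream for a request; None means the caller should return 404."""
    if not honor_ns or ns is None:
        # A registry that ignores ns behaves exactly as it does today.
        return default
    # A registry that handles ns serves only the upstreams it is configured for.
    return upstreams.get(ns)


# Illustrative configuration only.
DEFAULT = "https://registry-1.docker.io"
UPSTREAMS = {"docker.io": DEFAULT, "quay.io": "https://quay.io"}

configured = select_upstream("quay.io", UPSTREAMS, DEFAULT)
unconfigured = select_upstream("gcr.io", UPSTREAMS, DEFAULT)
ignored = select_upstream("gcr.io", UPSTREAMS, DEFAULT, honor_ns=False)
```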

How will clients know to be able to check for this capability?

They aren't expected to check for it, but rather to be explicitly configured for it. A client will know if it is configured to always use a specific mirror or a mirror for multiple namespaces. I think what you are suggesting here though is the idea of registry discovery. That is a much larger topic that I would still love to see happen; with that feature, a client could start with zero knowledge (except of course the domain: quay.io, docker.io, etc.) and discover registry capabilities and endpoints.

Aug 13 '19 20:08 dmcgowan

Discussion ensued on the call today. This sounds like a decent addition, but it needs a clear use case for the behavior and a decision on whether a registry implementation MUST support it.

Aug 14 '19 21:08 vbatts

pinging @thomasmckay and @kurtismullins who are implementing mirroring on Quay -- they probably have feedback and want to track this thread

Aug 14 '19 22:08 jzelinskie

Pulp container plugin team will want to keep an eye on this thread as well. Any feedback @ipanova @dkliban @asmacdo ?

Aug 15 '19 14:08 RCMariko

is ns already used, so best to continue with that mnemonic? or could it be something that does not collide with the outdated concept that images would only be named "transport/namespace/name:tag"?

Aug 15 '19 14:08 vbatts

@vbatts I use ns or namespace here because namespace is a very generic concept. Certainly as a generic concept it has been used to mean many things. Generally, however, namespace would refer to additional context (such as a prefix) on another name. In the distribution spec case, the name given to the registry would be the part in the URL path; the namespace just gives that additional context to the name. Existing distribution clients today parse the name as you described, <sometransport/host/whatever>/<name given to the registry>, in which case, when given just <name given to registry>, the <sometransport/host/whatever> would be the namespace of that. You could continue to divide those parts into smaller names in other namespaces, such as the Docker hub does with usernames/reponames, but that is out of scope here. Capturing this in elegant words is kind of tough; recommendations on which parts are unclear or how to make it better are appreciated.
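As a rough illustration of that split, the following Python sketch separates a fully qualified reference into the namespace and the name given to the registry. The helper `split_reference` is hypothetical, and it deliberately assumes the first path component is always the host; real clients apply extra rules (default hosts, tags, digests) that are ignored here.

```python
from typing import Tuple


def split_reference(ref: str) -> Tuple[str, str]:
    """Split "<host>/<name>" into (namespace, name given to the registry)."""
    ns, sep, name = ref.partition("/")
    if not sep or not ns or not name:
        raise ValueError(f"expected <host>/<name>, got {ref!r}")
    return ns, name


# The name may itself contain further slashes, e.g. a user/repo pair.
ns, name = split_reference("docker.io/library/ubuntu")
```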

Aug 19 '19 22:08 dmcgowan

@dmcgowan one thing i'm unclear on here is: can i have a single registry mirror that will be usable for more than one remote registry (i.e. remote of docker.io/..., quay.io/..., etc)?

Dec 16 '19 17:12 vbatts

Notes from today's call:

(jz) quay does something different for mirroring. This should be called "proxying"
(dmcg) this is really for client side caching proxy, and is needed.

Please let's find a way to classify this language (whether client or server side), so we can close out or merge this.

Apr 01 '20 21:04 vbatts

Reiterating our convo from the meeting:

I actually see a lot of value in adding this query parameter, but removing any connotation that it is the blessed solution for repository mirroring. I think that by including this value, a proxy could implement lots of different behavior for the client that need not be directly related to repository mirroring.

Apr 01 '20 22:04 jzelinskie

I note that harbor uses the term "replication" rather than proxy-cache or mirroring, which I quite like https://goharbor.io/docs/1.10/administration/configuring-replication/

Apr 09 '20 13:04 amouat

Updated section to remove use of the term "mirror". A mirror is just a special case of a proxy or could be a completely different configuration. For example, the most common "mirroring" case which is used by Docker is actually a pull-through cache. Using the more generic terminology here instead of continuing the confusion :)

Jun 26 '20 04:06 dmcgowan

LGTM

Jun 29 '20 20:06 jzelinskie

big time rebase needed @dmcgowan

Feb 17 '22 18:02 jdolitsky

Pushing this out to 1.2 while we figure out how to properly word this

Feb 17 '22 18:02 jdolitsky