Splitting Experimental CRDs into separate API Group and Names
What type of PR is this? /kind feature
What this PR does / why we need it: This PR is a follow up to https://github.com/kubernetes-sigs/gateway-api/discussions/2844. As I've been considering how we'll handle the graduation of GRPCRoute, it's become clear to me that our current experimental and standard channel separation is flawed. This is an attempt to fix that.
Essentially the problem is that once someone chooses to install an experimental version of a CRD, they have no safe path to go back to standard channel. GRPCRoute did not cause this problem, but it did highlight it. Specifically, we'll need to include "v1alpha2" in our standard channel version of GRPCRoute simply to ensure that it can actually be installed in clusters that previously had GRPCRoute.
This PR proposes a big change. It moves all experimental channel CRDs to a separate API group gateway.networking.x-k8s.io and gives all resources an X prefix to denote their experimental status. This has the result of completely separating the resources. Practically that means that experimental and standard channel Gateways can coexist in the same cluster, but that the only possible migration path between channels involves recreating resources.
This would admittedly be annoying for controller authors, but I'm hoping only moderately. This approach relies on type aliases to minimize the friction. Here's what I'd expect most controllers to do:
- Watch standard channel by default, provide an option to watch experimental channel resources
- When watching experimental channel resources, funnel the results of both informers into shared logic (essentially everything from informer event handlers on down would be shared, would just need separate informers)
- Develop an updated naming scheme for generated resources to ensure that resources generated for experimental GW API resources do not collide with resources generated for standard channel. (This new approach would mean you could have standard and experimental channel Gateways of the same name for example).
Take a look at hack/sample-client in this PR for an overly simple example of using experimental and standard channel types together.
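To make the dual-informer idea above more concrete, here is a rough sketch of what funneling both channels into shared event handlers could look like with dynamic informers. This is not the sample client from this PR; the `gateway.networking.x-k8s.io` group, the `xgateways` resource name, and the handler wiring are assumptions based on this proposal.

```go
package main

import (
	"time"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

// GVRs for the standard Gateway and the hypothetical experimental xGateway
// proposed in this PR; the x-k8s.io group/resource names are assumptions.
var (
	standardGateways     = schema.GroupVersionResource{Group: "gateway.networking.k8s.io", Version: "v1", Resource: "gateways"}
	experimentalGateways = schema.GroupVersionResource{Group: "gateway.networking.x-k8s.io", Version: "v1alpha1", Resource: "xgateways"}
)

// startGatewayInformers watches standard channel by default and optionally
// adds a second informer for experimental resources; both feed the same
// event handlers, so everything from the handlers down is shared.
func startGatewayInformers(cfg *rest.Config, watchExperimental bool, stop <-chan struct{}) error {
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		return err
	}
	factory := dynamicinformer.NewDynamicSharedInformerFactory(client, 10*time.Minute)

	handlers := cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { /* enqueue for shared reconcile logic */ },
		UpdateFunc: func(oldObj, newObj interface{}) { /* enqueue for shared reconcile logic */ },
		DeleteFunc: func(obj interface{}) { /* enqueue for shared reconcile logic */ },
	}

	factory.ForResource(standardGateways).Informer().AddEventHandler(handlers)
	if watchExperimental {
		// Opt-in: experimental resources flow through the exact same handlers.
		factory.ForResource(experimentalGateways).Informer().AddEventHandler(handlers)
	}

	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	return nil
}

func main() {} // wiring of kubeconfig, workqueues, and reconcilers omitted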
All of this may sound like a huge pain, so why bother? I think this approach comes with some pretty important benefits:
- Very clearly signals that "experimental" resources are experimental and by extension not meant to be trusted in production
- Allows experimental and standard channel resources to coexist in the same cluster, allowing experimentation in one namespace and production ready standard channel usage in another
- Avoids the possibility of someone getting stuck on experimental channel. (Before this PR, it was impossible to safely migrate from experimental channel to standard channel, so moving to experimental channel was a one way operation).
This PR is still very much a WIP, opening it early to get some feedback on the direction.
Does this PR introduce a user-facing change?:
Experimental channel CRDs have been moved to a separate API group and now have `x` as a prefix for kind and resource names.
I'm not really a fan of a new experimental API group. It breaks client applications when something is promoted.
Not to pile on, but I'm also a -1 on this; separate groups make the resources in the channel separate resources for all intents and purposes. This feels like a big burden on controllers for not a lot of gain IMO. Why not include v1alpha2 as a served version while stripping out any alpha fields?
It breaks client applications when something is promoted.
What would this break? Presumably most controllers would still be supporting both experimental and standard channel even with this change.
The solution to the "alpha in standard" channel seems clear to me - just remove alpha from the standard CRDs. The downsides of that approach are far more palatable and only impact extremely niche cases, while this causes widespread pain.
This feels like a big burden on controllers for not a lot of gain IMO. Why not include v1alpha2 as a served version while stripping out any alpha fields?
I completely agree that this solution would be entirely overkill if we were just trying to solve for how to graduate GRPCRoute to standard channel. Although that's definitely where this thought process started, I think there are much more compelling reasons in favor of an approach like what I've proposed here.
Specifically I think experimental channel as it stands today is a trap that more and more people are going to fall into unless we make some kind of change. Here are the problems with the model today:
- There's no way to safely transition from experimental to standard. Let's say someone wanted to try out a new feature in experimental channel temporarily - now they're stuck on experimental channel in that cluster ~forever. The only truly safe path is to uninstall the experimental CRD and then install the standard one, but that's very disruptive.
- There's no way to just try out an experimental CRD in some portion of your cluster. You either try it for everything in your cluster or nothing. Per point 1, if you do try an experimental CRD, it's essentially a one-way transition as it stands today, there's no safe path back to stability.
- As we're seeing with GRPCRoute, standard channel CRDs have to adopt some characteristics of experimental channel CRDs, like providing alpha API versions, if we want it to be possible to install them in clusters that currently have experimental CRDs. This is the specific problem we're running into with GRPCRoute, but I think it's the least critical by far.
- Some providers like GKE have an option to install standard channel CRDs as part of cluster management. Based on discussions at KubeCon, it seems likely that this will start to extend to other cluster provisioners. It is ~impossible to install experimental channel CRDs in clusters with corresponding standard channel CRDs managed by the cluster provisioner.
- Although our versioning model expressly allows for breaking changes in experimental channel, it's likely that they would be very disruptive because some users have installed experimental channel CRDs without realizing it (same names, fields, etc.) and could end up with things inexplicably breaking on them the next time they upgrade CRDs.
In my opinion, this leaves us with a couple options:
- We can document all these problems and limitations of experimental channel. Unfortunately I think this would not be enough to keep people from getting burned on at least one of the problems described above. It would also likely significantly limit the usage of experimental channel CRDs. The success of Gateway API is entirely built on the idea that we can get feedback early via experimental channel, but that all goes away if no one uses it because it's so unsafe (see above).
- We can move forward with an approach like I've proposed here, introducing stronger separation between experimental and standard channel CRDs and largely resolving all of the problems I've described above. Admittedly this does come with some additional work for controller authors, but I'm hopeful that would be limited to setting up an additional set of informers, and all the code below that can stay the same.
What would this break? Presumably most controllers would still be supporting both experimental and standard channel even with this change.
Clients that are authoring gateway resources (eg. Knative) that have typed clients would break. We wouldn't be able to work with both the standard channel and the experimental channel easily.
Clients that are authoring gateway resources (eg. Knative) that have typed clients would break. We wouldn't be able to work with both the standard channel and the experimental channel easily.
Wouldn't you already have this issue? If a cluster only has standard channel CRDs installed and you try to install config that has experimental fields, won't that break? It seems like you'd already need to be aware of the channel of CRDs that is present when you're deciding what to configure. Or if everything fits in standard channel CRDs, just use those because they'll be far more stable and widely available.
There's no way to safely transition from experimental to standard. Let's say someone wanted to try out a new feature in experimental channel temporarily - now they're stuck on experimental channel in that cluster ~forever. The only truly safe path is to uninstall the experimental CRD and then install the standard one, but that's very disruptive.
- Install experimental new version with "v1+v1alpha1"
- Storage version migrate everything to v1 (k8s will block you from skipping this step, so it's not as error-prone as it seems)
- Install standard (removes v1alpha1)
Seems safe to me?
There's no way to just try out an experimental CRD in some portion of your cluster.
I don't think this is a desired state. Nor is it common in other projects, including Kubernetes core.
As a controller implementation, I would certainly not allow this; if the experimental code is enabled in our central controller it impacts the entire cluster, not just some namespaces that are using the experimental ones. There is a shared fate in a shared controller.
Some providers like GKE have an option to install standard channel CRDs as part of cluster management. Based on discussions at KubeCon, it seems likely that this will start to extend to other cluster provisioners. It is ~impossible to install experimental channel CRDs in clusters with corresponding standard channel CRDs managed by the cluster provisioner.
The same exists for "Alpha" API features in most Kubernetes providers. I don't see why we need new solutions here.
Wouldn't you already have this issue? If a cluster only has standard channel CRDs installed and you try to install config that has experimental fields, won't that break? It seems like you'd already need to be aware of the channel of CRDs that is present when you're deciding what to configure. Or if everything fits in standard channel CRDs, just use those because they'll be far more stable and widely available.
It's more about the go types and client code. If I start using an experimental feature/CRD and then it's promoted to standard channel that's a breaking change for me to support.
- Install experimental new version with "v1+v1alpha1"
- Storage version migrate everything to v1 (k8s will block you from skipping this step, so it's not as error-prone as it seems)
- Install standard (removes v1alpha1)
Seems safe to me?
This is true in the case of GRPCRoute, but not likely to be true in many other cases. For example, HTTPRoute will often have several different experimental fields, and only some of them will graduate to standard in a given release. Some may also have breaking changes along the way.
I don't think this is a desired state. Nor is it common in other projects, including Kubernetes core.
Disagree. Kubernetes upstream APIs have long had the problem that no one tests them while in alpha. Gateway API + CRDs were intended to be a way to get a shorter feedback loop on API design. Repeating the problematic patterns of upstream Kubernetes APIs is not desirable here IMO.
As a controller implementation, I would certainly not allow this; if the experimental code is enabled in our central controller it impacts the entire cluster, not just some namespaces that are using the experimental ones. There is a shared fate in a shared controller.
+1 completely agree, each controller should decide if it's going to support experimental resources or not. What we've found with Gateway API is that it's very common to have multiple implementations of the API running in the same cluster, and some may offer production readiness, while others may be more experimental in nature.
The same exists for "Alpha" API features in most Kubernetes providers. I don't see why we need new solutions here.
This has resulted in near-zero feedback for any Kubernetes alpha APIs which is very painful (coming from someone who's had to deal with this cycle multiple times). In Gateway API we have an opportunity to have a demonstrably better feedback loop, which I believe should lead to a demonstrably better API. If no one uses or implements experimental channel because it's either too unsafe or just impossible to access on any of the managed Kubernetes providers, we've just unnecessarily recreated the same problems that upstream Kubernetes APIs have.
It's more about the go types and client code. If I start using an experimental feature/CRD and then it's promoted to standard channel that's a breaking change for me to support.
This proposal continues to use the same go types for both experimental and standard channel (just with type aliasing like we're already doing). The only thing you'd need to change is the API group you're pointing to, which I think should be relatively straightforward and also not that common of a transition. I'm assuming Knative already needs some kind of flag for whether or not to attempt to use experimental fields/CRDs; this seems like it would be a natural extension of that?
This proposal continues to use the same go types for both experimental and standard channel (just with type aliasing like we're already doing). The only thing you'd need to change is the API group you're pointing to, which I think should be relatively straightforward and also not that common of a transition. I'm assuming Knative already needs some kind of flag for whether or not to attempt to use experimental fields/CRDs; this seems like it would be a natural extension of that?
IMO it's only simple because the example you showed only uses a simple List. Once you pull in real machinery like informers, controller-runtime, custom abstractions, etc., it becomes far more complex.
This is speaking from experience when we implemented "multi version" read support in Istio for the Gateway API transition from alpha -> beta.
IMO it's only simple because the example you showed only uses a simple List. Once you pull in real machinery like informers, controller-runtime, custom abstractions, etc., it becomes far more complex.
That's fair, I'm curious if there are any shims or reference code that we could provide that would help here. My guess here is that the vast majority of controllers would need the following:
- Event handlers for informers from both channels
- Interface that could update status of resources from either channel
Is there anything else I'm missing here?
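For the second bullet, a minimal sketch of the kind of channel-agnostic shim a controller might define; the names here are purely illustrative, not an existing Gateway API package:

```go
package shim

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	gatewayv1 "sigs.k8s.io/gateway-api/apis/v1"
)

// Route hides which API group (standard or experimental) an HTTPRoute came
// from; the shared reconcile logic only ever sees this interface, and the
// status update is routed back through the client for the originating group.
type Route interface {
	metav1.Object
	Spec() gatewayv1.HTTPRouteSpec
	UpdateStatus(ctx context.Context, status gatewayv1.HTTPRouteStatus) error
}
```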
Here is an example of us handling it: https://github.com/istio/istio/pull/41238/files. You'll note we had to duplicate some of our controllers entirely. This was only acceptable because it was short lived and caused by our own mistake in Istio rather than in the upstream API forcing it upon us.
If our concern is we will not get people trying out experimental features, I don't get how this helps. It requires both a user AND controller opt-into supporting it, and both are painful. I don't expect every controller to have tons of code to handle this or expect Helm charts to update to have `if gateway.experimental.enabled { ... }`
I managed to originally write the following comment on #2919. 🤦♂️ I'm repeating it here, along with @robscott's responses. Sorry for the confusion! (@robscott, please sing out if you think I'm misrepresenting you here.)
@robscott, ultimately I think we're falling a bit into the how-before-what trap here -- I kind of feel like we're wrangling about how to do things without having a clear sense of exactly what we need to support. Could we back up a moment and lay out some use cases here?
from @robscott: Agreed, in my opinion, we're trying to accomplish the following goals:
- Optimize for stability in standard channel. This means avoiding exposure of alpha API versions in standard channel to avoid future painful deprecations or long term support of alpha.
- Do what we can to enable users that have tried an experimental channel CRD to have an upgrade path to standard channel. ...[T]his is actually rather difficult with our current model...
A few that come immediately to my mind:
- The experimental channel has GRPCRoute `v1alpha2`, which should be promoted to `v1`. Ana has been installing the experimental channel in her clusters, since she's developing against GRPCRoute; she would rather use the standard channel instead, though. What should she do?
from @robscott: A. Upgrade to experimental v1.1 CRDs that have both v1alpha2 and v1 API versions; B. Upgrade to a controller that supports v1; C. Upgrade to standard channel CRDs
- Same situation as (1), but Ian has been managing the CRDs for Ana. How does the migration happen smoothly?
from @robscott: Same as above, just need to make sure the above order of operations happen, doesn't matter how much time passes in between each step though.
- Ana has been using the experimental channel's support for the `TeapotResponse` stanza in HTTPRoute `v1alpha7`. `TeapotResponse` is being promoted to standard channel, and Ana wants to stop installing the experimental channel CRDs. How does `TeapotResponse` get moved to standard exactly? What does Ana need to do? (Assume that we currently have HTTPRoute `v1` as standard.)
from @robscott: This gets to the root of the problem... There's probably no safe path from experimental to standard in this scenario. HTTPRoute in particular usually has several experimental things at once, and not all of them will graduate at the same time. If you were to upgrade from experimental to standard channel, you'd almost certainly end up losing fields and data from experimental channel with unpredictable outcomes.
- Same situation as (3), but again, Ian is managing the CRDs instead of Ana having control over this.
from @robscott: Still awful unless we split CRDs... The current state is that there's no safe path from experimental to standard. GRPCRoute is the exception because the entire resource is graduating and it has been unchanged for many releases.
- Shortly before `TeapotResponse` was proposed, the Bertrand Gateway controller wrote GEP-BERTRAND proposing `RussellsTeapot` with the semantics that eventually became `TeapotResponse`. GEP-BERTRAND was accepted into the experimental channel, but now that `TeapotResponse` is the accepted standard, the Bertrand folks need to cope with the fact that they have users of `RussellsTeapot` that need to be migrated to `TeapotResponse`. What should they do, exactly? Assume that they're catching this while `TeapotResponse` is still experimental, and that, again, we currently have HTTPRoute `v1` as standard.
from @robscott: I think unfortunately you're stuck supporting both APIs for a long time, depending on the support guarantees of your implementation.
- Same situation as (5), but suppose that the Bertrand folks wait until `TeapotResponse` has been promoted to standard.
from @robscott: I think the same largely applies.
(1) is, I think, what we've been discussing with GRPCRoute. (3) is, I think, a hypothetical that @robscott proposed, with concrete names and versions so we can talk about concrete solutions. (5) is a thing that happened to me recently with a Linkerd-specific CRD, and is likely about to happen to Envoy Gateway with TLS validation. (Thankfully, the Linkerd situation happened before it was released to the world, though after I started working with it for demos.)
What other situations come to mind?
The only truly safe path is to uninstall the experimental CRD and then install the standard one, but that's very disruptive.
The introduction of yet another version is also disruptive. I'm still not clear on how adding another version solves the problem of disruption for GRPC or other cases.
I don't know the historical reasoning behind why the CRDs are not vendored, like other APIs we consume. To me that would be an alternate solution, and we haven't talked about it.
I will pile on to what appears to be the broad consensus of all comments: I completely agree with everyone that another 'experimental' CRD group is harmful - however, for the same reasons, the current experimental CRD model is even more harmful and broken.
Despite the comments - actions show broad consensus by all implementations on defining Gateway APIs as vendor extensions, under each vendor's namespace. And each vendor does have 'beta' or 'public preview' or 'GA' labeling for each API they define.
It seems there is also agreement in this thread that this project ( gateway-api ) should not define some other space for new APIs. I completely agree - it will lead to confusion and attempts to define APIs in a void, without an implementation.
The only thing missing is a space ( or spaces ) where different vendors can collaborate on a common API ( after they have their own implementation ), build interoperability tests - similar to IETF - before that API can be proposed for merger into this repository and part of the core.
Of course that depends on sets of vendors or other orgs doing this - it could be a WASM-oriented repo driving a common WASM API, or a telemetry repo defining telemetry APIs. In the absence of such collaboration or orgs - it is also possible to continue the current process of collecting 'prior art' - in the form of existing vendor extensions - and define the common API based on that. Which is the process we already used for HttpRoute itself ( VirtualService, Ingress and several other vendor-specific APIs were considered).
TL;DR: it seems we all agree - in words and actions - with what I consider the spirit of the proposal, which is to have new CRDs and features implemented in separate API Group and using different Names.
I would note that for each API defined by a vendor or independent organization - the status ( GA or private preview or whatever the vendors use for their feature stability definition - doesn't have to be a version ) is associated with a specific feature. If a vendor defines an OTel API as v1 and marks it stable - it is certainly not an 'experimental' API, just a single-vendor API. Users can safely use the API - as well as similar stable APIs from other vendors - with the deprecation policy and guarantees of each vendor.
The only point where this WG is involved is when a common API needs to be defined based on (stable, proven) vendor implementations of a feature, and conformance tests need to be defined and agreed on.
The process is very similar to the IETF model.
I'm hopeful that with Storage Version Migrator moving in-tree in Kubernetes 1.30 with KEP-4192, we may have a tool to help with this workflow, but (from my experience testing the out-of-tree impl) the behavior is too global/automatic by default (and therefore scary!)
I hope we may be able to provide a "safe" upgrade path with a bit of custom tooling using preflight checks in gwctl, similar to the approach @howardjohn suggested in https://github.com/kubernetes-sigs/gateway-api/pull/2912#issuecomment-2030128977:
- Install Experimental channel new version with v1alpha stored and v1alpha,v1 served versions.
- Check if any CRDs, fields (or enum values? what else?) in use are missing from the Standard channel. Attempting to overwrite CRDs missing in-use stored versions will block the "missing CRDs entirely" bit, but we can still maybe handle this UX a bit nicer and earlier. We can't just compare against newer served versions because that wouldn't cover post-v1 changes like adding fields to HTTPRoute in the Experimental channel (I think the example @kflynn gave with a v1alpha7 HTTPRoute is not how we intend to make post-v1 backwards-compatible additions? Please LMK if I'm mistaken though.) This check may be difficult (and I don't want to manually maintain the logic), but I'm curious if we could do this with sufficient investment in some code generation. This is somewhat similar to the approach proposed in KEP-2558: Publish versioning information in OpenAPI except that we have the benefit of already having parsable channel flag comments.
- Warn user with sufficient detail.
- If no warnings are found (or a `y/N` override is passed?), find Gateway API group CRs with a newer available served version (for initial promotion from v1alpha to v1 use cases), create an SVM migration and watch for completion.
- Report successful migration, provide instructions to move to Standard channel.
- Install Standard channel CRDs, removing v1alpha versions.
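A very rough sketch of part of the preflight idea above, covering only the stored-version check (detecting in-use experimental fields is the harder part, as noted). The version map, package, and function names are assumptions; nothing like this exists in gwctl today:

```go
package preflight

import (
	"context"
	"fmt"

	apiextensionsclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// standardServedVersions would be generated from the Standard channel
// manifests being installed; hard-coded here purely for illustration.
var standardServedVersions = map[string][]string{
	"grpcroutes.gateway.networking.k8s.io": {"v1"},
}

// checkStoredVersions warns about any Gateway API CRD whose stored versions
// would no longer be served after switching to the Standard channel, i.e.
// objects that need a storage version migration first.
func checkStoredVersions(ctx context.Context, client apiextensionsclient.Interface) error {
	crds, err := client.ApiextensionsV1().CustomResourceDefinitions().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, crd := range crds.Items {
		served, managed := standardServedVersions[crd.Name]
		if !managed {
			continue
		}
		for _, stored := range crd.Status.StoredVersions {
			if !contains(served, stored) {
				fmt.Printf("WARNING: %s still stores %s; migrate before installing Standard channel\n", crd.Name, stored)
			}
		}
	}
	return nil
}

func contains(versions []string, v string) bool {
	for _, s := range versions {
		if s == v {
			return true
		}
	}
	return false
}
```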
Notably, this may not handle breaking changes between e.g. v1alpha1 and v1alpha2 in the Experimental channel if we take the same approach as we are with BackendTLSPolicy, but I think that's okay?
There's no way to just try out an experimental CRD in some portion of your cluster. You either try it for everything in your cluster or nothing.
I feel like this is more of a nice-to-have than a requirement - cloud providers have enabled such a proliferation of clusters that spinning up a new cluster with Experimental channel CRDs and redirecting some traffic to it doesn't seem too unreasonable. I view this primarily as an at-scale use case where a platform team would be managing shared app dev team access to clusters and could take on this story, not a must-have workflow for small self-serve teams. This is a pattern that wouldn't be bad to nudge users toward for other changes too, like migrating to a newer Kubernetes version instead of upgrading in-place.
As we're seeing with GRPCRoute, standard channel CRDs have to adopt some characteristics of experimental channel CRDs, like providing alpha API versions, if we want it to be possible to install them in clusters that currently have experimental CRDs.
For an initial v1alpha -> v1 promotion, I think simply serving v1 versions in the Experimental channel and providing some well-lit path for upgrading stored versions might be sufficient instead? (I'm not quite clear on how including v1alpha CRDs in Standard but not serving or storing them as GRPCRoute may do works with "automatic" translation as described in https://gateway-api.sigs.k8s.io/guides/crd-management/#api-version-removal and if that process changes with SVM in-tree.) Post-v1 though, I don't think even serving alpha versions would safely allow migrating from Experimental CRDs testing a new field back to Standard...
It is ~impossible to install experimental channel CRDs in clusters with corresponding standard channel CRDs managed by the cluster provisioner.
On AKS we're evaluating an approach to allow users to "opt-out" of Gateway API management by the cluster provisioner - we'll install Standard channel CRDs by default when needed, but we want to let users "offboard" if they need functionality only available in the Experimental channel. Providing a safe path back to a managed Standard channel is a challenge currently though. Additionally, I would like to make it easier to install Experimental channel CRDs more granularly, such as "Standard channel for everything except Experimental channel HTTPRoute".
Although our versioning model expressly allows for breaking changes in experimental channel, it's likely that they would be very disruptive because some users have installed experimental channel CRDs without realizing it (same names, fields, etc.) and could end up with things inexplicably breaking on them the next time they upgrade CRDs.
I think the way we're choosing to handle this with BackendTLSPolicy is reasonable - some pain is okay if it's not a surprise, and it's not possible to accidentally break in-use resources.
Despite the comments - actions show broad consensus by all implementations on defining Gateway APIs as vendor extensions, under each vendor's namespace.
We have seen some of this historically, but from conversations I've had with maintainers this seems to largely be a pattern which Gateway API implementations hope to move away from to avoid end-user confusion, particularly for incremental changes to existing CRDs. For the well-defined extension points in Gateway API (filters, policies), this is a viable path though.
The only point where this WG is involved is when a common API needs to be defined based on (stable, proven) vendor implementations of a feature, and conformance tests need to be defined and agreed on.
I think this is precisely the stage we're trying to better define here. I do expect we'll still see some new CRDs emerge from existing vendor-specific implementations (authorization policy as a prominent example), but we're largely trying to focus on the "mid-tier" with Gateway API - the path for moving from experimental shared APIs for common functionality (after being proven in vendor-specific implementations) to a standard.
Despite the comments - actions show broad consensus by all implementations on defining Gateway APIs as vendor extensions, under each vendor's namespace.
We have seen some of this historically, but from conversations I've had with maintainers this seems to largely be a pattern which Gateway API implementations hope to move away from to avoid end-user confusion, particularly for incremental changes to existing CRDs. For the well-defined extension points in Gateway API (filters, policies), this is a viable path though.
I'm sure all implementations would like their specific features to be added to the existing CRDs directly - instead of having to do the harder work of going through the stages of experimentation, multiple implementations, and proof that it works.
We hope and want for a lot of things - some are feasible, others are not, and what is nice for specific implementations is certainly not so nice for the users who have to deal with divergence between implementations and can't rely on any portable interface.
The criteria of having 2-3 implementations for a core API fail to take into account the reality of long-term supported releases, typical upgrade cycles - and multiple implementations used in the same cluster.
In any case - this proposal is orthogonal to that - if consensus exists to add a field to a core API directly as stable, with no experiment or proof - it should be added with whatever process is defined.
If a feature does not have consensus on moving directly to stable - it will still need a mechanism for experimentation and for implementations to prove the viability, users to provide feedback, etc.
As a user, I would prefer APIs that have been proven and vetted over APIs that are directly pushed to stable - even if that makes things a bit harder for implementations.
The only point where this WG is involved is when a common API needs to be defined based on (stable, proven) vendor implementations of a feature, and conformance tests need to be defined and agreed on.
I think this is precisely the stage we're trying to better define here. I do expect we'll still see some new CRDs emerge from existing vendor-specific implementations (authorization policy as a prominent example), but we're largely trying to focus on the "mid-tier" with Gateway API - the path for moving from experimental shared APIs for common functionality (after being proven in vendor-specific implementations) to a standard.
That's very simple - if I understand the proposal correctly, it means the experimental shared API would live in a different API group - like "authorization.experimental.k8s.io" - get the 3-4 implementations needed and evolve without concerns about backwards compat or stability until everyone is happy - and then copy it to the v1 API group.
Implementations can support the experimental api group for N releases - in parallel with v1.
Same model used for example for H3 - with different drafts using other names, and the final RFC using h3.
if consensus exists to add a field to a core API directly as stable, with no experiment or proof - it should be added with whatever process is defined.
I don't believe anyone is suggesting this.
If a feature does not have consensus on moving directly to stable - it will still need a mechanism for experimentation and for implementations to prove the viability, users to provide feedback, etc.
The contention of most maintainers in this thread is that the existing Experimental channel model (as defined at https://gateway-api.sigs.k8s.io/concepts/versioning/#release-channels) is a better way to handle this both for implementations, and, importantly, for end-user experience.
The contention of most maintainers in this thread is that the existing Experimental channel model is a better way to handle this both for implementations, and, importantly, for end-user experience
I have not seen any comment suggesting that either users or implementations are happy with the current experimental model or know a good way to handle any significant changes between experimental and v1.
In Istio it has been almost impossible to fix anything between alpha and v1.
The contention of most maintainers in this thread is that the existing Experimental channel model (as defined at https://gateway-api.sigs.k8s.io/concepts/versioning/#release-channels) is a better way to handle this both for implementations, and, importantly, for end-user experience.
My concern is that that's because we haven't introduced many breaking changes into experimental channel yet. That's leading people to believe that experimental channel is more stable than it's intended to be.
In Istio it has been almost impossible to fix anything between alpha and v1.
+1, this is one of my biggest concerns. Although I'm not very familiar with Istio versioning, I'm very familiar with the problems we've faced in Kubernetes re: changing beta APIs. Whenever an API version is broadly accessible (beta in upstream Kubernetes, experimental in Gateway API), it becomes very difficult to make any breaking changes.
If we're not very careful here, we're going to end up with the same result all over again where it becomes impossible to change APIs, even if they're technically labeled as alpha. My theory is that having a stronger separation via separate API groups and names will initially be somewhat painful but will lead to a much more sustainable API long term. (Imagine the pressure on API reviewers if approving an alpha API meant that everything had to be ~perfect the first time because we could never change anything after that initial release.)
I think https://github.com/kubernetes-sigs/gateway-api/pull/2955#discussion_r1560285876 is reasonable - some pain is okay if it's not a surprise, and it's not possible to accidentally break in-use resources.
Agreed, I think this is the best case scenario. Importantly it only works when you're changing an entire resource. If you're changing an experimental field in a stable API like HTTPRoute you simply don't have that option available. The only option I can think of is "painful surprise" unless we separate the release channels like I'm proposing here.
On AKS we're evaluating an approach to allow users to "opt-out" of Gateway API management by the cluster provisioner - we'll install Standard channel CRDs by default when needed, but we want to let users "offboard" if they need functionality only available in the Experimental channel. Providing a safe path back to a managed Standard channel is a challenge currently though.
Yep, I think it's reasonable to offer a path to offboard CRD management, GKE also has this, but it's very difficult to offer a safe upgrade path back to managed stable CRDs. This proposal is an attempt to change that.
- Check if any CRDs, fields (or enum values? what else?) in use are missing from the Standard channel. Attempting to overwrite CRDs missing in-use stored versions will block the "missing CRDs entirely" bit, but we can still maybe handle this UX a bit nicer and earlier. We can't just compare against newer served versions because that wouldn't cover post-v1 changes like adding fields to HTTPRoute in the Experimental channel (I think the example @kflynn gave with a v1alpha7 HTTPRoute is not how we intend to make post-v1 backwards-compatible additions? Please LMK if I'm mistaken though.) This check may be difficult (and I don't want to manually maintain the logic), but I'm curious if we could do this with sufficient investment in some code generation. This is somewhat similar to the approach proposed in KEP-2558: Publish versioning information in OpenAPI except that we have the benefit of already having parsable channel flag comments.
Unfortunately I think it would be very difficult to maintain a tool like this. We'd need to have a tool that maintained the changes between every possible combination of CRDs and detect if any were set to a non-zero value. Even if we could detect this reliably, my working theory is that stable production usage of APIs should be entirely disconnected from experimental usage and they should be able to coexist within the same cluster. This approach would mean experimental usage in the dev namespace would prevent a prod upgrade from getting a newly graduated feature that is clearly needed.
Agree that we wouldn't end up with a v1alpha7 on a resource that's already made it to standard channel. Once it gets to that point the only changes allowed are backwards compatible and therefore no more version revs.
- Warn user with sufficient detail.
- If no warnings found (or y/N override passed?), find Gateway API group CRs with a newer available served version (for initial promotion from v1alpha to v1 use cases), create SVM migration and watch for completion. Report successful migration, provide instructions to move to Standard channel.
This doesn't really solve the problem for providers that are trying to provide a fully managed experience - ideally upgrades are safe and automatic. Our goal should be for a user to be able to start an upgrade and know that it will be safely executed - that's easy to accomplish if the only APIs installed by the provider are guaranteed to be stable and backwards compatible, but it falls apart if you introduce experimental APIs with the same name and group to the equation.
I wrote a longer rant doc in the context of Istio - but IMO the concept of 'semantic versioning' for APIs is very harmful and has created major problems.
For protocols like HTTP/1.1, HTTP/2, HTTP/3 - or IPv4 and IPv6 - it works great, because they have mechanisms to be used at the same time and are all long-term stable. That's not the case with APIs or CRDs.
"Alpha" or "experimental" are just a way to justify launching APIs faster and skipping the hard work and due diligence (scale, security, usability, consistency, etc) - and putting the burden on the user to deal with any problems that are found in the API - or as is the case in Istio, getting stuck with whatever was barely reviewed as experimental because making changes is too painful and users are already relying on the API.
It is fine to launch a throw away API or CRD with a short support window - like the drafts that led to HTTP/3 in the protocol world, as long as it is clear the API will be dropped and replaced.
It is fine to launch a vendor API - with long term support, even if a 'least common denominator' API will also be supported later.
What is not fine is pretending it is ok for a user to take an experimental or alpha CRD and use it in any production environment ( 'to allow users to provide feedback' ) and expect we'll be able to make any structural or major changes and fix things afterwards, or play games with allowing some experimental APIs in production. If you need proof - look at Istio APIs and pseudo-APIs ( env variables, etc) - and how many real changes we had between 'alpha1' and 'v1'.
As always, I think that @kflynn's use cases are very useful for understanding the problems here.
Before I get started discussing that though, I think that it's important to review how the channels work on a per-object basis as well as on a per-field basis.
Versioning
We have two channels in each release bundle, experimental and standard.
Experimental includes:
- Resources that are at an alpha level (`GRPCRoute` meets this before v1.1).
- Fields that are not considered standard yet in already GA'd resources. So the example about a `TeapotResponse` stanza in HTTPRoute `v1alpha7` is not really correct - there will never be a `v1alpha7` of HTTPRoute. If we had major changes required, we could conceivably start with a v2 using `v2alpha1`, but that then would mean that version starting the whole experimental process over again. I cannot currently think of any way that we could need to do that for any of our graduated resources.
Standard includes:
- `v1` resources
- Standard fields on those resources
That's it.
This problem arises because we have a rule that we don't include any alpha things in the Standard channel.
Problems
This means that for a graduation like GRPCRoute, there's no safe, easy migration path between the v1.1 Standard resources and the v1.0 Experimental resources, because the v1.1 Standard resources don't include any definitions for the v1alpha2 resources that, if you've been using the v1.0 Experimental resources, you are already using.
Technically this is fine, as @costinm mentions, because no one should be using GRPCRoute in any production scenario, and recreating all of your GRPCRoute resources from scratch means you need to check the resources as you reapply them.
This is a terrible experience for the most active members of our community though, who have been doing what we need and actually testing this functionality. In order to ensure that any GRPCRoute config in the cluster before upgrading to v1.1 is present, users will need to:
- pull down all GRPCRoute objects from all namespaces, and save them as YAML
- change the `apiVersion` field from `gateway.networking.k8s.io/v1alpha2` to `gateway.networking.k8s.io/v1`
- Install v1.1 standard
- Reapply all the YAMLs they pulled down
This is an annoying, manual, error-prone process that Kubernetes has mechanisms designed to avoid, particularly in the case where objects can be safely round-tripped between versions, since there are no incompatible changes. (We maintainers work very hard to ensure this is the case!)
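Purely to illustrate how manual that process is, here is a hedged sketch of what a one-off migration helper might do with the dynamic client, assuming the old objects are snapshotted before the CRD swap and recreated afterwards; names and error handling are illustrative only:

```go
package migrate

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

var (
	oldGRPCRoutes = schema.GroupVersionResource{Group: "gateway.networking.k8s.io", Version: "v1alpha2", Resource: "grpcroutes"}
	newGRPCRoutes = schema.GroupVersionResource{Group: "gateway.networking.k8s.io", Version: "v1", Resource: "grpcroutes"}
)

// snapshot runs while v1alpha2 is still served: pull down every GRPCRoute.
func snapshot(ctx context.Context, c dynamic.Interface) ([]unstructured.Unstructured, error) {
	list, err := c.Resource(oldGRPCRoutes).Namespace(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	return list.Items, nil
}

// reapply runs after the v1.1 Standard CRDs are installed (and the old
// objects are gone): rewrite the apiVersion and recreate each object.
func reapply(ctx context.Context, c dynamic.Interface, routes []unstructured.Unstructured) error {
	for i := range routes {
		r := routes[i].DeepCopy()
		r.SetAPIVersion("gateway.networking.k8s.io/v1")
		r.SetResourceVersion("")
		r.SetUID("")
		if _, err := c.Resource(newGRPCRoutes).Namespace(r.GetNamespace()).Create(ctx, r, metav1.CreateOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```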
Rob's proposal
In this PR, @robscott makes the case that we should make this split more apparent to both users and implementation maintainers, by splitting the experimental code out into separate objects. This locks in the above process and makes it required for every experimental -> standard resource transition. Pros and cons of this approach as I see it:
Pro:
- The split between experimental and standard is very clear.
- The path for moving config between experimental and standard is also very clear. There is none. Users must manually make the changes in each and every object.
- You can conceivably have different versions of the objects installed in the cluster. This allows people to experiment with new fields and objects in the same cluster as an implementation that only uses Standard objects. This also means you could conceivably be testing v1.1 experimental in the same cluster as running v1.0 Standard.
Con:
- Having to make manual changes for these things is bad UX. The config is not actually different, the functional thing that we are saying here is "we can't guarantee that this transition is safe, so you have to have a human do it".
- Having different versions of types with the same name (aside from their API Group) installed in the cluster is a recipe for disaster. This will mean that every interaction with a cluster with both Experimental and Standard resources installed will require, for example, `kubectl get httproutes.gateway.networking.x-k8s.io` or `kubectl get httproutes.gateway.networking.k8s.io` to disambiguate between the two. I think that if you're using the short name, which one you'll get is at best poorly defined.
Solving the migration problem for fields
Because of the way that Kubernetes handles unknown fields in persisted objects, changing from experimental channel to standard channel is not guaranteed to produce reliable behavior because the following can happen:
- Experimental channel includes a `TeapotResponse` filter as an experimental field for HTTPRoute. (HTTPRoute is already `v1`.)
- Ana uses the `TeapotResponse` filter in HTTPRoute objects, and this config is persisted to etcd.
- Someone (Ian, Chihiro, or Ana) installs a Standard channel version of the Gateway definitions that does not include `TeapotResponse`.
- GETs, LISTs, or any other read of this object will not include the `TeapotResponse` config. But it is still present in etcd, until something modifies the object, at which time the values will be pruned.
- So, if anything at all touches the object, then the `TeapotResponse` config will be pruned and gone, as you would expect.
However, if nothing touches the object, and the TeapotResponse config moves to standard in a later version, then the config will be read out by the apiserver on reads, reappearing as if by magic.
Is this situation likely? No. Is it that bad? Probably not, but we can't guarantee it, which is critical. In practice, I think that it's very unlikely that objects would persist for that long without being modified at all, and if we performed the incantations to invoke the storage version migrator, then this issue will never arise, because the storage version migrator's whole job is to do a no-op write to the object to prevent exactly this sort of issue.
The other thing that could conceivably happen here is that we have a field with the same name, but a different behavior. In practice, again, we don't allow this as an API change, to prevent exactly this sort of thing. New behavior == new name.
In summary, solving this problem for fields with a high level of confidence involves ensuring that the storage version migrator or similar operation is running for anyone who is unsure.
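For reference, the equivalent of that no-op write can be sketched with the dynamic client; this is roughly what the storage version migrator automates, not its actual code, and the names here are illustrative:

```go
package migrate

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

// rewriteAll sends an empty patch to every object of the given resource,
// prompting the apiserver to read, convert, prune unknown fields against the
// currently installed CRD schema, and re-persist at the storage version.
func rewriteAll(ctx context.Context, c dynamic.Interface, gvr schema.GroupVersionResource) error {
	list, err := c.Resource(gvr).Namespace(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for i := range list.Items {
		item := &list.Items[i]
		if _, err := c.Resource(gvr).Namespace(item.GetNamespace()).Patch(
			ctx, item.GetName(), types.MergePatchType, []byte(`{}`), metav1.PatchOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```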
Solving this problem for whole objects
For whole objects, the story is a little easier, I think.
To me, it seems that the simpler way to handle this problem is to relax the rule about not allowing any alpha resources in the Standard channel, if and only if the alpha resource being allowed is identical to the GA resource that's also included in Standard. This would mean that the user's upgrade process goes like this, using GRPCRoute as the example:
- have working config of GRPCRoute at `v1alpha2` using Gateway API v1.0 Experimental channel
- install Gateway API v1.1 Standard, which will include both `v1alpha2` and `v1` GRPCRoute objects, for a defined number of versions. `v1` is the storage version though.
- Run the storage version migrator on all GRPCRoute objects.
You've now migrated your config, and can safely upgrade to the later release that removes GRPCRoute v1alpha2 from the storage versions. (When the storage version is v1, new objects will be saved as v1, and once the v1alpha2 is removed, attempted CRUD operations on v1alpha2 versions will fail).
This approach implies the following graduation process for graduating whole resources from Experimental to Standard:
- The community decides that an Experimental resource is complete and marks it for graduation in the next release. At this time, the resource is frozen and no further changes will be accepted for it until after that release.
- In the next release, the `v1` resources are introduced, and the YAMLs are updated to include the `v1` definitions, with the frozen experimental version available as an alternate storage version and definition. As part of this change, we also declare when the alpha versions of the object will be removed from the Standard install. This is only provided as a user convenience. The actual Go types are also changed at this point so that the alpha versions are type aliases to the `v1` versions (see the sketch below).
- After the deprecation period ends, the alpha versions are removed from everywhere.
This is the same process we used for graduating the currently-GA resources from v1beta1 to v1, it just skips the beta part.
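For the type-alias step mentioned above, this is essentially the pattern already used for the v1beta1 aliases, sketched here for GRPCRoute under the assumption that it lands in `sigs.k8s.io/gateway-api/apis/v1`:

```go
// Package v1alpha2 re-exports the graduated types so code importing either
// path compiles against the same underlying Go type.
package v1alpha2

import v1 "sigs.k8s.io/gateway-api/apis/v1"

type GRPCRoute = v1.GRPCRoute
type GRPCRouteSpec = v1.GRPCRouteSpec
type GRPCRouteStatus = v1.GRPCRouteStatus
```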
This allows the safe migration of the newly promoted resource only from Experimental to Standard, but the following things can still happen in an Experimental -> Standard migration.
- Any Experimental fields in use in a GA object will be lost once the storage version migrator is run.
- Experimental objects will actually stay in the cluster until the CRD definition is manually removed (since we can't remove objects as part of a `kubectl` install or similar operation).
Regardless of what we end up doing, I think that we need to prioritize documentation about how to move between versions and channels. This should be basically the same for most transitions, since things are either going to work for sure, or need a human to check.
This current proposal is effectively a way to enforce having to have a human check the experimental -> standard transition. I think we can do better than that.
What is not fine is pretending it is ok for a user to take an experimental or alpha CRD and use it in any production environment ( 'to allow users to provide feedback' ) and expect we'll be able to make any structural or major changes and fix things afterwards, or play games with allowing some experimental APIs in production.
FWIW I agree with this @costinm, and it's why I think a goal of allowing experimental and standard APIs for the same resource to coexist in the same cluster is dangerous, because it encourages this behavior. Isolating experimental CRDs in an "edge release" cluster and splitting some traffic towards it from an external load balancer layer to test new behavior would be a safer approach from a platform engineering team perspective.
- have working config of GRPCRoute at v1alpha2 using Gateway API v1.0 Experimental channel
- install Gateway API v1.1 Standard, which will include both v1alpha2 and v1 GRPCRoute objects, for a defined number of versions. v1 is the storage version though.
@youngnick does this work if v1alpha2 is included in the CRD but not served (and maybe marked as deprecated: true)? What would be the implications of that? Referring to migration process described in https://static.sched.com/hosted_files/kcsna2022/75/KubeCon%20Detroit_%20Building%20a%20k8s%20API%20with%20CRDs.pdf#page=17
In the next release, the v1 resources are introduced, and the YAMLs are updated to include the v1 definitions, with the frozen experimental version available as an alternate storage version and definition
@youngnick v1 definitions added to YAML for which channel, both? Frozen experimental version still set as storage version? In Experimental channel only, or Standard channel too?