gateway-api icon indicating copy to clipboard operation
gateway-api copied to clipboard

ResolvedRefs condition for GatewayClass parametersRef

Open sjberman opened this issue 1 year ago • 7 comments

A discussion was had in Slack about adding a ResolvedRefs condition to a GatewayClass status to reflect the state of a parametersRef object that is attached. This led to a couple different concerns.

  1. How would this coexist with the current InvalidParameters reason that exists for the Accepted condition on a GatewayClass? Potential flows and their status outcomes:

    • Create GatewayClass with bad parametersRef target: GatewayClassStatus[{ type: Accepted, status: False?, reason: InvalidParameters? }, { type: ResolvedRefs, status: False, reason: ResolvedRefs? }]
    • Create GatewayClass with valid parametersRef target, but invalid values: GatewayClassStatus[{ type: Accepted, status: False?, reason: InvalidParameters? }, { type: ResolvedRefs, status: True, reason: ResolvedRefs }]
    • Create GatewayClass with valid parametersRef target and values -> delete parametersRef resource: GatewayClassStatus[{ type: Accepted, status: True?, reason: InvalidParameters? }, { type: ResolvedRefs, status: True, reason: ResolvedRefs }] a. Gateway created prior to this change expected to be okay unless dynamic changes are allowed, if so what happens? b. Gateway created referencing this GatewayClass after this change should set GatewayStatus[{ type: Accepted, status: False?, reason: ? }] c. Does GatewayClass move to Accepted: False?
    • Create GatewayClass with valid parametersRef target and values -> change parametersRef resource content so one field becomes invalid: GatewayClassStatus[{ type: Accepted, status: True?, reason: InvalidParameters? }, { type: ResolvedRefs, status: True, reason: ResolvedRefs }] a. Gateway created prior to this change expected to be okay unless dynamic changes are allowed, if so what happens? b. Gateway created referencing this GatewayClass after this change should set GatewayStatus[{ type: Accepted, status: False?, reason: ? }] c. Does GatewayClass move to Accepted: False?
  2. Another concern that was raised is what the expectations are if dynamic changes are allowed to a parametersRef (by either adding/removing the reference, deleting the resource, or updating the fields within it). Specifically, if a GatewayClass goes from being Accepted to not Accepted due to an invalid update to the parametersRef, how does that affect the entire config tree below it? By following a true declarative system, all Gateways under this class should no longer be Accepted, but this would nullify the entire configuration and be very disruptive. The other option is what was mentioned in the examples above, where previous Gateways live on, while new ones are not Accepted. Issue here, however, is that if the control plane restarts, it loses the previous state of the valid parametersRef, so the original Gateways that were left alone will no longer receive those valid params, and the config is void. Note that this is technically an issue today with the existing InvalidParameters reason on the Accepted condition, so nothing new with the proposed ResolvedRefs condition.

  3. Finally, what does conformance look like? While not stated in the spec, the existing ResolvedRefs condition for HTTPRoutes is required in the conformance tests. Would that make this condition a requirement for a GatewayClass?

sjberman avatar Mar 29 '24 14:03 sjberman

In relation to point 2, one proposal our implementation has come up with is to set Accepted: true, reason: InvalidParameters, message: GatewayClass is Accepted, but params are invalid for X reason and will be ignored. This means we ignore the params and allow the existing configuration to continue on, while giving a clear message to users as to what is happening.

The risk here is that if paramsRef goes from valid to invalid, any settings that were in there would be reverted to defaults, leading to a change in behavior or leading to downtime. The impact of this is obviously dependent on the implementation and the settings that exist in that paramsRef. Could be a better alternative than tearing down the entire config tree due to a non-Accepted GatewayClass, though.

sjberman avatar Mar 29 '24 16:03 sjberman

I believe the current fuzziness in the spec around "snapshot/templating" behavior when creating Gateways referencing a configured GatewayClass combined with the difficulty handling of "dynamic" changes to parameters (by modifying either the reference or the referenced object) helps create the justification for adding a parametersRef field under infrastructure directly on Gateways, and trying to clean up this behavior to be better-defined.

Adding a parametersRef field as part of the infrastructure stanza was originally proposed and discussed extensively in https://github.com/kubernetes-sigs/gateway-api/pull/1868 and https://github.com/kubernetes-sigs/gateway-api/pull/1757 but was ultimately not included in the implementation in https://github.com/kubernetes-sigs/gateway-api/pull/2440 - I can't find the precise comment where it was blocked/dropped though (I think it happened after the GEPs merged, during API review).

/cc @howardjohn @robscott

mikemorris avatar Apr 02 '24 16:04 mikemorris

Doesn't that just move the blast radius down one level? Unless we are considering the impact of an invalid policy attachment on a Gateway to be implementation-specific, versus stating in the spec that the Accepted condition should be false.

sjberman avatar Apr 02 '24 17:04 sjberman

Doesn't that just move the blast radius down one level?

At a minimum, yes, and that alone feels like a good improvement.

I'd expect that it could be more reasonable for a Gateway to stay { type: Accepted, status: True } and just be missing some configured parameters (expressed through its existing { type: ResolvedRefs, status: False } or adding { type: PartiallyInvalid, status: True, reason: UnsupportedValue } similar to https://github.com/kubernetes-sigs/gateway-api/pull/2429 for HTTPRoute) than to figure out how to deal with a dynamic class/template becoming invalid in any more standardized way that doesn't have a huge blast radius.

mikemorris avatar Apr 02 '24 18:04 mikemorris

Doesn't that just move the blast radius down one level?

That was also one of my original concerns. The proposal by @howardjohn in https://github.com/kubernetes-sigs/gateway-api/pull/2924 does limit the blast radius by making paramsRefs namespace-local, so the blast radius of a change would also be limited to the local namespace.

Unless we are considering the impact of an invalid policy attachment on a Gateway to be implementation-specific, versus stating in the spec that the Accepted condition should be false.

Maybe we should take these options and apply them more broadly?

https://github.com/kubernetes-sigs/gateway-api/blob/37a7ffb640bb077eb96843abdb647c2c8efea8d2/apis/v1/shared_types.go#L404-L418

robscott avatar Apr 03 '24 00:04 robscott

@robscott That's similar to the approach that I mentioned in my first comment for the GatewayClass. Basically just keep things in a good state, but fall back on the paramsRef to the default values (our implementation may just have to use this approach since we are intending to add infra configuration at the top level and the GC paramsRef is the way to do that right now). It's almost the lesser of two evils. Your paramsRef config will be reverted to a valid state, which could cause disruption downstream since values are changing, but it's less disruptive than nullifying the entire downstream config.

This obviously is the big issue to figure out, but I also want to make sure we touch on the questions around the existence of a ResolvedRefs condition and the conformance for it.

sjberman avatar Apr 03 '24 14:04 sjberman

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jul 02 '24 15:07 k8s-triage-robot

With #2924 merging, there's a path for configuring per-Gateway custom parameters now.

I'm not entirely sure if the existing InvalidParameters GatewayConditionReason for the Accepted GatewayConditionType is sufficient or if we should consider adding ResolvedRefs to Gateway conditions in addition to Listener conditions, but I think it's more clear now that we shouldn't be encouraging this pattern on the GatewayClass resource.

mikemorris avatar Jul 11 '24 19:07 mikemorris

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Aug 10 '24 19:08 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Sep 09 '24 20:09 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Sep 09 '24 20:09 k8s-ci-robot