gateway-api

xRoutes do not report possible conflicts

Open rikatz opened this issue 3 weeks ago • 5 comments

This story/bug/feature request is about a poor user experience I ran into while testing route conflict scenarios.

After creating a Gateway and attaching two identical routes to it, I get no indication as a user that something may be wrong. This makes it hard to understand why my routes are not working (or are working but returning the wrong answers), because nothing signals that my route is not the one actually programmed on the proxy.

Let's take a look at the following manifest:

---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: gateway
spec:
  gatewayClassName: someclass
  listeners:
  - name: default
    port: 80
    protocol: HTTP
    allowedRoutes:
      namespaces:
        from: All
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: echo
  namespace: user1
spec:
  parentRefs:
  - name: gateway
  hostnames: ["some.example.tld"]
  rules:
  - matches:
    - path:
        type: Exact
        value: /
    backendRefs:
    - name: echo
      port: 3000
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: echo
  namespace: user2
spec:
  parentRefs:
  - name: gateway
  hostnames: ["some.example.tld"]
  rules:
  - matches:
    - path:
        type: Exact
        value: /
    backendRefs:
    - name: echo
      port: 3000

Once this is applied, the users in namespaces user1 and user2 both expect a curl to "http://some.example.tld" to return their own backend. Only user1 gets the right answer; user2 is led to believe that the route is working fine and that the app itself is misbehaving and returning something different.

Why?

Because the status of neither route tells its owner that one of the routes was never really programmed. Both routes report the same status:

status:
  parents:
  - conditions:
    - lastTransitionTime: "2025-12-02T18:13:39Z"
      message: Route is accepted
      observedGeneration: 2
      reason: Accepted
      status: "True"
      type: Accepted
    - lastTransitionTime: "2025-12-02T18:13:39Z"
      message: Resolved all the Object references for the Route
      observedGeneration: 2
      reason: ResolvedRefs
      status: "True"
      type: ResolvedRefs

Once the older route is deleted, the new one starts working.

So we need to start giving users more information about whether a route was properly programmed and, if not, why (e.g. conflicted).

Tested implementations

The situation above was tested and confirmed with:

  • Envoy Gateway 1.5.0
  • Istio (1.29-alpha.d16be7b7a857b66a7a633f4b532c89b8428b485a)
  • Cilium 1.18.2

rikatz, Dec 02 '25 18:12

This feels like we might have a gap in conformance testing for this specific case.

I would expect that HTTPRoutes should follow the documented conflict resolution guidelines, specifically:

"If everything else is equivalent (including creation timestamp), precedence should be given to the resource appearing first in alphabetical order (namespace/name)"

...and the conflicting HTTPRoute should be set to Accepted: false. I'm not entirely sure what level of conflict granularity we can achieve consistently across dataplane implementations. A conflict on an exact path is pretty obvious, but it gets much messier when merging intersections such as header-only or query-param-only matches across different routes, especially since some degree of merging is likely desirable for cases like two separate HTTPRoutes for different path prefixes attached to the same listener. This logic may also be difficult to implement in the Gateway controller if the actual config merging (and conflict/failure) only happens during async dataplane programming (which I expect may be the case with Envoy in the cited implementations).
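As a rough illustration of the tie-break quoted above, here is a minimal Python sketch. It is not taken from any implementation: `RouteRef`, `precedence_key`, and `pick_winner` are hypothetical names, and it assumes the two routes are already known to be equivalent in every other respect, so that only creation time and then the "namespace/name" string decide the winner.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RouteRef:
    """Hypothetical minimal view of an HTTPRoute, used only for ordering."""
    namespace: str
    name: str
    created: datetime  # stands in for metadata.creationTimestamp


def precedence_key(route: RouteRef) -> tuple:
    # Older routes sort first; ties fall back to the "namespace/name" string.
    return (route.created, f"{route.namespace}/{route.name}")


def pick_winner(routes: list) -> RouteRef:
    """Return the route that takes precedence among equivalent candidates."""
    return min(routes, key=precedence_key)


# The two routes from the manifest, assumed here to share a creation time.
created = datetime(2025, 12, 2, 18, 13, 39, tzinfo=timezone.utc)
user1 = RouteRef("user1", "echo", created)
user2 = RouteRef("user2", "echo", created)
winner = pick_winner([user2, user1])
```

With identical timestamps the alphabetical fallback picks user1/echo; if user2/echo had been created earlier, it would win instead.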

The reason that Gateway listeners have a separate Conflicted status condition type is that (historically, prior to ListenerSet), they were all written within the same resource and it was preferable to avoid rejecting the entire Gateway if only a subset of listeners conflicted.

mikemorris, Dec 02 '25 19:12

@rikatz I tested this with Airlock Microgateway 4.8, and I’m seeing the same behavior that you observed with the other implementations.

I think precedence is not the same as acceptance, because the routing precedence is indeed working correctly. It might be clearer to phrase it like this:

If everything else is equivalent (including creation timestamp), precedence should be given to the resource appearing first in alphabetical order (namespace/name). The other resources should have Accepted condition set to false with the reason Conflicted.

Routes and their rules often conflict only partially. However, if two routes are exactly identical and precedence is determined solely by creation time or alphabetical order, then I believe it’s worthwhile to set the status to false. In such a scenario, a request can never be routed to the lower-precedence route as long as the other one exists.
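To make the "exactly identical" case concrete, here is a small Python sketch of one way a controller might tell a fully shadowed route from a partially conflicting one. It is an assumption-laden simplification: routes are plain dicts, only hostname and path matches are considered (no header, query, or method matches), and `match_keys` and `fully_shadowed` are hypothetical helpers.

```python
def match_keys(route: dict) -> set:
    """Flatten a route into (hostname, path type, path value) keys."""
    return {
        (host, path_type, path_value)
        for host in route["hostnames"]
        for path_type, path_value in route["paths"]
    }


def fully_shadowed(route: dict, higher_precedence: list) -> bool:
    """True when every key of `route` is claimed by a higher-precedence route."""
    claimed = set()
    for other in higher_precedence:
        claimed |= match_keys(other)
    keys = match_keys(route)
    # A route with no keys is not treated as shadowed in this sketch.
    return bool(keys) and keys <= claimed


user1 = {"hostnames": ["some.example.tld"], "paths": [("Exact", "/")]}
user2 = {"hostnames": ["some.example.tld"], "paths": [("Exact", "/")]}
partial = {
    "hostnames": ["some.example.tld"],
    "paths": [("Exact", "/"), ("Exact", "/status")],
}
```

Here `fully_shadowed(user2, [user1])` holds, so user2 can never receive traffic while user1 exists, whereas `partial` still owns `/status` and conflicts only in part.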

snorwin, Dec 02 '25 19:12

Yes, for partial validity the rules are supposed to be:

  • If there's at least one valid thing (Rule in HTTPRoute, Listener in Gateway), then the object is partially valid, and can be accepted.
  • If there are no valid things (the spec actually says "if the object would produce no config in the underlying dataplane", but this is a shorter way to write it), then the whole object must not be accepted.

The tricky part with the HTTPRoute test is that they do conflict, but only in terms of the parent ref. In some readings of Accepted, that means they could be Accepted (because they are semantically and syntactically valid), but Conflicted. However, we've never really resolved the ambiguity around whether Accepted means "locally valid" or "has attached to a Gateway" for HTTPRoute. In some places we use it one way, in others the other. We do have Programmed to indicate that as well, though.

Regardless, the case that Ricardo calls out should definitely produce a Conflicted state somewhere on the HTTPRoute that doesn't make it in, and we should have a conformance test to validate that.

I think our two options are:

  • Conflicted HTTPRoute gets Accepted true, but Programmed false, with a Conflicted reason. This indicates the current state pretty accurately, but requires folks to check Programmed correctly.
  • Conflicted HTTPRoute gets Accepted false, and Programmed false, both with a Conflicted reason. This makes Accepted mean more than just "locally correct and accepted for processing", which is a bit of an expansion on what it currently is.

After writing that out, I think I favor the former, but I could do either.
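For comparison, here is what the losing route's conditions could look like under each option, written as a hedged Python sketch. The `Conflicted` reason on a route and the use of `Programmed` shown here are the proposals from this comment, not existing behavior, and `conflicted_route_conditions` is a hypothetical helper.

```python
def conflicted_route_conditions(option: int) -> list:
    """Build the conditions for the losing route under option 1 or option 2."""
    if option not in (1, 2):
        raise ValueError("option must be 1 or 2")
    # Option 1 keeps Accepted true; option 2 flips it to false as well.
    accepted_ok = option == 1
    return [
        {
            "type": "Accepted",
            "status": "True" if accepted_ok else "False",
            "reason": "Accepted" if accepted_ok else "Conflicted",
        },
        {"type": "Programmed", "status": "False", "reason": "Conflicted"},
    ]


option_one = conflicted_route_conditions(1)
option_two = conflicted_route_conditions(2)
```

Under option 1 a user who only looks at Accepted still sees "True", which is the trade-off noted above: the state is accurate, but only for people who also check Programmed.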

Edit: Whatever we do is probably going to really suck to implement, because now we will have to do a deep comparison of the HTTPRoutes to determine config winners (I suspect that's why none of us have done this yet).

youngnick, Dec 03 '25 00:12

"Whatever we do is probably going to really suck to implement, because now we will have to do a deep comparison of the HTTPRoutes to determine config winners (I suspect that's why none of us have done this yet)."

Yea, this is my concern with the desire to surface this in status - right now I expect controllers are just accepting syntactically valid HTTPRoutes and shipping this config off to the dataplane to let it sort out conflicts/precedence.

mikemorris, Dec 03 '25 17:12

I can confirm I've seen different behavior across controllers, so I cannot guarantee that all of them drop the conflicting route.

That said, I can imagine how painful a deep comparison will be. I am not on the implementation side, but I wonder whether at least adding an index field for each gateway/route/hostname/path would help (well, saying it out loud, it seems like a very bad idea...)
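Just to show what I mean by an index, here is a very rough Python sketch. It is purely illustrative: routes are plain dicts, the key is (gateway, hostname, path type, path value), `find_conflicts` is a hypothetical helper, and it only catches exact duplicates of that key. It says nothing about overlapping prefixes or header and query matches, which is where the real pain would be.

```python
from collections import defaultdict


def precedence_key(route: dict) -> tuple:
    # Assumes ISO-8601 UTC timestamps in one format, which sort as strings.
    return (route["created"], f'{route["namespace"]}/{route["name"]}')


def find_conflicts(routes: list) -> dict:
    """Map each losing "namespace/name" to the index keys it lost."""
    index = defaultdict(list)
    for route in routes:
        # A set keeps a route from conflicting with its own repeated match.
        keys = {
            (route["gateway"], host, path_type, path_value)
            for host in route["hostnames"]
            for path_type, path_value in route["paths"]
        }
        for key in keys:
            index[key].append(route)

    lost = defaultdict(list)
    for key, claimants in index.items():
        # The first route in precedence order wins the key; the rest lose it.
        for loser in sorted(claimants, key=precedence_key)[1:]:
            lost[f'{loser["namespace"]}/{loser["name"]}'].append(key)
    return dict(lost)


routes = [
    {"namespace": "user2", "name": "echo", "created": "2025-12-02T18:13:39Z",
     "gateway": "gateway", "hostnames": ["some.example.tld"],
     "paths": [("Exact", "/")]},
    {"namespace": "user1", "name": "echo", "created": "2025-12-02T18:13:39Z",
     "gateway": "gateway", "hostnames": ["some.example.tld"],
     "paths": [("Exact", "/")]},
]
conflicts = find_conflicts(routes)
```

For the manifest in the issue this reports user2/echo as the loser for the single exact-path key, which is the information the route status currently does not surface.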

rikatz, Dec 05 '25 18:12