Routing broken when you have a misconfigured VirtualService and restart the gloo pod
Gloo Edge Version
1.14.x (latest stable)
Kubernetes Version
None
Describe the bug
Hi,
I am testing this on 1.14.6. I have a Gloo environment in which everything is working and `glooctl check` reports OK. If I add a conflict between VirtualServices and restart Gloo, the VirtualServices that were accessible before are no longer accessible:
- Deploy an HTTPS VirtualService with a missing SSL secret and restart the Gloo pod -> all HTTPS VirtualServices were unreachable
- Deploy an HTTP VirtualService with a duplicate domain and restart the Gloo pod -> all HTTP VirtualServices were unreachable
- Deploy an HTTP VirtualService with a duplicate domain and add an unrelated, correct VirtualService -> new VirtualService not reachable
Steps to reproduce the bug
I have uploaded a script to reproduce the bug:
1. `kubectl apply -f working_example` (see attachment)
2. `curl $(glooctl proxy url)/edu` -> reachable
3. `kubectl apply -f vs_duplicate.yaml`
4. Restart the Gloo pod
5. The correct VirtualService from the test setup is not reachable anymore
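For illustration, a duplicate-domain VirtualService along the lines of `vs_duplicate.yaml` could look like the sketch below (names, domain, and upstream are made up for this example, not the actual attachment):

```yaml
# Hypothetical stand-in for vs_duplicate.yaml: a second VirtualService
# claiming a domain that an existing, working VirtualService already serves.
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: vs-duplicate
  namespace: gloo-system
spec:
  virtualHost:
    domains:
      - "example.com"          # same domain as the already-working VirtualService
    routes:
      - matchers:
          - prefix: /
        routeAction:
          single:
            upstream:
              name: some-upstream    # illustrative upstream name
              namespace: gloo-system
```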
Expected Behavior
Changes that report errors should not break routing.
Additional Context
No response
Related Issues:
- https://github.com/solo-io/gloo/issues/7976
- https://github.com/solo-io/gloo/issues/5720
- https://github.com/solo-io/gloo/issues/6115
This is a dupe
Dupe of #7976 ?
Treating this as a separate issue. It seems related to #7976, but that ticket focuses more on invalid routes in delegated RouteTables, whereas this ticket has clear requirements regarding the behaviour on misconfigured VirtualServices.
As discussed today, the expected behaviour for the scenarios listed in the description is:
- When you deploy an HTTP VirtualService with a duplicate domain and restart the Gloo pod:
  Current behaviour: all HTTP VirtualServices were unreachable.
  Expected behaviour: all HTTP VirtualServices are reachable, except for the VirtualServices with a duplicate domain.

  > [!NOTE]
  > Ideally only the VirtualService with the duplicate domain that was applied last is not reachable, and the VS that was created first is accepted and routable.

- When you deploy an HTTPS VirtualService with a missing SSL secret and restart the Gloo pod:
  Current behaviour: all HTTPS VirtualServices were unreachable.
  Expected behaviour: all HTTPS VirtualServices are reachable, except for the misconfigured VirtualService.

- When you deploy an HTTP VirtualService with a duplicate domain and add an unrelated, correct VirtualService:
  Current behaviour: the new VirtualService is not reachable.
  Expected behaviour: the new VirtualService is available.
The TL;DR is that adding a single incorrect/misconfigured VirtualService breaks Gloo Edge's dynamic updates, as no new VirtualServices are accepted, and when the gateway pod restarts, it breaks the entire environment, as no VirtualServices will be reachable.
Additional context:
In my local minikube with Gloo 1.16 I could validate that the following situations do not put Gloo in this unstable state:
- VirtualService referencing missing Upstream
- VirtualService referencing missing AuthConfig
- VirtualService referencing missing RouteTable
In general, we improved the robustness of validation in 1.15 and think the best place to be is with validation on. Previously we would miss some updates and end up in a bad state. Getting into a bad state is much less likely now, and for cases where you do end up in one, we have also added toggles that act as escape valves.
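For anyone hitting this, the escape valves mentioned above are configured on the Settings resource. A sketch of what that could look like (field names based on the Gloo Edge gateway validation settings; please double-check against the API reference for your version):

```yaml
apiVersion: gloo.solo.io/v1
kind: Settings
metadata:
  name: default
  namespace: gloo-system
spec:
  gateway:
    validation:
      # Accept resources that only produce warnings
      # (relevant if a check is downgraded from error to warning).
      allowWarnings: true
      # Escape valve: accept resources even when validation reports errors.
      # Useful to recover from a bad state, but it disables the protection
      # the webhook normally provides.
      alwaysAccept: true
```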
For the above 3 options, we believe that duplicate domains can be downgraded to a warning, allowing an actual translation to take place without strict validation, instead of the current behavior.
The missing-SSL-secret-plus-restart scenario is more of a possible larger extension: a variant of invalid route replacement.
Hi there, customer here. We would really appreciate solving this primarily without validation.
Reasons:
- We want to be able to restart all our pods at any time. Gloo is pretty much the only Kubernetes extension where this is not possible, as we have to manually ensure there are no errors first. Oftentimes we need to fix resources that are owned by our customers (=tenants).
- Order. One of the basic premises of Kubernetes is that I only specify my intent and the system figures out the correct way to realize that intent. A webhook would require creating, updating, and deleting resources in the necessary order.
- Undefined behavior. Picture this: a VirtualService references a TLS secret created by cert-manager. Now cert-manager wants to recreate this secret because the certificate is outdated. What's the correct behavior of the webhook? Blocking the deletion of the TLS secret? That can't be the solution, as we need up-to-date certs. Allowing it? That puts Gloo in a bad state. How would the webhook know that a new TLS secret will be created right after?
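To make the rotation scenario concrete, here is a sketch of the resource pair involved (all names are illustrative): cert-manager owns the Secret via a Certificate, while the VirtualService references that Secret only by name.

```yaml
# cert-manager owns the Secret and may delete and recreate it on renewal.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example-cert
  namespace: gloo-system
spec:
  secretName: example-tls        # the Secret cert-manager manages
  dnsNames:
    - example.com
  issuerRef:
    name: my-issuer              # illustrative issuer
    kind: ClusterIssuer
---
# The VirtualService references the Secret by name; from this reference
# alone, a webhook cannot tell whether a deleted Secret is gone for good
# or about to be recreated by cert-manager moments later.
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: https-vs
  namespace: gloo-system
spec:
  sslConfig:
    secretRef:
      name: example-tls
      namespace: gloo-system
  virtualHost:
    domains:
      - "example.com"
```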
Especially bullet point 3 makes me believe that a webhook is not the right approach to solve these current bugs. It can definitely be helpful as a debugging tool, but it should not be the last resort for something so crucial to the whole system.
I would really like to understand why you can't just ignore the faulty resources. If they can be statically validated inside a webhook, why can't the same happen inside the Gloo dynamic update loop?