Routing broken when you have a misconfigured VirtualService and restart the gloo pod
Gloo Edge Version
1.14.x (latest stable)
Kubernetes Version
None
Describe the bug
Hi,
I am testing this on 1.14.6. I have a Gloo environment in which everything is working and `glooctl check` reports OK. If I add a conflict between VirtualServices and restart Gloo, the VirtualServices that were accessible before are no longer accessible:
- Deploy an HTTPS VirtualService with a missing SSL secret and restart the Gloo pod -> all HTTPS VirtualServices were unreachable
- Deploy an HTTP VirtualService with a duplicate domain and restart the Gloo pod -> all HTTP VirtualServices were unreachable
- Deploy an HTTP VirtualService with a duplicate domain and add an unrelated, correct VirtualService -> new VirtualService not reachable
Steps to reproduce the bug
I have uploaded a script to reproduce the bug:
1. `kubectl apply -f working_example` (see attachment)
2. `curl $(glooctl proxy url)/edu` -> reachable
3. `kubectl apply -f vs_duplicate.yaml`
4. Restart the Gloo pod
5. The correct VirtualService from the test setup is not reachable anymore
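For illustration, a duplicate-domain VirtualService along the lines of `vs_duplicate.yaml` could look like the sketch below (names, domain, and upstream are made up for this example, not the actual attachment):

```yaml
# Hypothetical stand-in for vs_duplicate.yaml: a second VirtualService
# claiming a domain that an existing, working VirtualService already serves.
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: vs-duplicate
  namespace: gloo-system
spec:
  virtualHost:
    domains:
      - "example.com"          # same domain as the already-working VirtualService
    routes:
      - matchers:
          - prefix: /
        routeAction:
          single:
            upstream:
              name: some-upstream    # illustrative upstream name
              namespace: gloo-system
```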
Expected Behavior
Changes that report errors should not break routing.
Additional Context
No response
Related Issues:
- https://github.com/solo-io/gloo/issues/7976
- https://github.com/solo-io/gloo/issues/5720
- https://github.com/solo-io/gloo/issues/6115
This is a dupe
Dupe of #7976 ?
Treating this as a separate issue. It seems related to #7976, but that ticket focuses more on invalid routes in delegated RouteTables, whereas this ticket has clear requirements regarding the behaviour on misconfigured VirtualServices.
As discussed today, the expected behaviour for the scenarios listed in the description is:
- When you deploy an HTTP VirtualService with a duplicate domain and restart the Gloo pod:
  Current behaviour: all HTTP VirtualServices were unreachable.
  Expected behaviour: all HTTP VirtualServices are reachable, except for the VirtualServices with a duplicate domain.

  > [!NOTE]
  > Ideally only the VirtualService with the duplicate domain that was applied last is not reachable, and the VS that was created first is accepted and routable.

- When you deploy an HTTPS VirtualService with a missing SSL secret and restart the Gloo pod:
  Current behaviour: all HTTPS VirtualServices were unreachable.
  Expected behaviour: all HTTPS VirtualServices are reachable, except for the misconfigured VirtualService.

- When you deploy an HTTP VirtualService with a duplicate domain and add an unrelated, correct VirtualService:
  Current behaviour: the new VirtualService is not reachable.
  Expected behaviour: the new VirtualService is available.
The TL;DR is that adding a single incorrect/misconfigured VirtualService breaks Gloo Edge's dynamic updates, as no new VirtualServices are accepted, and when the gateway pod restarts, it breaks the entire environment, as no VirtualServices will be reachable.
Additional context:
In my local minikube with Gloo 1.16 I could validate that the following situations do not put Gloo in this unstable state:
- VirtualService referencing missing Upstream
- VirtualService referencing missing AuthConfig
- VirtualService referencing missing RouteTable
In general, we improved the robustness of validation in 1.15 and think the best place to be is with validation on. Previously we would miss some updates and end up in a bad state. Getting into a bad state is much less likely now, and for cases where you do end up in one, we have also added toggles that act as escape valves.
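For anyone hitting this, the escape valves mentioned above are configured on the Settings resource. A sketch of what that could look like (field names based on the Gloo Edge gateway validation settings; please double-check against the API reference for your version):

```yaml
apiVersion: gloo.solo.io/v1
kind: Settings
metadata:
  name: default
  namespace: gloo-system
spec:
  gateway:
    validation:
      # Accept resources that only produce warnings
      # (relevant if a check is downgraded from error to warning).
      allowWarnings: true
      # Escape valve: accept resources even when validation reports errors.
      # Useful to recover from a bad state, but it disables the protection
      # the webhook normally provides.
      alwaysAccept: true
```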
For the above 3 options, we believe that duplicate domains can be downgraded to a warning, allowing an actual translation to take place without strict validation, instead of the current behavior.
The missing-SSL-secret-plus-restart scenario is more of a possible larger extension: a variant of invalid route replacement.
Hi there, customer here. We would really appreciate solving this primarily without validation.
Reasons:
- We want to be able to restart all our pods at any time. Gloo is pretty much the only Kubernetes extension where this is not possible, as we have to manually ensure there are no errors first. Oftentimes we need to fix resources that are owned by our customers (=tenants).
- Order. One of the basic premises of Kubernetes is that I only specify my intent and the system figures out the correct way to realize that intent. A webhook would require creating, updating, and deleting resources in the necessary order.
- Undefined behavior. Picture this: a VirtualService references a TLS secret created by cert-manager. Now cert-manager wants to recreate this secret because the certificate is outdated. What's the correct behavior of the webhook? Blocking the deletion of the TLS secret? That can't be the solution, as we need up-to-date certs. Allowing it? That puts Gloo in a bad state. How would the webhook know that a new TLS secret will be created right after?
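To make the rotation scenario concrete, here is a sketch of the resource pair involved (all names are illustrative): cert-manager owns the Secret via a Certificate, while the VirtualService references that Secret only by name.

```yaml
# cert-manager owns the Secret and may delete and recreate it on renewal.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example-cert
  namespace: gloo-system
spec:
  secretName: example-tls        # the Secret cert-manager manages
  dnsNames:
    - example.com
  issuerRef:
    name: my-issuer              # illustrative issuer
    kind: ClusterIssuer
---
# The VirtualService references the Secret by name; from this reference
# alone, a webhook cannot tell whether a deleted Secret is gone for good
# or about to be recreated by cert-manager moments later.
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: https-vs
  namespace: gloo-system
spec:
  sslConfig:
    secretRef:
      name: example-tls
      namespace: gloo-system
  virtualHost:
    domains:
      - "example.com"
```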
Especially bullet point 3 makes me believe that a webhook is not the right approach to solve these current bugs. It can definitely be helpful as a debugging tool, but it should not be the last resort for something so crucial to the whole system.
I would really like to understand why you can't just ignore the faulty resources. If they can be statically validated inside a webhook, why can't the same happen inside the Gloo dynamic update loop?