Envoy can serve 503 when transitioning between services.
What steps did you take and what happened: In Knative we have an automated test that:
- Creates a Pod and Service (waits for Endpoints to have ready addresses)
- Creates a KIngress (which github.com/mattmoor/net-contour turns into HTTPProxy resources)
Once the KIngress is "Ready", the test starts a prober, which spins checking for 200s to ensure traffic isn't dropped.
The test then updates the KIngress (and therefore the HTTPProxy resources) N times (I think N is 10), as follows:
- Create a new Pod and Service (waits for Endpoints to have Ready addresses)
- Update the KIngress to shift traffic over to the new Service
- When the KIngress reports "Ready", check that it serves the right message for the new Pod.
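For concreteness, here is a minimal sketch of the kind of prober loop described above (this is not the actual Knative test code; the URL, polling interval, and helper are invented for illustration):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// probe polls url until stop is closed and records every response code that
// is not a 200, roughly what the prober described above is doing.
func probe(url string, stop <-chan struct{}) []int {
	var unexpected []int
	for {
		select {
		case <-stop:
			return unexpected
		default:
		}
		resp, err := http.Get(url)
		if err != nil {
			unexpected = append(unexpected, 0) // record transport errors as 0
		} else {
			if resp.StatusCode != http.StatusOK {
				unexpected = append(unexpected, resp.StatusCode) // e.g. the 503s below
			}
			resp.Body.Close()
		}
		time.Sleep(100 * time.Millisecond)
	}
}

func main() {
	stop := make(chan struct{})
	go func() {
		time.Sleep(30 * time.Second) // let the rollout loop run for a while
		close(stop)
	}()
	fmt.Println("unexpected statuses:", probe("http://ingress.example.com/", stop))
}
```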
You can see from the test log (annotated to point to stuff) that the 503s happen exactly when things cut over:
update_test.go:188: [update-vkizlzpb] Got OK status!
update_test.go:188: [update-vkizlzpb] Got OK status!
update_test.go:188: [update-vkizlzpb] Got OK status!
update_test.go:188: [update-vkizlzpb] Got OK status!
update_test.go:188: [update-vkizlzpb] Got OK status!
update_test.go:122: Rolling out "update-jxivnwzl" w/ "update-efvbziod" <-- We will start to see [update-efvbziod] soon!
update_test.go:188: [update-vkizlzpb] Got OK status!
update_test.go:188: [update-vkizlzpb] Got OK status!
update_test.go:188: [update-vkizlzpb] Got OK status!
update_test.go:188: [update-vkizlzpb] Got OK status!
util.go:844: Got unexpected status: 503, expected map[200:{}]
util.go:844: HTTP/1.1 503 Service Unavailable
Date: Fri, 24 Jan 2020 02:44:32 GMT
Server: envoy
Content-Length: 0
update_test.go:188: [update-efvbziod] Got OK status! <-- After the 503, the new version is hit.
update_test.go:188: [update-efvbziod] Got OK status!
update_test.go:188: [update-efvbziod] Got OK status!
update_test.go:188: [update-efvbziod] Got OK status!
update_test.go:188: [update-efvbziod] Got OK status!
What did you expect to happen:
This test was intended to validate that updates to the KIngress resource (and therefore the underlying system) are "hitless" (without downtime).
In Knative, the default deployment mechanism rolls traffic over to the latest deployed snapshot of the user's code (aka a Revision) as soon as it's ready, so this aggressive update-and-rollout pattern isn't that contrived.
Anything else you would like to add:
I added a variant of the test that keeps the same Pod/Service and just rotates the programming (which header to set) and AFAIK it has never failed. This was a black-box attempt to isolate the problem to Endpoints changes.
I think @jpeach has a probably more accessible repro case for this, I'll let him share.
Environment:
- Contour version: 1.1 / HEAD
- Kubernetes version: 1.15
- Kubernetes installer & version: GKE
- Cloud provider or hardware configuration: Google
- OS (e.g. from /etc/os-release): I run the test harness on my macbook, but the above snippet is from Prow (some Linux container).
Test configuration deployment: https://github.com/jpeach/kustomize/tree/master/knative/configurations/contour
I had a play with this this morning and have (for me) a simpler reproducer.
a. deploy httpbin
b. duplicate the httpbin service object, call it httpbin2
c. in a loop, edit the k8s ingress object to switch between httpbin and httpbin2. You could also use HTTPProxy.
Fundamentally this is repeatedly changing the name of the service for the route /
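For illustration, a rough sketch of that edit loop in Go using client-go (the Ingress name, namespace, and rule layout are assumptions; a shell loop with kubectl works just as well):

```go
package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Alternate the backend Service for the "/" route between httpbin and httpbin2
	// while a load test runs against the ingress.
	services := []string{"httpbin", "httpbin2"}
	for i := 0; ; i++ {
		ing, err := client.NetworkingV1().Ingresses("default").Get(context.TODO(), "httpbin", metav1.GetOptions{})
		if err != nil {
			panic(err)
		}
		// Point the first rule's first path at the next Service in the rotation.
		ing.Spec.Rules[0].HTTP.Paths[0].Backend.Service.Name = services[i%len(services)]
		if _, err := client.NetworkingV1().Ingresses("default").Update(context.TODO(), ing, metav1.UpdateOptions{}); err != nil {
			panic(err)
		}
		time.Sleep(5 * time.Second)
	}
}
```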
What I observed, with a 5 qps load test running against the ingress:
Status code distribution:
[200] 1148 responses
[503] 52 responses
What I think is happening is:
- When the DAG is rebuilt, the cluster for default/httpbin is no longer referenced, so it is removed from CDS. At the same time the cluster for default/httpbin2, previously unreferenced, is added to CDS. This change is atomic from the POV of Envoy.
- As the httpbin2 cluster was previously unknown to Envoy, it at least has to open an EDS connection for the default/httpbin2 cluster and prepare those endpoints. During this window the cluster has no healthy endpoints to serve the request, so Envoy returns 503.
[2020-01-27 00:55:53.624][1][info][upstream] [source/common/upstream/cds_api_impl.cc:67] cds: add 2 cluster(s), remove 3 cluster(s)
[2020-01-27 00:55:53.625][1][info][upstream] [source/common/upstream/cds_api_impl.cc:83] cds: add/update cluster 'default/httpbin/8080/da39a3ee5e'
[2020-01-27 00:55:53.625][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:613] removing cluster default/httpbin2/8080/da39a3ee5e
[2020-01-27 00:55:53.625][1][info][upstream] [source/common/upstream/cds_api_impl.cc:94] cds: remove cluster 'default/httpbin2/8080/da39a3ee5e'
[2020-01-27 00:55:53.930][1][info][upstream] [source/common/upstream/cds_api_impl.cc:67] cds: add 2 cluster(s), remove 3 cluster(s)
[2020-01-27 00:55:53.930][1][info][upstream] [source/common/upstream/cds_api_impl.cc:83] cds: add/update cluster 'default/httpbin2/8080/da39a3ee5e'
[2020-01-27 00:55:53.931][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:613] removing cluster default/httpbin/8080/da39a3ee5e
[2020-01-27 00:55:53.931][1][info][upstream] [source/common/upstream/cds_api_impl.cc:94] cds: remove cluster 'default/httpbin/8080/da39a3ee5e'
This phase can be extended with things like cluster warmup and health checks. I did not investigate this.
- repeat for the transition from httpbin2 -> httpbin
ISTM that the core of the issue is that when an old cluster is replaced with a new cluster (a cluster is just shorthand for a port on a k8s service), the old cluster immediately winks out of existence at the same time as the new cluster appears. The old cluster going away is not a big deal because it's attached to an older version of the Envoy route configuration. The new cluster appearing and being immediately expected to handle traffic is the root cause.
Part of the problem is that clusters which are not actively referenced by a valid HTTPProxy, IngressRoute, or Ingress object never make it into CDS. This is for several reasons:
- Reduced CDS size and update frequency. Obviously k8s contains many services which are not part of an HTTP application; filtering those out at various levels reduces the CDS update rate (see #499).
- For security reasons we shouldn't make services which are not referenced by a valid ingress available in the CDS tables -- none of the services in kube-system, for example, are present in CDS.
The bottom line is, when the DAG is rebuilt, if there is not a reference from a valid ingress-type object to a service, it will be excluded from CDS. The problem is, as soon as that reference exists, the cluster appears in CDS. This appears in two permutations:
a. Vhost A points / to cluster A. CDS is updated to remove cluster A and introduce cluster B, then RDS is updated to point Vhost A / to cluster B. Between the first and second operations RDS points to cluster A, which has been removed from CDS. Also, Envoy may be in the process of querying EDS for the endpoints of cluster B.
b. Vhost A points / to cluster A. RDS is updated to point Vhost A / to cluster B. CDS is updated to remove cluster A and introduce cluster B. Between the first and second operations RDS points to cluster B, which is not present. When cluster B arrives, Envoy has to find its endpoints before traffic can be served.
The most obvious solution to me is that cluster B needs to be introduced and warmed before RDS is changed. However, inferring this action from the snapshots we take of the state of the k8s API during DAG build is complicated. Adding extra entries to CDS because they will shortly be referenced by RDS is a difficult problem to solve.
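To make the ordering concrete, here is a hypothetical sketch of that "introduce and warm before RDS changes" sequence; pushCDS, pushRDS, and clusterWarm are invented stand-ins for the management server's plumbing, not Contour APIs:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Hypothetical xDS plumbing for the sketch below.
func pushCDS(ctx context.Context, clusters []string) error {
	fmt.Println("CDS:", clusters)
	return nil
}

func pushRDS(ctx context.Context, path, cluster string) error {
	fmt.Println("RDS:", path, "->", cluster)
	return nil
}

func clusterWarm(ctx context.Context, cluster string) bool { return true }

// switchRoute repoints a route from oldCluster to newCluster without ever
// letting RDS reference a cluster that Envoy has not yet warmed.
func switchRoute(ctx context.Context, path, oldCluster, newCluster string) error {
	// 1. Publish a CDS update that contains BOTH clusters.
	if err := pushCDS(ctx, []string{oldCluster, newCluster}); err != nil {
		return err
	}
	// 2. Wait until the new cluster has warmed (EDS endpoints loaded).
	for !clusterWarm(ctx, newCluster) {
		time.Sleep(100 * time.Millisecond)
	}
	// 3. Only now repoint the route.
	if err := pushRDS(ctx, path, newCluster); err != nil {
		return err
	}
	// 4. Drop the old cluster in a later CDS update, once no route references it.
	return pushCDS(ctx, []string{newCluster})
}

func main() {
	_ = switchRoute(context.Background(), "/", "default/httpbin/8080", "default/httpbin2/8080")
}
```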
xref: #1178
Just wanted to check in on this and see if there had been any epiphanies here. This test scenario is still red for us, and I'd really like to see us get that resolved before we start to advocate this seriously to users. Is there any hope for this in the 1.3 timeframe?
I think moving to a single cluster and using locality weights might be a solution here. We need to do that anyway to solve https://github.com/projectcontour/contour/issues/1119, so this might be a good time to think through how that might work to remove the Envoy WeightedClusters (envoy_api_v2_route.WeightedCluster).
I did some playing around with how this might work today, but it needs some discussion as it is a pretty big change to how Contour currently works.
This needs to be a proper design doc, but wanted to get my thoughts out there quickly now to discuss:
Today Contour mirrors Kubernetes Services with Envoy Clusters, but only creates clusters in Envoy (as @davecheney mentioned earlier) when they are referenced. What if these clusters were created based upon the route they are used for? Then when Endpoints change, Contour would just need to update EDS for the corresponding "route" cluster. This does potentially mean that the same Service referenced from two different ingress objects would be duplicated, but I don't think this is a common scenario.
When a Service change happens, instead of Contour needing to delete the old Envoy cluster and spin up the new one, it would just need to swap endpoints for the existing routing cluster. Since the old cluster's endpoints remain available during the switch, requests would still be handled. Contour would still only create clusters for referenced Services in Kubernetes, but there needs to be a way to map Endpoints to this new route-cluster idea.
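To illustrate the idea, a toy sketch of a route-scoped cluster whose backing Service is swapped purely through EDS (all names and types here are invented for illustration, not Contour code):

```go
package main

import "fmt"

// Endpoint is a minimal stand-in for an EDS LbEndpoint.
type Endpoint struct {
	Address string
	Port    int
}

// edsTable maps route-scoped cluster names to their current endpoints; in a
// real implementation this would be the EDS cache that Envoy subscribes to.
var edsTable = map[string][]Endpoint{}

// routeClusterName derives a stable cluster name from the route that uses it,
// rather than from the Kubernetes Service it currently points at.
func routeClusterName(namespace, ingress, host, path string) string {
	return fmt.Sprintf("%s/%s/%s%s", namespace, ingress, host, path)
}

// switchBackend re-points a route-scoped cluster at a different Service's
// endpoints. Only the EDS payload changes; CDS and RDS stay untouched, so
// there is no window where the route references an unwarmed cluster.
func switchBackend(cluster string, endpoints []Endpoint) {
	edsTable[cluster] = endpoints
}

func main() {
	c := routeClusterName("default", "httpbin", "example.com", "/")
	switchBackend(c, []Endpoint{{Address: "10.0.0.5", Port: 8080}}) // httpbin pods
	switchBackend(c, []Endpoint{{Address: "10.0.0.9", Port: 8080}}) // httpbin2 pods
	fmt.Println(c, edsTable[c])
}
```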
Secondarily, I'd like to use this "single cluster" model to apply service weights and combine endpoints together in the same cluster, but that should be a separate piece of work since other issues come into play when implementing that.
Thoughts @davecheney?
@stevesloka Were you ever able to get our e2e testing to verify your PoC?
Hey @mattmoor, I got the tests up and running. My PoC needs some more work to get working. The idea is there, but the implementation isn't quite there yet to validate it all.
Ok, well once you have a PoC that can run the update test cleanly it'd be a useful tool to ensure that some of the other redness we are seeing with Contour isn't due to this.
What are the other issues?
Haven't triaged them yet because they are outside of ingress conformance, but I think they generally manifest as 503s, so it's tough to tell. Most are in tests dealing with some form of rollout, which is where this issue flares up, so it hasn't seemed worth sinking the time in yet.
So with the scheme I just implemented in our net-contour controller, before we update our core HTTPProxy resources we first create an "endpoint probe" configuration that creates a virtual host per K8s service referenced in the new spec. It then probes each of these services for readiness before finally going back and updating the core HTTPProxy resources.
This guarantees that every single Envoy pod has the Endpoints data before the HTTPProxy resources that are actually serving live traffic are updated. This improves things a lot, but even with this we are still seeing 503s.
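For reference, a hypothetical sketch of that two-pass flow (ensureProbeProxy, probeReady, and updateLiveProxies are invented stand-ins, not the actual net-contour code):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Hypothetical stand-ins for the net-contour controller's logic.
func ensureProbeProxy(ctx context.Context, svc string) error {
	fmt.Println("creating probe vhost for", svc)
	return nil
}

func probeReady(ctx context.Context, svc string) bool { return true }

func updateLiveProxies(ctx context.Context, svcs []string) error {
	fmt.Println("updating live HTTPProxies for", svcs)
	return nil
}

// rollout programs Contour in two passes: first force every Envoy to learn the
// Endpoints of each new Service via throwaway probe virtual hosts, then update
// the HTTPProxy resources that actually serve live traffic.
func rollout(ctx context.Context, newServices []string) error {
	for _, svc := range newServices {
		if err := ensureProbeProxy(ctx, svc); err != nil {
			return err
		}
	}
	for _, svc := range newServices {
		for !probeReady(ctx, svc) { // poll each probe vhost until it answers 200
			time.Sleep(100 * time.Millisecond)
		}
	}
	return updateLiveProxies(ctx, newServices)
}

func main() {
	_ = rollout(context.Background(), []string{"update-efvbziod"})
}
```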
If I make our e2e tests run 5x and re-enable TestUpdate, it fails at least once on each run. I've been trying to make sense of things by aligning timestamps across the failures, Contour logs (which are remarkably terse; I've been instrumenting a fair amount 😞), and Envoy logs.
I can say with some level of certainty that the 503s correlate pretty strongly with a flurry of gRPC activity between Envoy and Contour, but I'm still learning to interpret the couple of lines of Envoy logs I keep seeing over and over, so I haven't yet noticed a strong correlation there.
Turning on the Envoy debug logging is a bit more useful. For every 503 NR, we hit one of these log statements (which others may find obvious 😅). This let me look at the lifecycle of the cluster in question, and it is very clearly serving a 503 based on the cluster that's going out vs. the cluster that's coming in.
I think that what I am seeing now is actually the opposite of what I fixed above. It isn't a problem with the new Endpoints coming into existence, but the old Cluster leaving existence before it has been removed from all of the routing. 🤔
I will try to hack something together to confirm this, perhaps I can abuse the Endpoint probing to preserve both sets of clusters from before the rollout until it is totally complete (as measured by our prober).
With a PoC of this, I now have two clean runs of the full Knative networking conformance (each running 5x) without any 503s. I also have two more runs of 5x just TestUpdate without 503s.
I will try to get this cleaned up and checked in so the continuous build can beat on it, but I'm fairly optimistic 🤞
@mattmoor I have reason to believe this is causing problems in the use of Contour that I've been involved with. From the last comment it seemed you were positive and had a solution, but it never made it into the project. Can you provide an update?
My PoC was working around this downstream in Knative's ingress implementation layered on top of Contour, which takes advantage of some readiness-probing logic we have in Knative to ensure networking programming has been rolled out, and then programs Contour in two passes.
I put together some slides on this here (it is pretty technical and assumes a fair amount of Knative knowledge). Slides 6-8 specifically speak to this problem.
I haven't been involved with Contour for at least a year now, so maybe @dprotaso can connect you with someone with more recent experience here.
I think that @sunjayBhatia is working on our xDS handling code, to see if we can make it handle this use case a bit better. Maybe he has some more info here?
Ultimately we will likely need to use ADS with SOTW or incremental xDS, unless we want to take on coordinating ordered updates across the various xDS resource streams.
The Contour project currently lacks enough contributors to adequately respond to all Issues.
This bot triages Issues according to the following rules:
- After 60d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, the Issue is closed
You can:
- Mark this Issue as fresh by commenting
- Close this Issue
- Offer to help out with triage
Please send feedback to the #contour channel in the Kubernetes Slack
/lifecycle frozen