contour icon indicating copy to clipboard operation
contour copied to clipboard

Wait to discover all K8s resources before sending xDS responses

Open phylake opened this issue 5 years ago • 6 comments

As far as I can tell Contour will send whatever resources it has whenever xDS DiscoveryRequests come in. This can cause xDS resources to be torn down in Envoy on a Contour restart. Contour should do a synchronous (read: eager and blocking) lookup of its CRDs, Ingresses, Services, Endpoints, Secrets, etc. on startup before responding to DiscoveryRequests.

This is causing a variety of issues depending on the resource. Examples include:

  1. Missing EDS causes 503s
  2. Missing RDS causes 404s
  3. Missing SDS causes SSL handshake errors

/cc @bryanlatten @lrouquette


Tasks

  • [x] Wait for k8s informer caches to sync
  • [ ] Wait for dag to be fully built before starting gRPC server

Blocked

  • [x] Leader election #1204

phylake avatar Jul 29 '19 21:07 phylake

This is different than #1178 which is about ordering. Even if #1178 is fixed Contour will still send only a subset of resources, subject to K8s informer timing, and causes 404s in Envoy when Contour restarts because Contour hasn't discovered all the resources in K8s that it previously programmed Envoy with

Example for CDS:

Contour1 is Contour 1 Contour2 is Contour 2 (i.e. the Contour that starts after Contour1 dies) Cluster1 is some upstream service Cluster2 is some other upstream service

Contour1 discovers Cluster1 and responds to CDS with it. It later discovers Cluster2 and responds to CDS with Cluster1 and Cluster2. We'll consider Envoy "fully programmed" for the purpose of this example. Contour1 is terminated for some reason (upgrade, node death, etc.)

Contour2 discovers Cluster1 and responds to CDS with it. Note that Cluster2 is missing which means Envoy will remove it. From the docs:

LDS/CDS resource hints will always be empty and it is expected that the management server will provide the complete state of the LDS/CDS resources in each response. An absent Listener or Cluster will be deleted.

Envoy will also remove the corresponding EDS and RDS.

When a Listener or Cluster is deleted, its corresponding EDS and RDS resources are also deleted inside the Envoy instance.

Whatever cluster (and corresponding resources) was described by Cluster2 now returns 404s until Contour2 discovers Cluster2 and programs Envoy with Cluster1 and Cluster2. Assuming #1178 is fixed then EDS and RDS will come in order per the xDS protocol spec.

This problem isn't exclusive to CDS. It's a general problem that needs addressed for all K8s resources that can end up in xDS responses.

phylake avatar Jul 31 '19 18:07 phylake

Seemingly related: https://github.com/heptio/contour/issues/1286

bryanlatten avatar Aug 01 '19 14:08 bryanlatten

Yup, we do wait for the caches to sync up (https://github.com/heptio/contour/blob/master//cmd/contour/serve.go#L383), but we're not blocking on the gRPC connection to xDS.

stevesloka avatar Aug 07 '19 19:08 stevesloka

Sure, you're sync-ing on an empty cache (the SharedInformer Store is empty), so you're not going to wait very long. We implemented a synchronous discovery of the resources, but that has some ramifications preventing me from filing a PR as-is. I think what needs to happen here is something like this:

  1. Start the SharedInformer
  2. Wait for it to Sync
  3. Wait for the cache to sync
  4. Then, and only then, start the gRPC server

I'll see if I can come up with something

lrouquette avatar Aug 07 '19 21:08 lrouquette

Moving tentatively to the next milestone. The plan is to not bring up the grpc server until the shared informer has synced and we have populated the grpc cache at least once.

davecheney avatar Aug 19 '19 02:08 davecheney

In #1765 we added some logic to have the caches sync before starting the gRPC server, however, it's possible that the final DAG still isn't yet built. Leaving this as the open remaining task.

stevesloka avatar Oct 24 '19 15:10 stevesloka

this issue is seriously affecting us as we have many HTTPProxy and it will cause some minutes of downtime with every contour restart!

$ oc get httpproxy -A | wc -l
3407

m-yosefpor avatar Aug 06 '23 08:08 m-yosefpor