
Endpoint(Slice)Translator may not be fully synced before xDS server starts

Open sunjayBhatia opened this issue 1 year ago • 2 comments

Order of events:

  • Contour starts up
  • Informers are initialized, and the endpoint translator is set up as the event handler for endpoints
  • The event handler and related components are started
  • OnAdd is called for the initial resources in the cluster
    • Non-Endpoints resources are passed off to the event handler loop to be processed to build the dag
    • Endpoints are handled per call: each OnAdd recalculates the endpoints to send to Envoy based on the recorded clusters and notifies any watchers. In theory there are no watchers yet because the xDS server has not started, but we have no coordination here, so we don't know that all the initial OnAdd calls finish before the server starts, and we could be sending notifications about endpoints before everything is ready
  • Once the initial list completes and the event handler has processed all resources, the dag is built
    • The endpoint translator is a dag Observer, so it reacts to the dag build and recalculates the endpoints to send to Envoy based on the recorded clusters
    • At this point we don't know whether the endpoint translator has seen and processed all of the endpoint resources in the cluster, so it could be reacting to OnChange with an incomplete set of resources
    • The endpoint translator has no way to check whether all initial resources have been synced from the informer and processed, so unlike the event handler it cannot poll informer sync status or block the startup of the xDS server
  • The event handler's HasBuiltInitialDag now returns true
  • The xDS server can now start
    • On the first request, our version of the xDS server responds immediately with whatever resources are available, without waiting for a change notification from the xDS cache, so we could be sending an incomplete set of endpoints
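
One possible shape for the missing coordination, as a minimal sketch rather than Contour's actual code: the translator tracks how many initial-list items it still has to process and exposes a HasSynced check, and xDS startup waits on both the dag build and that check. Every name here (endpointTranslator, newEndpointTranslator, waitForStartup, the inInitialList flag, and the idea that the initial item count is known up front) is an assumption made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// endpointTranslator is a hypothetical stand-in for the real translator.
type endpointTranslator struct {
	mu        sync.Mutex
	endpoints map[string][]string // service name -> addresses
	pending   int                 // initial-list items not yet processed
	synced    chan struct{}       // closed once pending reaches zero
}

// newEndpointTranslator assumes the size of the initial list is known.
func newEndpointTranslator(initialCount int) *endpointTranslator {
	t := &endpointTranslator{
		endpoints: make(map[string][]string),
		pending:   initialCount,
		synced:    make(chan struct{}),
	}
	if initialCount == 0 {
		close(t.synced) // nothing to wait for
	}
	return t
}

// OnAdd records one object and, for initial-list items, counts it down.
func (t *endpointTranslator) OnAdd(name string, addrs []string, inInitialList bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.endpoints[name] = addrs
	if inInitialList && t.pending > 0 {
		t.pending--
		if t.pending == 0 {
			close(t.synced)
		}
	}
}

// HasSynced reports whether every initial-list item has been processed.
func (t *endpointTranslator) HasSynced() bool {
	select {
	case <-t.synced:
		return true
	default:
		return false
	}
}

// Len returns the number of services the translator currently knows.
func (t *endpointTranslator) Len() int {
	t.mu.Lock()
	defer t.mu.Unlock()
	return len(t.endpoints)
}

// waitForStartup blocks until the dag is built AND the translator is synced.
func waitForStartup(dagBuilt <-chan struct{}, t *endpointTranslator) {
	<-dagBuilt
	<-t.synced
}

func main() {
	t := newEndpointTranslator(3)
	dagBuilt := make(chan struct{})
	close(dagBuilt) // simulate the dag being ready before endpoints are

	go func() {
		for i := 0; i < 3; i++ {
			t.OnAdd(fmt.Sprintf("svc-%d", i), []string{"10.0.0.1"}, true)
		}
	}()

	waitForStartup(dagBuilt, t)
	fmt.Println("endpoints known at xDS start:", t.Len())
}
```

In this sketch the dag is ready first, yet startup still blocks until the third initial endpoint has been processed, so the first xDS response cannot see a partial set.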

In practice everything will likely be fine, because the event handler waits for all informers to sync, and for all the rest of the non-endpoint resources to be seen and processed, before building a dag and letting the xDS server start. That wait includes the endpoint informer because of https://github.com/projectcontour/contour/blob/ea4d4f964f23bfaa9f4e0b32a8027e15e23c88e8/cmd/contour/serve.go#L1277 and https://github.com/projectcontour/contour/blob/ea4d4f964f23bfaa9f4e0b32a8027e15e23c88e8/cmd/contour/serve.go#L582-L589. However, we don't have a guarantee.
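
To illustrate why "the informer has synced" is weaker than "the handler has processed everything", here is a deliberately simplified model. It is not client-go code: fakeInformer and its methods are hypothetical, and the sketch assumes that an informer reports synced once its local store is populated while handler callbacks are delivered separately.

```go
package main

import "fmt"

// fakeInformer is a toy model: the store is filled first, and handler
// notifications are queued for later delivery.
type fakeInformer struct {
	store   map[string]struct{}
	pending []string // notifications not yet delivered to the handler
	synced  bool
}

func newFakeInformer() *fakeInformer {
	return &fakeInformer{store: make(map[string]struct{})}
}

// list populates the store and marks the informer synced, but only
// queues the corresponding handler notifications.
func (f *fakeInformer) list(names ...string) {
	for _, n := range names {
		f.store[n] = struct{}{}
		f.pending = append(f.pending, n)
	}
	f.synced = true
}

func (f *fakeInformer) hasSynced() bool { return f.synced }

// deliverOne hands one queued notification to the handler and reports
// whether anything was delivered.
func (f *fakeInformer) deliverOne(handler func(name string)) bool {
	if len(f.pending) == 0 {
		return false
	}
	name := f.pending[0]
	f.pending = f.pending[1:]
	handler(name)
	return true
}

func main() {
	inf := newFakeInformer()
	seen := 0
	onAdd := func(string) { seen++ }

	inf.list("svc-a", "svc-b", "svc-c")
	fmt.Printf("informer synced: %v, handler has seen %d of %d\n",
		inf.hasSynced(), seen, len(inf.store))

	for inf.deliverOne(onAdd) {
		// drain the queued notifications
	}
	fmt.Printf("after delivery, handler has seen %d of %d\n", seen, len(inf.store))
}
```

Between the two print statements the model is in exactly the state described above: sync status is true while the handler has processed nothing, so a gate based only on informer sync would not be enough for the translator.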

sunjayBhatia, Jan 19 '24 20:01

The Contour project currently lacks enough contributors to adequately respond to all Issues.

This bot triages Issues according to the following rules:

  • After 60d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the Issue is closed

You can:

  • Mark this Issue as fresh by commenting
  • Close this Issue
  • Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

github-actions[bot], Mar 20 '24 00:03


