Support recovery after translation errors

Open guydc opened this issue 10 months ago • 3 comments

Description: Currently, EG may work in either FailOpen or FailClosed mode when encountering XDS translation errors. Errors can stem from invalid resources or transient failures (e.g. in connection to the extension server).

When failures occur, a best-effort XDS state is stored, which is likely disruptive to end users, either due to wiped-out routes and listeners or due to partially translated configuration.

EG should attempt to recover from failed translations. This can be accomplished by:

  • Retrying translation (with delay and backoff)
  • Retriggering translation once the cause of the error is resolved: users fix malformed resources (already supported), or connectivity to the extension server is restored.
  • Periodically reconciling XDS to fix entropy

Ideally, the message framework in EG should be used to trigger a retry or recovery translation. Components such as the Extension Manager and the XDS Translator should be able to trigger one by publishing a relevant message.
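As a rough illustration of the publish-to-retry idea, the sketch below uses plain Go channels rather than EG's actual message framework. `RetryRequest`, `RetryBus`, `Subscribe` and `Publish` are all hypothetical names invented for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// RetryRequest is a hypothetical message asking for a re-translation.
type RetryRequest struct {
	Source string // component that published the request
	Reason string
}

// RetryBus is a hypothetical, minimal publish/subscribe hub.
type RetryBus struct {
	mu   sync.Mutex
	subs []chan RetryRequest
}

// Subscribe registers a new subscriber with the given buffer size.
func (b *RetryBus) Subscribe(buffer int) <-chan RetryRequest {
	ch := make(chan RetryRequest, buffer)
	b.mu.Lock()
	defer b.mu.Unlock()
	b.subs = append(b.subs, ch)
	return ch
}

// Publish offers req to every subscriber without blocking and returns how many
// accepted it. A full buffer means a retry is already pending for that
// subscriber, so the extra request is dropped (coalesced).
func (b *RetryBus) Publish(req RetryRequest) int {
	b.mu.Lock()
	defer b.mu.Unlock()
	delivered := 0
	for _, ch := range b.subs {
		select {
		case ch <- req:
			delivered++
		default:
		}
	}
	return delivered
}

func main() {
	bus := &RetryBus{}
	requests := bus.Subscribe(1) // the translator side listens for retry requests

	bus.Publish(RetryRequest{Source: "extension-manager", Reason: "connection restored"})
	bus.Publish(RetryRequest{Source: "xds-translator", Reason: "translation error"}) // coalesced

	req := <-requests
	fmt.Printf("re-translating: requested by %s (%s)\n", req.Source, req.Reason)
}
```

With a buffer of one, a burst of retry requests collapses into a single pending re-translation, which avoids piling up redundant work while a translation is already queued.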

guydc avatar Mar 11 '25 14:03 guydc

+1 to adding retry with the extension manager client similar to the behavior of envoy proxy connecting to EG (endless retry with backoff)

arkodg avatar Mar 11 '25 21:03 arkodg

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

github-actions[bot] avatar Apr 11 '25 00:04 github-actions[bot]

I think we'd have to force a regeneration of all xDS IRs that depend on the extension manager every time a connection to the extension server is reestablished, and ensure that the connection is always live (e.g. through periodic probing or reconciling, as mentioned above). That would deal with the tricky case where the extension server is redeployed, for example during an upgrade. As such, I think the only safe deployment model at the moment would be to deploy the extension server as a sidecar to the controller?

wtzhang23 avatar Dec 04 '25 16:12 wtzhang23