Support recovery after translation errors
Description:
Currently, EG can run in either FailOpen or FailClosed mode when it encounters xDS translation errors. Errors can stem from invalid resources or from transient failures (e.g. a lost connection to the extension server).
When a failure occurs, a best-effort xDS state is stored, which is likely disruptive to end users, either due to wiped-out routes and listeners or due to partially translated configuration.
EG should attempt to recover from failed translations. This can be accomplished by:
- Retrying translation (with delay and backoff)
- Retriggering translation once the cause of the error is fixed: when users fix malformed resources (already supported) or when connectivity to the extension server is restored.
- Periodically reconciling xDS to correct any drift
Ideally, the message framework in EG should be used to trigger a retry or recovery translation. Components such as the Extension Manager and the xDS Translator should be able to trigger one by publishing a relevant message.
+1 to adding retry to the extension manager client, similar to how Envoy Proxy behaves when connecting to EG (endless retry with backoff).
I think we'd have to force a regeneration of all xDS IRs that depend on the extension manager every time the connection to the extension server is re-established, and ensure that the connection is always live (e.g. through periodic probing or reconciling, as mentioned above), to deal with the tricky case of the extension server being redeployed, for example during an upgrade. As such, I think the only safe deployment model at the moment would be to deploy the extension server as a sidecar to the controller?