feat(EG K8S Provider): Improve EG Gateway xDS & startup reliability
What type of PR is this?
feat(EG K8S Provider): Improve EG Gateway xDS & startup reliability
What this PR does / why we need it:
- Service Readiness: By waiting for the xDS service to start up, we ensure that the service discovery mechanism is operational, which is crucial for dynamically managing service configurations and routing in microservices architectures.
- Stability and Reliability: This cautious approach prioritizes the stability and reliability of the system by preventing any unconfigured or misconfigured instances of Envoy from intercepting traffic, which could lead to service disruptions or degraded user experiences.
- Observability: Completing a successful reconcile cycle before startup can make it easier to monitor and debug the system, as it sets a clear baseline for the operational state of the Envoy instances.
Which issue(s) this PR fixes:
This enhancement enables safer upgrades of EG gateways and improves xDS consistency on startup.
High Level Changes
A custom health check is implemented so that the EG controller is deemed ready only once the xDS snapshot is persisted or, for empty or new clusters, after at least one reconciliation has completed. A custom watcher is added to trigger an initial dummy reconciliation; this ensures that empty or new deployments can start, since a successful reconcile is the trigger for startup.
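The readiness rule described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `readinessGate` and its methods are hypothetical names, and only the Go standard library is used. The `Check` method has the same shape as controller-runtime's `healthz.Checker` (`func(req *http.Request) error`), so a gate like this could presumably be registered as a readiness check.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"sync/atomic"
)

// readinessGate is a hypothetical helper (not the type used in this PR).
// It tracks the two signals the description mentions.
type readinessGate struct {
	reconciled        atomic.Bool // at least one reconcile has completed
	hasResources      atomic.Bool // that reconcile found resources to translate
	snapshotPersisted atomic.Bool // the xDS cache has stored a snapshot
}

// MarkReconciled records that a reconcile finished and whether it had work.
// hasResources is stored first so readers never see reconciled without it.
func (g *readinessGate) MarkReconciled(hasResources bool) {
	g.hasResources.Store(hasResources)
	g.reconciled.Store(true)
}

// MarkSnapshotPersisted records that the first xDS snapshot was written.
func (g *readinessGate) MarkSnapshotPersisted() {
	g.snapshotPersisted.Store(true)
}

// Check returns nil when the instance may report ready.
func (g *readinessGate) Check(_ *http.Request) error {
	if !g.reconciled.Load() {
		return errors.New("no reconcile has completed yet")
	}
	if g.hasResources.Load() && !g.snapshotPersisted.Load() {
		return errors.New("waiting for the first xDS snapshot")
	}
	return nil
}

func main() {
	g := &readinessGate{}
	fmt.Println("before reconcile:", g.Check(nil))
	g.MarkReconciled(true)
	fmt.Println("reconciled, no snapshot:", g.Check(nil))
	g.MarkSnapshotPersisted()
	fmt.Println("snapshot persisted:", g.Check(nil))
}
```

An empty cluster would call `MarkReconciled(false)` and become ready without ever persisting a snapshot.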
- Configuration Validation: Ensuring a successful reconcile cycle before startup means that the initial configuration passed to Envoy proxies is validated and error-free. This minimizes the chances of runtime errors related to misconfigurations.
https://github.com/envoyproxy/gateway/issues/2810
! Draft !
Codecov Report
Attention: Patch coverage is 51.61290% with 30 lines in your changes are missing coverage. Please review.
Project coverage is 64.60%. Comparing base (29946b0) to head (7afa2db). Report is 122 commits behind head on main.
:exclamation: Current head 7afa2db differs from pull request most recent head 1505793. Consider uploading reports for the commit 1505793 to get more accurate results
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2918      +/-   ##
==========================================
- Coverage   66.51%   64.60%   -1.91%
==========================================
  Files         161      123      -38
  Lines       22673    21179    -1494
==========================================
- Hits        15080    13683    -1397
+ Misses       6720     6646      -74
+ Partials      873      850      -23
@alexwo is this necessary? Based on https://github.com/envoyproxy/gateway/blob/main/internal/xds/cache/snapshotcache.go I believe we only push a snapshot if the IR exists, so if an IR hasn't been translated yet, it shouldn't result in an xDS response (this needs to be cross-checked).
Sure,
There are essentially two types of conditions that indicate that an EG instance is healthy enough to serve traffic:
1. Empty or new cluster: If there are no items requiring translation, we don't expect an xDS response. We only verify that the xDS server has started successfully and that a reconciliation loop, albeit a no-op, has run. In this case EG simply starts up without xDS having to persist anything.
To determine that a cluster is empty or new and can start under this condition, we trigger the "initial reconcile loop", which signals the instance to start if xDS / the initial loop finds there is nothing to do.
2. Cluster with resources for xDS: If a reconciliation loop has items to process and IRs are translated as a result, we verify that a snapshot has actually been persisted before deeming the instance ready. (In this case the reconcile has items, so we wait until the loop completes and writes the first snapshot, and only then mark the instance as healthy.)
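To illustrate why the initial reconcile trigger matters for the two cases above, here is a small self-contained sketch. It is an assumption-laden toy, not Envoy Gateway code: `event`, `controller`, and `start` are hypothetical names, the resource list stands in for watched Kubernetes objects, and incrementing a counter stands in for translating IR and persisting an xDS snapshot.

```go
package main

import (
	"fmt"
	"sync"
)

// event is a hypothetical reconcile trigger; initial marks the synthetic
// event sent once at startup.
type event struct{ initial bool }

// controller is a toy stand-in for the provider's reconciler.
type controller struct {
	resources []string      // stand-in for watched objects
	snapshots int           // number of xDS snapshots "persisted"
	ready     chan struct{} // closed once the instance may report ready
	once      sync.Once
}

func (c *controller) markReady() { c.once.Do(func() { close(c.ready) }) }

// reconcile handles one event. With no resources (case 1) it only signals
// readiness; with resources (case 2) it "persists" a snapshot first.
func (c *controller) reconcile(_ event) {
	if len(c.resources) > 0 {
		c.snapshots++ // placeholder for translate + persist snapshot
	}
	c.markReady()
}

func (c *controller) isReady() bool {
	select {
	case <-c.ready:
		return true
	default:
		return false
	}
}

// start runs the event loop once and reports readiness and snapshot count.
// sendInitial controls whether the synthetic startup event is enqueued.
func start(resources []string, sendInitial bool) (bool, int) {
	c := &controller{resources: resources, ready: make(chan struct{})}
	events := make(chan event, 1)
	if sendInitial {
		events <- event{initial: true}
	}
	close(events)
	for ev := range events {
		c.reconcile(ev)
	}
	return c.isReady(), c.snapshots
}

func main() {
	fmt.Println(start(nil, true))                         // empty cluster, initial event sent
	fmt.Println(start([]string{"gateway/example"}, true)) // cluster with one resource
	fmt.Println(start(nil, false))                        // empty cluster, no initial event
}
```

In this toy model, an empty cluster without the synthetic event never runs a reconcile and therefore never becomes ready, which is the situation the initial dummy reconciliation is meant to avoid.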
@alexwo You closed this PR in favor of @arkodg's suggestion (https://github.com/envoyproxy/gateway/issues/2810#issuecomment-1981979019) to use a longer initialDelaySeconds?