gateway
gateway copied to clipboard
fix: add isTransientError helper to classify retryable errors during reconcile
What type of PR is this?
fix
What this PR does / why we need it:
Introduces isTransientError to detect transient Kubernetes errors and enable proper reconciliation retries.
Which issue(s) this PR fixes:
Fixes https://github.com/envoyproxy/gateway/issues/6284
Release Notes: Yes/No
Codecov Report
Attention: Patch coverage is 27.44186% with 156 lines in your changes missing coverage. Please review.
Project coverage is 70.73%. Comparing base (
a394c61) to head (80b953e). Report is 7 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/provider/kubernetes/controller.go | 27.44% | 150 Missing and 6 partials :warning: |
:x: Your patch status has failed because the patch coverage (27.44%) is below the target coverage (60.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files
@@ Coverage Diff @@
## main #6299 +/- ##
==========================================
- Coverage 70.92% 70.73% -0.19%
==========================================
Files 220 220
Lines 37260 37354 +94
==========================================
- Hits 26426 26423 -3
- Misses 9289 9383 +94
- Partials 1545 1548 +3
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
@arkodg, @patrostkowski My question here would be: In case of a non-transient error, we will still store inconsistent resources (partial? invalid?) that may lead to corruption of XDS. Is there a way to avoid this, e.g. by using previously-stored resources for the affected GWC if they exist? This would be more aligned and compatible with previous behavior, where ALL XDS is essentially not updated in case of ANY error in this area of provider?
Notes from the community meeting:
- Provider layer should publish an accurate state-of-the-world.
- If transient errors make it impossible to build an accurate snapshot, the reconciliation should be retried by returning early with an error
- If validation errors are encountered, state should be published, so that status can be updated and user notified about issues with resources.
- If users want to implement alternative strategies for handling invalid configuration (e.g. pausing XDS updates), that should be implemented in other layers.
can we also log the transient error
@patrostkowski still working on this ? we'd like to get this into the patch release scheduled to be out next week
It seems @patrostkowski is currently unavailable. I’ll take over from here since we need this in the upcoming patch.
thanks @zhaohuabing and sorry, I am very busy time atm with other things 😞
thanks @zhaohuabing and sorry, I am very busy time atm with other things 😞
No worries. I'll take care of it.
/retest