gateway icon indicating copy to clipboard operation
gateway copied to clipboard

fix: add isTransientError helper to classify retryable errors during reconcile

Open patrostkowski opened this issue 5 months ago • 4 comments

What type of PR is this?

fix

What this PR does / why we need it:

Introduces isTransientError to detect transient Kubernetes errors and enable proper reconciliation retries.

Which issue(s) this PR fixes:

Fixes https://github.com/envoyproxy/gateway/issues/6284

Release Notes: Yes/No

patrostkowski avatar Jun 12 '25 18:06 patrostkowski

Codecov Report

Attention: Patch coverage is 27.44186% with 156 lines in your changes missing coverage. Please review.

Project coverage is 70.73%. Comparing base (a394c61) to head (80b953e). Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
internal/provider/kubernetes/controller.go 27.44% 150 Missing and 6 partials :warning:

:x: Your patch status has failed because the patch coverage (27.44%) is below the target coverage (60.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6299      +/-   ##
==========================================
- Coverage   70.92%   70.73%   -0.19%     
==========================================
  Files         220      220              
  Lines       37260    37354      +94     
==========================================
- Hits        26426    26423       -3     
- Misses       9289     9383      +94     
- Partials     1545     1548       +3     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Jun 13 '25 09:06 codecov[bot]

@arkodg, @patrostkowski My question here would be: In case of a non-transient error, we will still store inconsistent resources (partial? invalid?) that may lead to corruption of XDS. Is there a way to avoid this, e.g. by using previously-stored resources for the affected GWC if they exist? This would be more aligned and compatible with previous behavior, where ALL XDS is essentially not updated in case of ANY error in this area of provider?

guydc avatar Jun 13 '25 14:06 guydc

Notes from the community meeting:

  • Provider layer should publish an accurate state-of-the-world.
  • If transient errors make it impossible to build an accurate snapshot, the reconciliation should be retried by returning early with an error
  • If validation errors are encountered, state should be published, so that status can be updated and user notified about issues with resources.
  • If users want to implement alternative strategies for handling invalid configuration (e.g. pausing XDS updates), that should be implemented in other layers.

guydc avatar Jun 18 '25 16:06 guydc

can we also log the transient error

arkodg avatar Jun 19 '25 01:06 arkodg

@patrostkowski still working on this ? we'd like to get this into the patch release scheduled to be out next week

arkodg avatar Jun 24 '25 21:06 arkodg

It seems @patrostkowski is currently unavailable. I’ll take over from here since we need this in the upcoming patch.

zhaohuabing avatar Jun 26 '25 02:06 zhaohuabing

thanks @zhaohuabing and sorry, I am very busy time atm with other things 😞

patrostkowski avatar Jun 26 '25 09:06 patrostkowski

thanks @zhaohuabing and sorry, I am very busy time atm with other things 😞

No worries. I'll take care of it.

zhaohuabing avatar Jun 26 '25 09:06 zhaohuabing

/retest

zhaohuabing avatar Jun 30 '25 02:06 zhaohuabing