gateway fix: add isTransientError helper to classify retryable errors during reconcile

What type of PR is this?

fix

What this PR does / why we need it:

Introduces isTransientError to detect transient Kubernetes errors and enable proper reconciliation retries.

Which issue(s) this PR fixes:

Fixes https://github.com/envoyproxy/gateway/issues/6284

Release Notes: Yes/No

Jun 12 '25 18:06 patrostkowski

Codecov Report

Attention: Patch coverage is 27.44186% with 156 lines in your changes missing coverage. Please review.

Project coverage is 70.73%. Comparing base (a394c61) to head (80b953e). Report is 7 commits behind head on main.

Files with missing lines	Patch %	Lines
internal/provider/kubernetes/controller.go	27.44%	150 Missing and 6 partials :warning:

:x: Your patch status has failed because the patch coverage (27.44%) is below the target coverage (60.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #6299      +/-   ##
==========================================
- Coverage   70.92%   70.73%   -0.19%     
==========================================
  Files         220      220              
  Lines       37260    37354      +94     
==========================================
- Hits        26426    26423       -3     
- Misses       9289     9383      +94     
- Partials     1545     1548       +3

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:

:snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Jun 13 '25 09:06 codecov[bot]

@arkodg, @patrostkowski My question here would be: In case of a non-transient error, we will still store inconsistent resources (partial? invalid?) that may lead to corruption of XDS. Is there a way to avoid this, e.g. by using previously-stored resources for the affected GWC if they exist? This would be more aligned and compatible with previous behavior, where ALL XDS is essentially not updated in case of ANY error in this area of provider?

Jun 13 '25 14:06 guydc

Notes from the community meeting:

Provider layer should publish an accurate state-of-the-world.
If transient errors make it impossible to build an accurate snapshot, the reconciliation should be retried by returning early with an error
If validation errors are encountered, state should be published, so that status can be updated and user notified about issues with resources.
If users want to implement alternative strategies for handling invalid configuration (e.g. pausing XDS updates), that should be implemented in other layers.

Jun 18 '25 16:06 guydc

can we also log the transient error

Jun 19 '25 01:06 arkodg

@patrostkowski still working on this ? we'd like to get this into the patch release scheduled to be out next week

Jun 24 '25 21:06 arkodg

It seems @patrostkowski is currently unavailable. I’ll take over from here since we need this in the upcoming patch.

Jun 26 '25 02:06 zhaohuabing

thanks @zhaohuabing and sorry, I am very busy time atm with other things 😞

Jun 26 '25 09:06 patrostkowski

thanks @zhaohuabing and sorry, I am very busy time atm with other things 😞

No worries. I'll take care of it.

Jun 26 '25 09:06 zhaohuabing

/retest

Jun 30 '25 02:06 zhaohuabing

gateway gateway copied to clipboard

fix: add isTransientError helper to classify retryable errors during reconcile

Codecov Report

gateway
gateway copied to clipboard