envoy Envoy does not use previously sent RouteConfiguration when initial_fetch_timeout value is changed inside Rds config

Title: Envoy does not use previously sent RouteConfiguration when initial_fetch_timeout value is changed inside Rds config

Description:

Based on the Resource warming section of the Envoy documentation, Envoy is expected to use the previously sent RouteConfiguration while warming a Listener, and the management server does not need to resend the RouteConfiguration if it has not changed. However, when any field inside the Rds field of a Listener changes, including the initial_fetch_timeout field, Envoy does not use the previously sent RouteConfiguration and instead waits for the management server to send the RouteConfiguration.

This can cause Envoy to time out while waiting for the RouteConfiguration and finish Listener warming without it. Once a Listener is warmed without a RouteConfiguration, Envoy responds to requests for the route with 404 (NR) responses until it is restarted, or until the RouteConfiguration is updated and the management server sends the updated RouteConfiguration to Envoy.

This happens because Envoy does not use existing_provider in https://github.com/envoyproxy/envoy/blob/v1.26.6/source/common/rds/route_config_provider_manager.cc#L82 if the hash value of the Rds configuration changes, which prevents Envoy from reusing the previously sent RouteConfiguration.
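To illustrate the lookup behavior described above, here is a toy Python model. It is not Envoy's C++ code; it only assumes, as the description says, that providers are looked up by a hash of the whole Rds configuration. `ProviderCache`, `config_hash`, the route name `local_route`, and the 15s/30s timeout values are all hypothetical and chosen for illustration.

```python
import hashlib
import json


def config_hash(rds: dict) -> str:
    """Hash the entire Rds config (toy stand-in for hashing the proto)."""
    canonical = json.dumps(rds, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()


class ProviderCache:
    """Hypothetical provider cache keyed by the hash of the whole Rds config."""

    def __init__(self):
        self._providers = {}

    def get_or_create(self, rds: dict):
        """Return (provider, reused). A new provider starts with no routes."""
        key = config_hash(rds)
        existing = self._providers.get(key)
        if existing is not None:
            return existing, True
        provider = {"name": rds["route_config_name"], "route_config": None}
        self._providers[key] = provider
        return provider, False


# Two Rds configs that differ only in initial_fetch_timeout (values assumed).
old_rds = {
    "route_config_name": "local_route",
    "config_source": {"ads": {}, "initial_fetch_timeout": "15s"},
}
new_rds = {
    "route_config_name": "local_route",
    "config_source": {"ads": {}, "initial_fetch_timeout": "30s"},
}

cache = ProviderCache()
provider, _ = cache.get_or_create(old_rds)
provider["route_config"] = "routes-v1"  # management server delivered routes

_, reused_same = cache.get_or_create(old_rds)
new_provider, reused_changed = cache.get_or_create(new_rds)
```

In this model the unchanged config finds the existing provider, while the config with a different timeout hashes to a new key and gets a provider with no RouteConfiguration, which mirrors the situation reported here.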

Can we update Envoy to use existing_provider when the initial_fetch_timeout value is changed? We do not need to change it often, but we sometimes do, and we want to avoid restarting Envoy proxies whenever we update the initial_fetch_timeout value.

Repro steps:

This can be reproduced by running an Envoy proxy that uses ADS to fetch configuration from a management server, and then changing the initial_fetch_timeout value in the ConfigSource in a Listener. I have a simple management server that reproduces the issue and can provide it if that helps.

Feb 08 '24 20:02 jparklab

Curious, what is the reason for dynamically changing init_fetch_timeout ?

Feb 09 '24 08:02 ramaraochavali

> Curious, what is the reason for dynamically changing init_fetch_timeout ?

We run multiple Envoy proxies as edge proxies connected to the same management server, receiving configuration for more than a few thousand services, and we see a large number of fetch timeouts when we restart the management server, since all of the Envoy proxies reconnect. We want to adjust initial_fetch_timeout to avoid these fetch timeouts (we are also considering re-architecting, but that is a longer-term goal for us).

The parameter will be updated when we release the change to the management server, and the Envoy proxies will get the updated value dynamically when they fetch updated Listener configurations.

Feb 09 '24 13:02 jparklab

@alyssawilk I think (as codeowner on router)

Feb 12 '24 15:02 ravenblackx

I suspect this is more of an RDS issue so tagging @adisuissa for thoughts

Feb 12 '24 16:02 alyssawilk

Yes, it seems that the identifier should be the unique resource name (+ which config server was used to serve it). The fix will be to create a unique ID from the proto instead of hashing the entire proto.
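A rough sketch of that idea in Python, under stated assumptions: the key is built from the route config name plus the config source, with initial_fetch_timeout left out because it does not identify the resource. `provider_key` is a hypothetical helper, and exactly which fields should count toward identity is an assumption here, not something decided in this thread.

```python
import json


def provider_key(rds: dict) -> tuple:
    """Build an identity key from the resource name and the config source.

    Assumption: initial_fetch_timeout is a tuning knob, not part of identity,
    so it is dropped before the config source is serialized.
    """
    source = dict(rds["config_source"])  # shallow copy; input is not mutated
    source.pop("initial_fetch_timeout", None)
    return (rds["route_config_name"], json.dumps(source, sort_keys=True))


# Hypothetical Rds configs; names and timeout values are illustrative only.
old_rds = {
    "route_config_name": "local_route",
    "config_source": {"ads": {}, "initial_fetch_timeout": "15s"},
}
new_rds = {
    "route_config_name": "local_route",
    "config_source": {"ads": {}, "initial_fetch_timeout": "30s"},
}
other_rds = {
    "route_config_name": "other_route",
    "config_source": {"ads": {}, "initial_fetch_timeout": "15s"},
}

same_key = provider_key(old_rds) == provider_key(new_rds)
different_route = provider_key(old_rds) != provider_key(other_rds)
```

With a key like this, changing only the timeout maps to the same provider, while a different route name or a different config source still maps to a separate one.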

Feb 12 '24 16:02 adisuissa

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

Mar 13 '24 20:03 github-actions[bot]

Hi, I am interested in working on this issue. Please let me know if this is still available.

I am new to this repository and would appreciate your sharing resources around the issue and starting pointers.

Oct 01 '24 05:10 srivatsav1998
