envoy icon indicating copy to clipboard operation
envoy copied to clipboard

Envoy does not use previously sent RouteConfiguration when initial_fetch_timeout value is changed inside Rds config

Open jparklab opened this issue 1 year ago • 6 comments

Title: Envoy does not use previously sent RouteConfiguration when initial_fetch_timeout value is changed inside Rds config

Description:

Based on Resource warming section on the envoy documentation, envoy is expected to use the previously sent RouteConfiguration while warming up a Listener and management does not need to send the RouteConfiguration if there is no change. However, when a field inside Rds field in the Listener, including initial_fetch_timeout field, is changed in a Listener, Envoy does not use the previously sent RouteConfiguration and wait for management server for the RouteConfiguration.

This can cause Envoy to time out while waiting for the RouteConfiguration, and finishes Listener warming without the RouteConfiguration. Once a Listener is warmed up without RouteCofiguration, Envoy responds to requests to the route with 404(NR) responses until it is restarted or the RouteConfiguration is updated and management server sends the updated RouteConfiguration to Envoy.

This happens because Envoy does not use existing_provider in https://github.com/envoyproxy/envoy/blob/v1.26.6/source/common/rds/route_config_provider_manager.cc#L82 if the hash value of rds configuration changes which prevents Envoy from using previously sent RouteConfiguration

Can we update Envoy to use existing_provider when initial_fetch_timeout value is changed? Although we do not need to change it often, we sometimes need to change the value, and we want to avoid restarting envoy proxies whenever we need to update initial_fetch_timeout value.

Repro steps:

This can be reproduced by running an envoy proxy that uses ADS to fetch configurations from a management server, and change initial_fetch_timeout value in ConfigSource in a listener. I have a simple management server to reproduce the issue, and can provide it if helps.

jparklab avatar Feb 08 '24 20:02 jparklab

Curious, what is the reason for dynamically changing init_fetch_timeout ?

ramaraochavali avatar Feb 09 '24 08:02 ramaraochavali

Curious, what is the reason for dynamically changing init_fetch_timeout ?

We have multiple envoy proxies as edge proxies connected to the same management server receiving configurations for more than a few thousands of services, and we see a large number of fetch timeouts when we restart the management server since all of the envoy proxies reconnects. We want to adjust init_fetch_timeout to avoid fetch timeouts (we are also considering rearchitecturing, however, that's a longer term goal for us).

The parameter will be updated when we release the change to the management server, and envoy proxies will get the updated value dynamically when it fetches updated listener configurations

jparklab avatar Feb 09 '24 13:02 jparklab

@alyssawilk I think (as codeowner on router)

ravenblackx avatar Feb 12 '24 15:02 ravenblackx

I suspect this is more of an RDS issue so tagging @adisuissa for thoughts

alyssawilk avatar Feb 12 '24 16:02 alyssawilk

Yes, it seems that the identifier should be the unique resource name (+ what config-server used to serve it). The fix will be creating a unique-ID given the proto instead of hashing the entire proto.

adisuissa avatar Feb 12 '24 16:02 adisuissa

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] avatar Mar 13 '24 20:03 github-actions[bot]

Hi, I am interested in working on this issue. Please let me know if this is still available.

I am new to this repository and would appreciate your sharing resources around the issue and starting pointers.

srivatsav1998 avatar Oct 01 '24 05:10 srivatsav1998