Title: Envoy does not use previously sent RouteConfiguration when initial_fetch_timeout value is changed inside Rds config
Description:
Based on the Resource warming section of the Envoy documentation, Envoy is expected to use the previously sent RouteConfiguration while warming up a Listener, and the management server does not need to resend the RouteConfiguration if there is no change. However, when any field inside the Rds field of a Listener, including the initial_fetch_timeout field, is changed, Envoy does not use the previously sent RouteConfiguration and instead waits for the management server to send the RouteConfiguration again.
This can cause Envoy to time out while waiting for the RouteConfiguration and finish Listener warming without it. Once a Listener is warmed up without a RouteConfiguration, Envoy responds to requests on that route with 404 (NR) responses until it is restarted, or until the RouteConfiguration is updated and the management server sends the updated RouteConfiguration to Envoy.
This happens because Envoy does not use existing_provider in https://github.com/envoyproxy/envoy/blob/v1.26.6/source/common/rds/route_config_provider_manager.cc#L82 when the hash value of the Rds configuration changes, which prevents Envoy from reusing the previously sent RouteConfiguration.
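To illustrate, here is a minimal, self-contained C++ sketch (not Envoy's actual code; RdsConfig, hashFullConfig, and the provider map are hypothetical stand-ins) of how keying a provider cache on a hash of the entire Rds config misses the cache when only initial_fetch_timeout changes:

```cpp
// Standalone sketch, NOT Envoy's implementation: shows why hashing the
// whole Rds config as the cache key drops the warm provider when only
// initial_fetch_timeout changes. All names here are hypothetical.
#include <cstdint>
#include <functional>
#include <iostream>
#include <memory>
#include <string>
#include <unordered_map>

// Simplified stand-in for the Rds proto: the route config name plus a
// transport-level knob that does not affect the routes themselves.
struct RdsConfig {
  std::string route_config_name;
  std::uint64_t initial_fetch_timeout_ms;
};

// Hash over *every* field, analogous to hashing the full proto.
std::uint64_t hashFullConfig(const RdsConfig& c) {
  return std::hash<std::string>{}(c.route_config_name) ^
         (std::hash<std::uint64_t>{}(c.initial_fetch_timeout_ms) << 1);
}

struct RouteConfigProvider {
  std::string routes = "<previously fetched RouteConfiguration>";
};

int main() {
  std::unordered_map<std::uint64_t, std::shared_ptr<RouteConfigProvider>>
      providers;

  // Initial listener: the provider is cached under the full-config hash.
  RdsConfig v1{"edge_routes", 15000};
  providers[hashFullConfig(v1)] = std::make_shared<RouteConfigProvider>();

  // Same route config name, only the timeout changed: the lookup key
  // differs, so the existing warm provider is not found.
  RdsConfig v2{"edge_routes", 60000};
  bool hit = providers.find(hashFullConfig(v2)) != providers.end();
  std::cout << (hit ? "cache hit: reuse existing provider"
                    : "cache miss: must re-fetch and re-warm the route config")
            << "\n";
}
```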
Can we update Envoy to use existing_provider when the initial_fetch_timeout value is changed? Although we do not need to change it often, we sometimes need to change the value, and we want to avoid restarting Envoy proxies whenever we need to update the initial_fetch_timeout value.
Repro steps:
This can be reproduced by running an Envoy proxy that uses ADS to fetch configurations from a management server and changing the initial_fetch_timeout value in the ConfigSource of a listener. I have a simple management server that reproduces the issue, and I can provide it if that helps.
Curious, what is the reason for dynamically changing init_fetch_timeout?
We have multiple Envoy proxies as edge proxies connected to the same management server, receiving configurations for more than a few thousand services, and we see a large number of fetch timeouts when we restart the management server since all of the Envoy proxies reconnect at once. We want to adjust init_fetch_timeout to avoid those fetch timeouts (we are also considering rearchitecting, but that is a longer-term goal for us).
The parameter will be updated when we release the change to the management server, and the Envoy proxies will pick up the new value dynamically when they fetch updated listener configurations.
@alyssawilk I think (as codeowner on router). I suspect this is more of an RDS issue, so tagging @adisuissa for thoughts.
Yes, it seems that the identifier should be the unique resource name (plus which config server serves it). The fix will be to create a unique ID from the proto instead of hashing the entire proto.
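If helpful, a minimal sketch of that direction (again with hypothetical names, not Envoy's real types): key the provider lookup on the resource name plus the config source that serves it, so that unrelated fields such as initial_fetch_timeout do not change the key.

```cpp
// Sketch of the suggested fix direction, NOT Envoy's implementation:
// key the provider cache on the resource identity rather than a hash
// of the entire Rds proto. All names here are hypothetical.
#include <cstddef>
#include <functional>
#include <iostream>
#include <memory>
#include <string>
#include <unordered_map>

struct RouteConfigProvider {
  std::string routes = "<cached RouteConfiguration>";
};

// Identity of a dynamic route resource: what is served and by whom.
// Transport knobs such as initial_fetch_timeout are deliberately excluded.
struct ProviderKey {
  std::string route_config_name;
  std::string config_source;  // e.g. the xDS cluster serving the resource
  bool operator==(const ProviderKey& o) const {
    return route_config_name == o.route_config_name &&
           config_source == o.config_source;
  }
};

struct ProviderKeyHash {
  std::size_t operator()(const ProviderKey& k) const {
    return std::hash<std::string>{}(k.route_config_name) ^
           (std::hash<std::string>{}(k.config_source) << 1);
  }
};

int main() {
  std::unordered_map<ProviderKey, std::shared_ptr<RouteConfigProvider>,
                     ProviderKeyHash>
      providers;
  providers[{"edge_routes", "ads_cluster"}] =
      std::make_shared<RouteConfigProvider>();

  // A listener update that only bumps initial_fetch_timeout maps to the
  // same key, so the existing warm provider is found and reused.
  bool hit = providers.find({"edge_routes", "ads_cluster"}) != providers.end();
  std::cout << (hit ? "cache hit: existing warm provider is reused"
                    : "cache miss")
            << "\n";
}
```

With a key like this, a listener update that only changes initial_fetch_timeout would resolve to the same provider and keep the previously fetched RouteConfiguration warm.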
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
Hi, I am interested in working on this issue. Please let me know if it is still available. I am new to this repository and would appreciate any resources about the issue and some pointers for getting started.