[Question] Some questions regarding the mechanism of service discovery in Envoy proxy
Description:
Hey Envoy community, this is Jason from the Shopify Caching platform team. We have some questions regarding the mechanism of Envoy Service Discovery.
Here is some context about our architecture. We run an Envoy proxy in production that routes traffic to Redis. There is also an xDS (gRPC) server that collects the IP addresses of the KeyDB pods, builds a snapshot, and serves requests from the Envoy proxy. Each Redis node has both primary and secondary pods: the primary serves real traffic, and we use Redis PSYNC to replicate the offset to the secondary. We also have a mechanism to fail over from primary to secondary when the primary stops responding to health checks (INFO commands).
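To make the wiring concrete, here is a minimal sketch of how the Envoy side of this kind of setup is typically configured (this is not our exact production config; the cluster names, xDS server address, and port are placeholders): the Redis cluster gets its endpoints via EDS over a gRPC connection to the xDS server.

static_resources:
  clusters:
  - name: redis_cluster              # endpoints (KeyDB pod IPs) are delivered by xDS via EDS
    type: EDS
    connect_timeout: 1s
    eds_cluster_config:
      eds_config:
        resource_api_version: V3
        api_config_source:
          api_type: GRPC
          transport_api_version: V3
          grpc_services:
          - envoy_grpc:
              cluster_name: xds_grpc
  - name: xds_grpc                   # the xDS management server itself
    type: STRICT_DNS
    connect_timeout: 5s
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          http2_protocol_options: {}
    load_assignment:
      cluster_name: xds_grpc
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: xds-server.example.svc.cluster.local   # placeholder
                port_value: 18000                               # placeholder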
This architecture works fine in most cases, but recently we faced some challenges. When a failover happens, there is a small chance that Envoy does not pick up the latest IPs from xDS. We checked the configuration during the incident: the xDS server had the correct snapshot, but Envoy never received it, so our clients saw a lot of timeouts because of stale IP addresses. Envoy stays stuck on the stale IPs until we restart the problematic Envoy proxies.
I used a diagram to describe the situation when the incident happens. When there is a failover event in the Redis cluster, step 1 works as expected and ZooKeeper has the refreshed IPs; the xDS server also gets the latest IPs from ZooKeeper to build the snapshot. However, the problem happens in steps 3 and 4: Envoy does not get the refreshed snapshot from xDS for an unknown reason, so it routes traffic to the stale IPs, which caused a lot of timeout errors for our clients.
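One way to compare what Envoy actually has against the xDS snapshot is the admin interface (a sketch only; it assumes the admin listener is on 127.0.0.1:9901 and the Redis cluster is named redis_cluster, adjust to your setup):

# Endpoints Envoy is actually routing to right now
curl -s 'http://127.0.0.1:9901/clusters' | grep redis_cluster

# Full dynamic config, including the EDS endpoints Envoy has accepted
curl -s 'http://127.0.0.1:9901/config_dump?include_eds' > envoy_config_dump.json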

Here are my questions:
- How do Envoy and the xDS server connect with each other? Do they communicate over a long-lived connection? Does Envoy send gRPC requests to xDS periodically, and if so, at what frequency? How can we check whether the connection between Envoy and xDS is still alive?
- Is there a mechanism in Envoy that checks whether it can still talk to the xDS server (like a health check)? Based on what we observed during the incident, do you have any suggestions for debugging this issue?
I attached the debug logs from one problematic Envoy pod and the xDS server logs from the incident to help understand the issue. Any reply would be really appreciated. Thank you!
(optional) Relevant Links:
bad_proxy_logs_aug11_pod10248.txt logs.xds-8f45448b7-gsd8t.txt
How do Envoy and xDS connect with each other?
Through a single long-lived connection.
Does Envoy send gRPC requests to xDS periodically?
No. Envoy sends a stream request at the start of the connection and then receives updates over that stream.
Is there a mechanism in Envoy that checks if it can talk with an xDS server?
As far as I know, there is no special check for it, but TCP keepalive can be used to ensure the connection stays healthy.
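That said, the admin interface does expose some signals you can poll to see whether the xDS connection and subscriptions are healthy. A rough sketch (assumes the admin listener is on 127.0.0.1:9901 and an EDS cluster named redis_cluster):

# 1 = currently connected to the management server, 0 = disconnected
curl -s 'http://127.0.0.1:9901/stats?filter=control_plane.connected_state'

# Per-subscription update counters for the cluster's EDS config source
curl -s 'http://127.0.0.1:9901/stats?filter=cluster.redis_cluster.update_.*'

If connected_state stays at 1 but update_success never moves after a failover, the stream is up but no new snapshot is reaching Envoy; if it drops to 0, the connection itself was lost.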
I have encountered a similar problem before, but I am not sure whether it is the same issue as yours. My issue was caused by the connection being lost silently. I configured a 300s keepalive time for the xDS connection to ensure the connection stays active.
clusters:
- name: xds_grpc
  upstream_connection_options:
    tcp_keepalive:
      keepalive_time: 300
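For completeness, envoy.config.core.v3.TcpKeepalive also exposes the probe interval and probe count if you want tighter dead-connection detection; the values below are only illustrative, not a recommendation:

upstream_connection_options:
  tcp_keepalive:
    keepalive_time: 300      # idle seconds before the first probe is sent
    keepalive_interval: 30   # seconds between probes (illustrative)
    keepalive_probes: 3      # unanswered probes before the connection is considered dead (illustrative)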
@wbpcode how did you come up with the 300s keepalive time? It seems significantly shorter than the Linux default (2h).
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.