DNS resolution failure results in UH / no healthy upstream
Title: Observing DNS resolution timeout, resulting in UH at pod startup of istio proxy
Description:
What issue is being seen? Describe what should be happening instead of the bug: Envoy should successfully complete DNS resolution, come up with the cluster endpoints populated, and avoid UH / no healthy upstream errors.
Repro steps:
Include sample requests, environment, etc. All data and inputs required to reproduce the bug.
- Create 5000 STRICT_DNS clusters that point to an AWS load balancer (an illustrative cluster config is sketched under Config below)
- Start Envoy with DNS logging enabled at debug level
- This should result in status=1 and status=12 DNS resolution errors
Note: The Envoy_collect tool gathers a tarball with debug logs, config and the following admin endpoints: /stats, /clusters and /server_info. Please note if there are privacy concerns, sanitize the data prior to sharing the tarball/pasting.
Admin and Stats Output:
Include the admin output for the following endpoints: /stats, /clusters, /routes, /server_info. For more information, refer to the admin endpoint documentation.
Note: If there are privacy concerns, sanitize the data prior to sharing.
Config:
Include the config used to configure Envoy.
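The config was not attached to the report. As an illustration only, here is a minimal sketch of the kind of STRICT_DNS cluster the repro steps describe; the cluster name, timeout, refresh rate, and port are placeholders, and the hostname is taken from the logs below.

```yaml
clusters:
- name: example-strict-dns-cluster      # placeholder name
  type: STRICT_DNS                      # hostname is re-resolved on every DNS refresh
  connect_timeout: 5s
  dns_lookup_family: V4_ONLY
  dns_refresh_rate: 60s                 # how often the hostname is re-resolved
  load_assignment:
    cluster_name: example-strict-dns-cluster
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: internal-abc.us-east-2.elb.amazonaws.com   # AWS LB hostname from the logs
              port_value: 443
```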
Logs:
Include the access logs and the Envoy logs.
2024-01-11T20:18:14.819795Z dns resolution for internal-abc.us-east-2.elb.amazonaws.com. started thread=26
2024-01-11T20:18:14.821345Z dns resolution for internal-abc.us-east-2.elb.amazonaws.com. started thread=26
2024-01-11T20:18:14.822802Z dns resolution for internal-abc.us-east-2.elb.amazonaws.com. started thread=26
2024-01-11T20:19:34.386036Z dns resolution for internal-abc.us-east-2.elb.amazonaws.com. failed with c-ares status 12 thread=26
2024-01-11T20:19:34.386045Z dns resolution for internal-abc.us-east-2.elb.amazonaws.com. completed with status 1 log_from_custom_dns_patch thread=26
2024-01-11T20:19:34.386069Z dns resolution for internal-abc.us-east-2.elb.amazonaws.com. failed with c-ares status 12 thread=26
2024-01-11T20:19:34.386076Z dns resolution for internal-abc.us-east-2.elb.amazonaws.com. completed with status 1 log_from_custom_dns_patch thread=26
2024-01-11T20:19:34.386093Z dns resolution for internal-abc.us-east-2.elb.amazonaws.com. failed with c-ares status 12 thread=26
2024-01-11T20:19:34.386099Z dns resolution for internal-abc.us-east-2.elb.amazonaws.com. completed with status 1 log_from_custom_dns_patch thread=26
2024-01-11T20:19:35.653971Z info Envoy proxy is ready
2024-01-11T20:20:34.483098Z dns resolution for internal-abc.us-east-2.elb.amazonaws.com. started thread=26
2024-01-11T20:20:34.517452Z dns resolution for internal-abc.us-east-2.elb.amazonaws.com. started thread=26
2024-01-11T20:20:34.525608Z dns resolution for internal-abc.us-east-2.elb.amazonaws.com. started thread=26
2024-01-11T20:20:39.483516Z dns resolution for internal-abc.us-east-2.elb.amazonaws.com. completed with status 0 log_from_custom_dns_patch thread=26 → First successful DNS resolution, 65 seconds after Envoy marked itself as ready.
Note: If there are privacy concerns, sanitize the data prior to sharing.
Call Stack:
If the Envoy binary is crashing, a call stack is required. Please refer to the Bazel Stack trace documentation.
We see a lot of timeouts. Is there a way to configure them? I see the c-ares library has an ARES_OPT_TIMEOUT option, but I don't see a way to override it in Envoy.
Setting wait_for_warm_on_init on the cluster might help. I don't think there's a way to set c-ares' internal timeout at the moment.
cc @yanavlasov @mattklein123 as codeowners
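For reference, a minimal sketch of where that flag sits on a cluster (values are illustrative):

```yaml
clusters:
- name: example-strict-dns-cluster   # placeholder
  type: STRICT_DNS
  connect_timeout: 5s
  # Defaults to true: the cluster must finish its initial warming (the first
  # DNS resolution attempt) before Envoy completes initialization.
  wait_for_warm_on_init: true
  # load_assignment omitted for brevity
```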
wait_for_warm_on_init
@zuercher - this is set to true by default.
@lambdai / @howardjohn - will be glad if you can throw some light on this one.
Similar to https://github.com/envoyproxy/envoy/issues/20562
It appears that Envoy marks itself ready even when DNS resolution fails with the following c-ares error codes:
- 12 - ARES_ETIMEOUT
- 11 - ARES_ECONNREFUSED
- 16 - ARES_EDESTRUCTION
cc: @lambdai @alyssawilk @mattklein123
After patching Envoy with an increased DNS resolution timeout and a higher number of retries, we saw a roughly 90% reduction in errors.
The experiment showed that the issue is rooted in the way Envoy performs DNS resolution.
My proposal to solve this issue is threefold, with each part providing a fallback if the previous one fails:
- Envoy should batch DNS resolution queries, so that the socket/channel is not bombarded with thousands of requests at the same time.
- Envoy should categorize DNS failures as client-side or server-side, and for client-side failures Envoy should not mark itself as ready.
- When Envoy receives a request for a cluster that has no endpoint IPs because DNS resolution failed, Envoy should retry the resolution X times at runtime before giving up.
When there are 2 STRICT_DNS clusters with the same endpoint, Envoy does 2 DNS resolutions.
Can the DNS resolution mechanism be optimized to avoid duplicate DNS resolutions? cc: @alyssawilk @mattklein123 @yuval-k
@zuercher one question on c-ares: if we have 3 STRICT_DNS clusters, does c-ares open 3 persistent connections to the upstream resolver?
@nirvanagit are you setting the dns cache config? I think you should be able to aim 2 clusters at one cache and avoid the duplication
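For reference, a sketch of what aiming two clusters at one cache looks like with dynamic forward proxy clusters: caches are keyed by name, so clusters that reference the same dns_cache_config name share resolutions. Cluster and cache names are placeholders, and per the follow-up below this deployment uses plain STRICT_DNS clusters, not DFP.

```yaml
clusters:
- name: dfp_cluster_a                  # placeholder
  lb_policy: CLUSTER_PROVIDED
  cluster_type:
    name: envoy.clusters.dynamic_forward_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.clusters.dynamic_forward_proxy.v3.ClusterConfig
      dns_cache_config:
        name: shared_dns_cache         # same cache name as below ...
        dns_lookup_family: V4_ONLY
- name: dfp_cluster_b                  # placeholder
  lb_policy: CLUSTER_PROVIDED
  cluster_type:
    name: envoy.clusters.dynamic_forward_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.clusters.dynamic_forward_proxy.v3.ClusterConfig
      dns_cache_config:
        name: shared_dns_cache         # ... so both clusters reuse one set of resolutions
        dns_lookup_family: V4_ONLY
```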
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
@alyssawilk looks like they are using regular STRICT_DNS clusters, not the dynamic forward proxy, so no caching is involved here?
@mattklein123 @alyssawilk Can you please help answer this question: would the dynamic forward proxy maintain a persistent connection for all lookups, or does it tear down the connection after each lookup? One of the problems we are seeing with these STRICT_DNS clusters is that one of the CoreDNS pods is overwhelmed with a lot of connections. Have you seen this?
ah sorry I'm much more familiar with DFP than strict DNS.
would the dynamic forward proxy maintain a persistent connection for all lookups
I don't even know what this means?
for DFP we do a DNS lookup per hostname (DNS lookups are UDP and don't have connections associated), then cache the result until the TTL runs out at which point there's another lookup.
The persistent connections are the TCP connections upstream. If there's a new DNS resolution we'll continue using the connection (latched to the old address) until the connection is closed.
The DFP, as it uses the DNS cache, also supports stale DNS: when a DNS entry expires and re-resolution fails, you can configure the cache to keep using the last successful result. Sounds like the problem is that STRICT_DNS doesn't get any of these benefits; it may be worth adding that as an optional feature.
I don't even know what this means?
"Persistent connection" was the wrong choice of words here. When Envoy sends DNS queries to CoreDNS, what we have observed in some environments is that it always sends them to one single CoreDNS pod, flooding that pod. Curious if there is something in Envoy/c-ares that makes it choose the same pod when DNS lookups are done for multiple STRICT_DNS clusters.
https://github.com/envoyproxy/envoy/issues/7965 - Found this, and a possible fix in c-ares: https://github.com/c-ares/c-ares/pull/549. Is it OK to add this configuration to DNSResolverConfig?
@alyssawilk ^^ WDYT?
If you've tested that this addresses your problem, SGTM. We could either add a permanent knob, or set a "sensible default" behind a runtime guard and add a knob if anyone dislikes the default.
https://github.com/envoyproxy/envoy/pull/33551 - adding a permanent knob here
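For illustration, here is roughly how a knob on the c-ares resolver would be wired in through typed_dns_resolver_config. The field name udp_max_queries and its value are assumptions based on the c-ares option added in the PR above, not confirmed from this thread:

```yaml
clusters:
- name: example-strict-dns-cluster   # placeholder
  type: STRICT_DNS
  connect_timeout: 5s
  typed_dns_resolver_config:
    name: envoy.network.dns_resolver.cares
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.network.dns_resolver.cares.v3.CaresDnsResolverConfig
      # Assumed knob: caps the number of queries sent over a single UDP socket
      # before c-ares opens a new one, which helps spread load across DNS
      # server endpoints.
      udp_max_queries: 100
  # load_assignment omitted for brevity
```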
After patching Envoy with an increased DNS resolution timeout and a higher number of retries, we saw a roughly 90% reduction in errors.
@nirvanagit how did you manage to set timeouts on dns_resolver_config? Is that for TCP?
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.
Hello @ramaraochavali, how did you apply this configuration to istio-proxy?