DNS resolution failure results in UH / no healthy upstream
Title: Observing DNS resolution timeout, resulting in UH at pod startup of istio proxy
Description:
What issue is being seen? Describe what should be happening instead of the bug: Envoy should successfully complete DNS resolution, come up with the cluster endpoints populated, and avoid UH / no healthy upstream errors.
Repro steps:
Include sample requests, environment, etc. All data and inputs required to reproduce the bug.
- Create 5000 STRICT_DNS clusters that point to an AWS load balancer (an illustrative cluster config is sketched under Config below)
- Start Envoy with DNS logging enabled at debug level
- This should result in status=1 and status=12 DNS resolution errors
Note: The Envoy_collect tool gathers a tarball with debug logs, config and the following admin endpoints: /stats, /clusters and /server_info. Please note if there are privacy concerns, sanitize the data prior to sharing the tarball/pasting.
Admin and Stats Output:
Include the admin output for the following endpoints: /stats, /clusters, /routes, /server_info. For more information, refer to the admin endpoint documentation.
Note: If there are privacy concerns, sanitize the data prior to sharing.
Config:
Include the config used to configure Envoy.
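The config was not attached to the report. As an illustration only, here is a minimal sketch of the kind of STRICT_DNS cluster the repro steps describe; the cluster name, timeout, refresh rate, and port are placeholders, and the hostname is taken from the logs below.

```yaml
clusters:
- name: example-strict-dns-cluster      # placeholder name
  type: STRICT_DNS                      # hostname is re-resolved on every DNS refresh
  connect_timeout: 5s
  dns_lookup_family: V4_ONLY
  dns_refresh_rate: 60s                 # how often the hostname is re-resolved
  load_assignment:
    cluster_name: example-strict-dns-cluster
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: internal-abc.us-east-2.elb.amazonaws.com   # AWS LB hostname from the logs
              port_value: 443
```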
Logs:
Include the access logs and the Envoy logs.
2024-01-11T20:18:14.819795Z dns resolution for internal-abc.us-east-2.elb.amazonaws.com. started thread=26
2024-01-11T20:18:14.821345Z dns resolution for internal-abc.us-east-2.elb.amazonaws.com. started thread=26
2024-01-11T20:18:14.822802Z dns resolution for internal-abc.us-east-2.elb.amazonaws.com. started thread=26
2024-01-11T20:19:34.386036Z dns resolution for internal-abc.us-east-2.elb.amazonaws.com. failed with c-ares status 12 thread=26
2024-01-11T20:19:34.386045Z dns resolution for internal-abc.us-east-2.elb.amazonaws.com. completed with status 1 log_from_custom_dns_patch thread=26
2024-01-11T20:19:34.386069Z dns resolution for internal-abc.us-east-2.elb.amazonaws.com. failed with c-ares status 12 thread=26
2024-01-11T20:19:34.386076Z dns resolution for internal-abc.us-east-2.elb.amazonaws.com. completed with status 1 log_from_custom_dns_patch thread=26
2024-01-11T20:19:34.386093Z dns resolution for internal-abc.us-east-2.elb.amazonaws.com. failed with c-ares status 12 thread=26
2024-01-11T20:19:34.386099Z dns resolution for internal-abc.us-east-2.elb.amazonaws.com. completed with status 1 log_from_custom_dns_patch thread=26
2024-01-11T20:19:35.653971Z info Envoy proxy is ready
2024-01-11T20:20:34.483098Z dns resolution for internal-abc.us-east-2.elb.amazonaws.com. started thread=26
2024-01-11T20:20:34.517452Z dns resolution for internal-abc.us-east-2.elb.amazonaws.com. started thread=26
2024-01-11T20:20:34.525608Z dns resolution for internal-abc.us-east-2.elb.amazonaws.com. started thread=26
2024-01-11T20:20:39.483516Z dns resolution for internal-abc.us-east-2.elb.amazonaws.com. completed with status 0 log_from_custom_dns_patch thread=26 → First successful DNS resolution, 65 seconds after Envoy marked itself as ready.
Note: If there are privacy concerns, sanitize the data prior to sharing.
Call Stack:
If the Envoy binary is crashing, a call stack is required. Please refer to the Bazel Stack trace documentation.
We see a lot of timeouts. Is there a way to configure them? I see the c-ares library has an ARES_OPT_TIMEOUT option, but I don't see a way to override it in Envoy.
Setting wait_for_warm_on_init on the cluster might help. I don't think there's a way to set c-ares' internal timeout at the moment.
cc @yanavlasov @mattklein123 as codeowners
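For reference, a minimal sketch of where that flag sits on a cluster (values are illustrative):

```yaml
clusters:
- name: example-strict-dns-cluster   # placeholder
  type: STRICT_DNS
  connect_timeout: 5s
  # Defaults to true: the cluster must finish its initial warming (the first
  # DNS resolution attempt) before Envoy completes initialization.
  wait_for_warm_on_init: true
  # load_assignment omitted for brevity
```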
wait_for_warm_on_init
@zuercher - this is set to true by default.
@lambdai / @howardjohn - will be glad if you can throw some light on this one.
Similar to https://github.com/envoyproxy/envoy/issues/20562
It appears that Envoy marks itself ready even when DNS resolution fails with the following c-ares error codes:
- 12 - ARES_ETIMEOUT
- 11 - ARES_ECONNREFUSED
- 16 - ARES_EDESTRUCTION
cc: @lambdai @alyssawilk @mattklein123
After patching Envoy with an increased DNS resolution timeout and a higher number of retries, we saw a roughly 90% reduction in errors.
The experiment showed that the issue is rooted in the way Envoy performs DNS resolution.
My proposal to solve this issue is threefold, with each part providing a fallback if the previous one fails:
- Envoy should batch DNS resolution queries, so that the socket/channel is not bombarded with thousands of requests at the same time.
- Envoy should categorize DNS failures as client-side or server-side, and for client-side failures Envoy should not mark itself as ready.
- When Envoy receives a request for a cluster that has no endpoint IPs because DNS resolution failed, Envoy should retry the resolution X times at runtime before giving up.
When there are 2 STRICT_DNS clusters with the same endpoint, Envoy does 2 DNS resolutions.
Can the DNS resolution mechanism be optimized to avoid duplicate DNS resolutions? cc: @alyssawilk @mattklein123 @yuval-k
@zuercher one question on c-ares: if we have 3 STRICT_DNS clusters, does c-ares open 3 persistent connections to the upstream resolver?
@nirvanagit are you setting the dns cache config? I think you should be able to aim 2 clusters at one cache and avoid the duplication
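For reference, a sketch of what aiming two clusters at one cache looks like with dynamic forward proxy clusters: caches are keyed by name, so clusters that reference the same dns_cache_config name share resolutions. Cluster and cache names are placeholders, and per the follow-up below this deployment uses plain STRICT_DNS clusters, not DFP.

```yaml
clusters:
- name: dfp_cluster_a                  # placeholder
  lb_policy: CLUSTER_PROVIDED
  cluster_type:
    name: envoy.clusters.dynamic_forward_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.clusters.dynamic_forward_proxy.v3.ClusterConfig
      dns_cache_config:
        name: shared_dns_cache         # same cache name as below ...
        dns_lookup_family: V4_ONLY
- name: dfp_cluster_b                  # placeholder
  lb_policy: CLUSTER_PROVIDED
  cluster_type:
    name: envoy.clusters.dynamic_forward_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.clusters.dynamic_forward_proxy.v3.ClusterConfig
      dns_cache_config:
        name: shared_dns_cache         # ... so both clusters reuse one set of resolutions
        dns_lookup_family: V4_ONLY
```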
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
@alyssawilk looks like they are using regular STRICT_DNS clusters, not the dynamic forward proxy, so no caching is involved here?
@mattklein123 @alyssawilk Can you please help answer this question: would the dynamic forward proxy maintain a persistent connection for all lookups, or does it tear down the connection after each lookup? One of the problems we are seeing with these STRICT_DNS clusters is that one of the CoreDNS pods is overwhelmed with a lot of connections. Have you seen this?
ah sorry I'm much more familiar with DFP than strict DNS.
would the dynamic forward proxy maintain a persistent connection for all lookups
I don't even know what this means?
for DFP we do a DNS lookup per hostname (DNS lookups are UDP and don't have connections associated), then cache the result until the TTL runs out at which point there's another lookup.
The persistent connections are the TCP connections upstream. If there's a new DNS resolution we'll continue using the connection (latched to the old address) until the connection is closed.
The DFP, as it uses the DNS cache, also supports stale DNS: when a DNS entry expires and re-resolution fails, you can configure the cache to keep using the last successful result. Sounds like the problem is that STRICT_DNS doesn't get any of these benefits; it may be worth adding that as an optional feature.
I don't even know what this means?
"Persistent connection" was the wrong choice of words here. When Envoy sends DNS queries to CoreDNS, what we have observed in some environments is that it always sends them to one single CoreDNS pod, flooding that pod. Curious if there is something in Envoy/c-ares that makes it choose the same pod when DNS lookups are done for multiple STRICT_DNS clusters.
https://github.com/envoyproxy/envoy/issues/7965 - Found this, and a possible fix in c-ares: https://github.com/c-ares/c-ares/pull/549. Is it OK to add this configuration to DNSResolverConfig?
@alyssawilk ^^ WDYT?
If you've tested that this addresses your problem, SGTM. We could either add a permanent knob, or set a "sensible default" behind a runtime guard and add a knob if anyone dislikes the default.
https://github.com/envoyproxy/envoy/pull/33551 - adding a permanent knob here
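For illustration, here is roughly how a knob on the c-ares resolver would be wired in through typed_dns_resolver_config. The field name udp_max_queries and its value are assumptions based on the c-ares option added in the PR above, not confirmed from this thread:

```yaml
clusters:
- name: example-strict-dns-cluster   # placeholder
  type: STRICT_DNS
  connect_timeout: 5s
  typed_dns_resolver_config:
    name: envoy.network.dns_resolver.cares
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.network.dns_resolver.cares.v3.CaresDnsResolverConfig
      # Assumed knob: caps the number of queries sent over a single UDP socket
      # before c-ares opens a new one, which helps spread load across DNS
      # server endpoints.
      udp_max_queries: 100
  # load_assignment omitted for brevity
```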
After patching Envoy with an increased DNS resolution timeout and a higher number of retries, we saw a roughly 90% reduction in errors.
@nirvanagit how did you manage to set timeouts on dns_resolver_config? Is that for TCP?
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.
Hello @ramaraochavali, how did you apply this configuration to istio-proxy?