DNS resolution times out when several parallel executeQuery operations are scheduled for the same qname
Is there an existing issue for this?
- [X] I have searched the existing issues
Kong version ($ kong version)
3.0
Current Behavior
When dns_client.resolve() is invoked (by the health checker, the rate limiter, or any other plugin that requires DNS resolution), the 'querying' step is executed several times for the same qname.
Example: "my-service.my-domain.com.cluster.local:1 - cache-miss/scheduled/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/5:my-service.my-domain.com removed/5:my-service.my-domain.com removed/querying/querying/querying/querying/dns server error: 3 name error"
Please refer to the complete log lines below.
https://github.com/Kong/kong/blob/master/kong/runloop/balancer/targets.lua#L489
https://github.com/Kong/kong/blob/master/kong/resty/dns/client.lua#L745
Is this due to several concurrent requests causing several executeQuery operations to be scheduled at the same time?
https://github.com/Kong/kong/blob/master/kong/resty/dns/client.lua#L800
Whenever duplicate 'querying' steps are involved, subsequent operations time out, causing the entire DNS resolution to fail.
What is the impact of having hundreds or thousands of executeQuery operations starting in parallel? Is there any Kong configuration to tweak this behavior?
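To make the concern concrete, here is a minimal OpenResty sketch (not Kong's actual client; schedule_query, queries_in_flight and the messages are hypothetical) of an asynchronous lookup that schedules one ngx.timer.at callback per qname that is not already in flight, and the point where exhausting the worker's timer capacity would surface as an error:

```lua
-- Sketch only: at most one background timer per qname that is not already in flight.
local queries_in_flight = {}

local function schedule_query(qname, do_query)
  if queries_in_flight[qname] then
    return "scheduled"                       -- piggy-back on the running query
  end
  queries_in_flight[qname] = true

  local ok, err = ngx.timer.at(0, function(premature)
    if not premature then
      do_query(qname)                        -- the 'querying' step
    end
    queries_in_flight[qname] = nil
  end)

  if not ok then
    queries_in_flight[qname] = nil
    -- with hundreds or thousands of parallel schedules, this is where errors
    -- such as "too many pending timers" would show up
    return nil, "failed to create timer: " .. err
  end
  return "scheduled"
end
```

In this sketch the check and the flag update happen without a yield in between, so callers within one worker are coalesced; my question above is whether Kong's client guarantees the same for every code path seen in the logs.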
As per the Nginx documentation for ngx.timer.at:
Because timer callbacks run in the background and their running time will not add to any client request's response time, they can easily accumulate in the server and exhaust system resources due to either Lua programming mistakes or just too much client traffic. To prevent extreme consequences like crashing the Nginx server, there are built-in limitations on both the number of "pending timers" and the number of "running timers" in an Nginx worker process. The "pending timers" here mean timers that have not yet been expired and "running timers" are those whose user callbacks are currently running.
https://github.com/openresty/lua-nginx-module#ngxtimerat
There seem to be Nginx-level configuration directives for this: lua_max_pending_timers (default: 1024) and lua_max_running_timers (default: 256).
Is there a chance of a leak in Kong that causes more timers to be created, hitting these limits?
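For reference, both directives belong in the http{} block of the Nginx configuration; the values below are only an illustration of raising the defaults (with Kong they would have to be applied through its Nginx directive injection rather than a hand-edited nginx.conf), and raising them only buys headroom rather than fixing a leak:

```nginx
http {
    lua_max_pending_timers 16384;   # default 1024
    lua_max_running_timers 4096;    # default 256
}
```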
Example log from the health checker:
[warn] 1110#0: *5211274 [lua] targets.lua:492: queryDns(): querying dns for my-service.my-domain.com failed: dns client error: 101 empty record received. Tried [
"(short)my-service.my-domain.com:(na) - cache-miss",
"my-service.my-domain.com.kong-core.svc.cluster.local:5 - cache-miss/scheduled/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout",
"my-service.my-domain.com.svc.cluster.local:5 - cache-miss/scheduled/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout",
"my-service.my-domain.com.cluster.local:5 - cache-miss/scheduled/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout",
"my-service.my-domain.com.eu-west-1.compute.internal:5 - cache-miss/scheduled/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout",
"my-service.my-domain.com:5 - cache-miss/scheduled/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout",
"my-service.my-domain.com.kong-core.svc.cluster.local:33 - cache-miss/scheduled/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout",
"my-service.my-domain.com.svc.cluster.local:33 - cache-miss/scheduled/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout",
"my-service.my-domain.com.cluster.local:33 - cache-miss/scheduled/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout",
"my-service.my-domain.com.eu-west-1.compute.internal:33 - cache-miss/scheduled/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout",
"my-service.my-domain.com:33 - cache-miss/scheduled/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout",
"my-service.my-domain.com.kong-core.svc.cluster.local:1 - cache-miss/scheduled/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout",
"my-service.my-domain.com.svc.cluster.local:1 - cache-miss/scheduled/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout",
"my-service.my-domain.com.cluster.local:1 - cache-miss/scheduled/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/5:my-service.my-domain.com removed/5:my-service.my-domain.com removed/querying/querying/querying/querying/dns server error: 3 name error",
"my-service.my-domain.com.eu-west-1.compute.internal:1 - cache-miss/in progress (sync)/dns server error: 3 name error",
"my-service.my-domain.com:1 - cache-miss/in progress (sync)/dns client error: 101 empty record received"],
client: 12.3.0.8, server: kong, request: "GET /my-resource", host: "my-service.my-domain.com"
Example log from the rate-limiting plugin:
[error] 1110#0: *1926473 [kong] handler.lua:143 [rate-limiting] failed to get usage: [cosocket] DNS resolution failed: dns client error: 101 empty record received. Tried:
["(short)redis-core-master.redis-core.svc.cluster.local:(na) - cache-miss",
"redis-core-master.redis-core.svc.cluster.local.kong-core.svc.cluster.local:1 - cache-miss/in progress (sync)/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout",
"redis-core-master.redis-core.svc.cluster.local.svc.cluster.local:1 - cache-miss/in progress (sync)/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout",
"redis-core-master.redis-core.svc.cluster.local.cluster.local:1 - cache-miss/in progress (sync)/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout",
"redis-core-master.redis-core.svc.cluster.local.eu-west-1.compute.internal:1 - cache-miss/in progress (sync)/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout",
"redis-core-master.redis-core.svc.cluster.local:1 - cache-miss/in progress (sync)/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout",
"redis-core-master.redis-core.svc.cluster.local.kong-core.svc.cluster.local:33 - cache-miss/in progress (sync)/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout",
"redis-core-master.redis-core.svc.cluster.local.svc.cluster.local:33 - cache-miss/in progress (sync)/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout",
"redis-core-master.redis-core.svc.cluster.local.cluster.local:33 - cache-miss/in progress (sync)/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout",
"redis-core-master.redis-core.svc.cluster.local.eu-west-1.compute.internal:33 - cache-miss/in progress (sync)/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout",
"redis-core-master.redis-core.svc.cluster.local:33 - cache-miss/in progress (sync)/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout",
"redis-core-master.redis-core.svc.cluster.local.kong-core.svc.cluster.local:5 - cache-miss/in progress (sync)/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout",
"redis-core-master.redis-core.svc.cluster.local.svc.cluster.local:5 - cache-miss/in progress (sync)/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout",
"redis-core-master.redis-core.svc.cluster.local.cluster.local:5 - cache-miss/in progress (sync)/querying/querying/querying/querying/try 1 error: failed to create a resolver: failed to set peer name: error/scheduled/try 2 error: failed to create a resolver: failed to set peer name: resource temporarily unavailable/dns lookup pool exceeded retries (1): failed to create a resolver: failed to set peer name: resource temporarily unavailable",
"redis-core-master.redis-core.svc.cluster.local.eu-west-1.compute.internal:5 - cache-hit/dns server error: 3 name error","redis-core-master.redis-core.svc.cluster.local:5 - cache-hit/dns client error: 101 empty record received"],
client: 4.24.5.27, server: kong, request: "POST /my-resource", host: "my-service.my-domain.com"
Expected Behavior
- When dns_client.resolve() is invoked for a specific hostname (e.g. my-service.my-domain.com), the 'querying' step should not be executed several times for the same qname, as happens in this example:
"my-service.my-domain.com.cluster.local:1 - cache-miss/scheduled/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/5:my-service.my-domain.com removed/5:my-service.my-domain.com removed/querying/querying/querying/querying/dns server error: 3 name error"
- The DNS client should not leak timers and should handle timer creation gracefully in high-load environments.
- DNS queries should not time out because of a leak in the usage of Nginx timers.
Steps To Reproduce
This issue is seen in a high-load environment deployed on Kubernetes, with Kong 3.0 in DB-less mode. I could not reproduce the behavior in other setups.
Anything else?
No response
Hello @surenraju-careem,
Could you edit the issue to explain what this means:
Example:
"my-service.my-domain.com.cluster.local:1 - cache-miss/scheduled/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/5:my-service.my-domain.com removed/5:my-service.my-domain.com removed/querying/querying/querying/querying/dns server error: 3 name error"
This expectation isn't fully correct:
- DNS queries should not time out
DNS queries can time out for any number of reasons (network failure for example) and in those cases we should time out. I believe the expectation is:
- DNS queries should not time out because of a massive amount of concurrent DNS queries on the same resource
Hi @kikito,
Thanks for looking into the issue. I have updated the Expected Behavior section, although it is not much different from the previous one.
DNS queries can time out for any number of reasons (network failure for example) and in those cases we should time out.
Yes, I do understand that DNS queries can time out for various reasons. However, in this specific case, for DNS resolution of the same hostname (e.g. my-service.my-domain.com), the DNS client starts multiple Nginx timer threads. In a high-traffic environment, what I suspect is that a lot of timer threads are pending due to a leak in Nginx timers, and this is causing the timeout.
The syncQuery method is supposed to combine multiple requests for the same hostname and query only once. However, possibly due to a race condition, multiple Nginx timers are started for the same hostname, which produces this long log message with multiple 'querying' steps, because multiple timer threads are querying DNS for the same hostname.
Example: "my-service.my-domain.com.cluster.local:1 - cache-miss/scheduled/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/querying/5:my-service.my-domain.com removed/5:my-service.my-domain.com removed/querying/querying/querying/querying/dns server error: 3 name error"
https://github.com/Kong/kong/blob/master/kong/resty/dns/client.lua#L800
This is exactly the place where DNS resolution times out; it happens when poolMaxWait is exceeded.
https://github.com/Kong/kong/blob/master/kong/resty/dns/client.lua#L868
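To illustrate what I mean by combining requests, here is a minimal sketch (not Kong's implementation; do_query, POOL_MAX_WAIT and the error text are placeholders) of the single-flight behaviour syncQuery is expected to provide: the first caller performs the lookup, later callers wait on a semaphore, and the bounded wait is what turns into the "dns lookup pool exceeded" style timeout:

```lua
local semaphore = require "ngx.semaphore"

local inflight = {}        -- qname -> { sema = ..., waiters = n, result, err }
local POOL_MAX_WAIT = 10   -- seconds; stands in for poolMaxWait

local function resolve_single_flight(qname, do_query)
  local item = inflight[qname]

  if item then
    -- a lookup for this qname is already running: queue behind it
    item.waiters = item.waiters + 1
    local ok, err = item.sema:wait(POOL_MAX_WAIT)
    if not ok then
      return nil, "dns lookup pool exceeded wait time: " .. err
    end
    return item.result, item.err
  end

  -- first caller for this qname: record the in-flight query, then resolve
  item = { sema = semaphore.new(), waiters = 0 }
  inflight[qname] = item

  item.result, item.err = do_query(qname)   -- the single 'querying' step

  inflight[qname] = nil
  if item.waiters > 0 then
    item.sema:post(item.waiters)            -- wake everyone who queued up
  end
  return item.result, item.err
end
```

If several callers manage to take the "first caller" branch for the same qname, each of them runs its own 'querying' step, which is exactly the pattern the logs above show.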
Hello @surenraju-careem, we don't know how to reproduce the issue you are having.
Are you sure your DNS configuration is correct?
The fact that you are seeing so many querying/querying/querying... entries seems to suggest that there might be some sort of DNS loop in your configuration. This is what happens, for example, when a pair of DNS servers forward to each other, or when server A forwards to server B, server B to C, and C back to A.
Internal tracking: KAG-357 / KAG-762
Hi @kikito,
We run Kong 3.0 in Kubernetes. Previously, we had ndots set to 5, so for a hostname it was searching through all the search domains.
It was also likely that the DNS query was first forwarded to Kubernetes CoreDNS, then to Route 53, and so on, since we were running Kong on Kubernetes and the upstream services were hosted in multiple different AWS accounts behind Route 53.
Thanks, Suren.
Hi @surenraju-careem
We run Kong 3.0 in Kubernetes. Previously, we had ndots set to 5, so for a hostname it was searching through all the search domains.
It was also likely that the DNS query was first forwarded to Kubernetes CoreDNS, then to Route 53, and so on, since we were running Kong on Kubernetes and the upstream services were hosted in multiple different AWS accounts behind Route 53.
Does this mean that you have identified a possible DNS problem on your end, then?
Hi @kikito,
There are no recursive DNS queries, as you suspect. But a DNS query getting routed to multiple DNS servers (CoreDNS, Route 53) is expected, right?
If DNS server recursion (A -> B -> C -> A) were the issue in our environment, it should be reproducible consistently in all the Kong pods all the time. In our case, the issue was not consistent: it happens only during peak traffic hours, only on one or a few Kong pods, and Kong recovers from it after, for example, 15 or 30 minutes.
I am just curious to understand what happens if several concurrent requests try to resolve DNS right after the DNS cache expires. Is there a possibility in Kong of several concurrent requests causing several executeQuery timer operations to be scheduled at the same time, overwhelming the Kong timer system?
https://github.com/Kong/kong/blob/master/kong/resty/dns/client.lua#L800
I tried to explain the issue we faced in detail here.
Any fixes for this? I am also getting the same issue. I am using version 3.3.0.
Hi @223311, unfortunately as of right now I don't have any updates on this one. We're prioritizing this internally and will look into it. I realize it's been some time since the internal tickets were created, so I'll bring them up again for review.
Internal reference: FTI-5417
I'm hitting the same issue, and by the looks of it it's related to DNS refresh, as @surenraju-careem mentioned. In my case I have a lot of unique hosts that Kong calls out to, which may expire at the same time and lead to timer exhaustion caused by asyncQuery. The default max_threads of timer-ng seems to be the root cause (Kong doesn't change the default when initializing). I guess this should be made configurable.
We discussed the DNS issue at Kong API Summit 2023. https://app.hopin.com/events/api-summit-by-kong/replay/cut_738e61b8-91f1-44e2-960d-5578412e8a91
Dear contributor, We're closing this issue as there hasn't been any update to it for a long time. If the issue is still relevant in the latest version, please feel free to reopen it. We're more than happy to revisit it again. Your contribution is greatly appreciated! Please have a look at our pledge to the community for more information. Sincerely, Kong Gateway Team
Hi all! I just wanted to give you an update that we've recently been taking a deeper look at this issue and have identified a problem with our DNS client. It is a little bit complex, so there's a two-step solution: a quick workaround, and a broader fix.
The workaround is switching the dns_no_sync config option to on. You can do it yourself in your config; nevertheless, we've temporarily enabled it by default in the 3.5.0 release: https://github.com/Kong/kong/pull/11869
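For anyone landing here, a minimal way to apply the workaround (assuming a standard kong.conf or environment-variable based deployment):

```
# in kong.conf
dns_no_sync = on

# or as an environment variable, e.g. in the container spec
KONG_DNS_NO_SYNC=on
```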
However, there's also a broader fix to this issue, which is part of https://github.com/Kong/kong/pull/11900. It's currently being reviewed and tested, so unfortunately it probably won't make it into 3.5.0.
@surenraju-careem, could you confirm that switching dns_no_sync to on alleviates the problem?
Hello @nowNick, I confirm that switching dns_no_sync to on alleviates the problem. We're suffering from the same issue.
Regards.
This issue is marked as stale because it has been open for 14 days with no activity.
Dear contributor,
We are automatically closing this issue because it has not seen any activity for three weeks. We're sorry that your issue could not be resolved. If any new information comes up that could help resolving it, please feel free to reopen it.
Your contribution is greatly appreciated!
Please have a look at our pledge to the community for more information.
Sincerely, Your Kong Gateway team