uptime-kuma
[Bug]: DNS queryA Fails
👟 Reproduction steps
Set up a DNS monitor using the default Cloudflare resolver server of 1.1.1.1.
👍 Expected behavior
Monitor shouldn't trigger as DOWN regularly when the actual domain's DNS is resolving just fine.
To better diagnose the underlying problem I set up a nearly identical UK DNS monitor using Google DNS (8.8.8.8/8.8.4.4), and no UK incidents have been seen since! The other added bonus: Google DNS seems to support 'ANY/ALL' DNS queries whereas Cloudflare does not, meaning we have a way to gather most of the DNS record types for the domain.
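For anyone who wants to check the ANY behavior themselves, here is a minimal Node sketch (the domain is a placeholder, not from this report; Cloudflare is expected to minimize or refuse ANY answers per RFC 8482):

```js
// Compare ANY-query support between resolvers.
// "example.com" is a placeholder domain.
const { Resolver } = require("dns").promises;

async function queryAny(server, hostname) {
    const resolver = new Resolver();
    resolver.setServers([server]);
    try {
        const records = await resolver.resolveAny(hostname);
        console.log(server, "returned:", records.map((r) => r.type).join(", "));
    } catch (err) {
        // Cloudflare minimizes ANY answers (RFC 8482), so an error or a
        // near-empty answer is expected here.
        console.log(server, "failed:", err.code);
    }
}

queryAny("8.8.8.8", "example.com");
queryAny("1.1.1.1", "example.com");
```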
👎 Actual Behavior
UK frequently detects the domain's DNS A record as DOWN with the message:
queryA ESERVFAIL domain.com
We have many A record DNS monitors in place for multiple domain names and have experienced this across all of them.
🐻 Uptime-Kuma version
1.9.1
💻 Operating System
Ubuntu 20.04
🌐 Browser
Any
🐋 Docker
N/A
🏷️ Docker Image Tag
N/A
🟩 NodeJS Version
14.8.1
📝 Relevant log output
Up 2021-10-31 01:16:24 Records: 123.123.123.123
Down 2021-10-31 01:15:01 queryA ESERVFAIL domain.com
Up 2021-10-30 19:24:56 Records: 123.123.123.123
Down 2021-10-30 19:23:32 queryA ESERVFAIL domain.com
Up 2021-10-30 15:42:27 Records: 123.123.123.123
Down 2021-10-30 15:41:04 queryA ESERVFAIL domain.com
Up 2021-10-30 12:49:59 Records: 123.123.123.123
Down 2021-10-30 12:48:35 queryA ESERVFAIL domain.com
⚠️ Please verify that this bug has NOT been raised before.
- [X] I checked and didn't find a similar issue
🛡️ Security Policy
- [X] I agree to have read this project's Security Policy
I cannot reproduce with 1.1.1.1
> using Google DNS (8.8.8.8/8.8.4.4), and no UK incidents have been seen since!

Sounds like it is a network issue between your host and 1.1.1.1.
@louislam - I appreciate you looking at this so quickly. I also found it strange that one of the world's largest DNS providers (Cloudflare) had this sort of recurring issue (a simple A record lookup!), and I'm still scratching my head as to why the UK dns_resolver setting had such a positive impact after switching to Google DNS. I had both running as their own UK monitors every minute for days, and was getting random yet daily DOWN notifications only for the Cloudflare-based monitors. Another finer detail here: I am running UK on an AWS medium-sized EC2 instance; maybe the fact that it's on Amazon plays a role in this.
ALL - if you have experienced similar issues, please chime in here!
If you are running a large number of DNS monitors, did you test what happens if you switch all of them to 8.8.8.8? In theory dns.resolve() should not be overloaded so easily because it's async, but there might be something in the networking stack that's reusing the connection, or maybe it's Cloudflare implementing a rate limit.
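For context, the check in question boils down to something like this minimal sketch (not Uptime Kuma's exact code; "domain.com" is a placeholder):

```js
// Minimal sketch of an A-record check against a specific resolver.
const { Resolver } = require("dns").promises;

async function checkA(hostname, resolverServer) {
    const resolver = new Resolver();
    resolver.setServers([resolverServer]);
    // resolve(hostname, "A") queries resolverServer directly, bypassing the
    // OS resolver; a SERVFAIL from upstream surfaces as ESERVFAIL.
    return resolver.resolve(hostname, "A");
}

checkA("domain.com", "1.1.1.1")
    .then((records) => console.log("Up. Records:", records.join(", ")))
    .catch((err) => console.log("Down.", err.code, err.hostname));
```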
@chakflying - I'm running only a handful of DNS monitors overall, and all have been reconfigured to use 8.8.8.8. I started to notice the resolution problems with only 2 at the time against 1.1.1.1.
I have experienced the same issue. I also thought it was something with 1.1.1.1, so I switched all my DNS monitors (2 of them) to 8.8.8.8 as the resolver. The problem went away.
- The Heartbeat Interval is 60 seconds, same as Heartbeat Retry Interval
- Retries is 0
I'm not discounting potential networking issues; the Uptime Kuma server is hosted on a dedicated machine in DigitalOcean.

Same issue. I checked and all my DNS servers are live.
I started getting this after release 1.18. The only change in the monitor code was the DNS cache. I'm using an internal DNS server with a ton of monitors, but only three specific monitors for Apache Solr are failing. Other sites monitored on the same server resolve properly.
Wonder if it is because of the port or something? The failing URLs are like http://server:8983/solr/ and passing URLs are like http://dns-on-same-server.
I tried adding the server name to the hosts file but no luck.
Any other ideas?
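One thing that can be ruled out quickly: the port never reaches the resolver, since only the hostname is handed to DNS. A quick sketch (using the hypothetical URL from the comment above):

```js
// The port and path are stripped before any DNS lookup happens;
// only the hostname is passed to dns.lookup()/dns.resolve().
const { hostname, port } = new URL("http://server:8983/solr/");
console.log(hostname); // "server"  (all the resolver ever sees)
console.log(port);     // "8983"    (used only for the TCP connection)
```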
https://github.com/louislam/uptime-kuma/commit/2073f0c28476bb46fb953ecefb9622273e8819d9
@louislam for my special case, I changed from server name to IP address and it works. I suppose it's because my server name is not the A record on the DNS. I wonder if something changed in the Node DNS resolve function to make this happen, because the changes in v18 do not seem to be related to how DNS is resolved.
You mean Node.js v18 or Uptime Kuma 1.18.0?
My custom DNS with port is working fine. I may need more info.
Yeah, it's odd. I tried it on my server and it doesn't work, but from my laptop no problem. I tried adding the server name to the Uptime Kuma server's hosts file, but still no luck. I was referring to Kuma 1.18, but I don't see how any changes in Kuma would have changed my server lookup... and only for the one server. I reference TCP pings by server name and they all work. Maybe it's a fluke.
I had a few other monitors like this that started failing with queryA ESERVFAIL after the server rebooted. I left them alone, and after a day the failures went away. There must be some other cache/matching happening elsewhere that causes it for me... I did reset the server's DNS cache (which is probably also what happened when the server rebooted).
I have the same issue. Starting with Kuma version 1.18, I get queryA ESERVFAIL for all hostnames that aren't on public DNS servers but only on our own Windows DNS server. I tried the 1.17 image, and in that version it works: Kuma can resolve all hostnames.
The problem started a week ago and never healed itself.
Do you have a mix of public/non-public sites?
I wonder if it is because the cached lookup key is based on the options (`maxCachedSessions: 0`) and could maybe be based on something more unique to the monitor? From the new cache code, it looks like the agent is shared among all the monitors now, whereas before it was unique to a monitor. Maybe the monitor ID can be added to the cache key?
Here's the code that changed in the last release. Not much changed, but I'm wondering if it is because the agent is now shared whereas before it was not? I'm not a subject matter expert though. https://github.com/louislam/uptime-kuma/commit/2073f0c28476bb46fb953ecefb9622273e8819d9
What do you think @louislam ?
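To make the suggestion concrete, here is a hypothetical sketch of keying the agent cache on the monitor ID as well as the options; none of these names come from Uptime Kuma's actual code:

```js
// Hypothetical cache-key scheme: include the monitor ID so monitors stop
// sharing one agent (and its DNS lookups). All names here are made up.
const http = require("http");

const agentCache = new Map();

function getAgent(monitorID, agentOptions) {
    const key = JSON.stringify({ monitorID, ...agentOptions });
    if (!agentCache.has(key)) {
        agentCache.set(key, new http.Agent(agentOptions));
    }
    return agentCache.get(key);
}
```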
Yes, I have a mix of public and non-public sites. Public sites worked all the time; non-public didn't work in 1.18.
But I just tested another thing in 1.18 with non-public hosts: I added the Windows domain name to the URL and now Kuma can resolve the hostname. So http://web1 is not working, but http://web1.mydomain.example.com is working.
In my case it's enough knowing this; I don't need to resolve the hostname without the domain, and I'm OK with adding the domain to all my hosts.
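That web1 vs web1.mydomain.example.com split is consistent with the difference between dns.lookup(), which goes through the OS resolver and applies the search suffixes from /etc/resolv.conf, and direct dns.resolve() queries, which send the name exactly as given. A minimal contrast, assuming a host where web1 only resolves via a search suffix:

```js
// dns.lookup() uses getaddrinfo, which applies resolv.conf search domains,
// so the short name "web1" may still resolve.
// dns.resolve4() queries for the literal name "web1", which the upstream
// server answers with SERVFAIL/NXDOMAIN.
// ("web1" is the short name from the comment above.)
const dns = require("dns");

dns.lookup("web1", (err, address) =>
    console.log("lookup:", err ? err.code : address));

dns.resolve4("web1", (err, addresses) =>
    console.log("resolve4:", err ? err.code : addresses));
```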
> I wonder if it is because the cached lookup key is based on the options (`maxCachedSessions: 0`) ... Maybe the monitor ID can be added to the cache key? What do you think @louislam ?
I added cacheable-lookup into Uptime Kuma, so it will cache DNS records.
> Windows-DNS-server

@ljurk Do you mean the DNS server role that can be installed on Windows Server? I may need proper steps in order to reproduce the issue.
I'm just wondering if the problem with the short names is that the cached DNS record is shared across every monitor using the same connection options? Should that key be more complex (include the ID of the monitor, for example)?
I don't think so, because under the same agent options, the HTTP agent is reusable. An HTTP agent is not specific to only one domain.
You can see the example in https://github.com/szmarczak/cacheable-lookup#attaching-cacheablelookup-to-an-agent
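The linked README example boils down to roughly this:

```js
// Attach one CacheableLookup instance to an HTTP agent; every request made
// through that agent then shares the same DNS cache.
const http = require("http");
const CacheableLookup = require("cacheable-lookup");

const cacheable = new CacheableLookup();
cacheable.install(http.globalAgent);

// Any request using this agent now resolves hostnames via the shared cache.
http.get("http://example.com", { agent: http.globalAgent }, (res) => {
    console.log(res.statusCode);
});
```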
And so far, I have not received a large number of similar bug reports, so I assume it's a very specific issue, like @ljurk said; he is using a Windows DNS server.
@louislam Yeah, I'm inside a Windows domain. The domain controller is used for DNS and is running Windows Server. My Docker host is running Ubuntu; it gets the DNS IP via DHCP and I didn't change any DNS-related settings.
I have a similar issue, if not the same one. My current setup has a Pi-hole operating as a DNS server where I have defined DNS entries; my Raspberry Pi has its DNS configured to go to the Pi-hole for all DNS queries. This works fine in all cases to resolve a locally defined address.
I can ping the address in the Uptime Kuma container and it resolves fine, but when using the name in Uptime Kuma, it gives me a "queryAaaa ESERVFAIL" error. When I use the static IP address instead, status monitoring of my HTTP site works fine.
I'm seeing the same behavior.
Monitors resolving against public DNS (Cloudflare) are working fine. Monitors resolving against a private zone in my local network are all failing. It's resolvable from inside the container via ping (using the same DNS server as configured in the monitor). Pi-hole is being used internally.
I have noticed something: internal names with an even number of labels (e.g. aaa.bbb.ccc.ddd or aaa.bbb) do not resolve, but names with an odd number of labels (e.g. aaa.bbb.ccc.ddd.eee or aaa.bbb.ccc) tend to. This is not 100%, but it has helped me get some items registered by DNS entry instead of IP, and I prefer the DNS entry.
cacheable-lookup is not working properly in some cases. With 1.19.x, the DNS cache can now be disabled in Settings.
> I'm seeing the same behavior. Monitors resolving against public DNS (Cloudflare) are working fine. Monitors resolving against a private zone in my local network are all failing.
Hi, I see that my Uptime Kuma can't monitor my Pi-hole. Both are containers on the same machine. I guess the same thing is happening to me as to you, but I don't understand what you said about how to solve it. Can you explain it to me? Thank you.
Are they in the same docker network?
For me they are on different machines, and I am using Pi-hole for DNS. In order to get it to resolve, I originally had ui.app.domain.com; this didn't resolve for Uptime Kuma, but when I changed the name to app.domain.com it worked, and logs.api.app.domain.com would also work because it has five parts: ["logs","api","app","domain","com"].
> Are they in the same docker network?
No. Pi-hole: 172.18.0.7 (Docker IP); Uptime Kuma: 172.16.0.4 (Docker IP). But I use the IP of the host (192.168.1.2), in the same way that other monitors for other containers use the same IP to monitor with ping, etc.
> For me they are on different machines, and I am using Pi-hole for DNS. ...
I only use IP addresses.
I seem to have the same problem. They're also on the same Docker network (I also tried separating them; same problem).
This is my configuration (I have kept 20s for the sake of testing).
What we need is a CURRENT, publicly accessible (=reproducible) test case.
The first part of this issue was resolved when we switched from cacheable-lookup to NSCD (Name Service Cache Daemon).
The other comments are likely unrelated to the first one. I think continuing this in smaller, less messy issues (=issues which allow reproduction) is more productive than piling onto a resolved issue => closing as resolved.