
[Bug]: DNS queryA Fails

Open arch1v1st opened this issue 3 years ago • 19 comments

👟 Reproduction steps

Set up a DNS monitor using the default Cloudflare resolver server, 1.1.1.1.

👍 Expected behavior

The monitor shouldn't regularly be marked DOWN when the domain's DNS is resolving just fine.

To better diagnose the underlying problem, I set up a nearly identical UK DNS monitor using Google DNS (8.8.8.8/8.8.4.4), and no UK incidents have been seen since! The other added bonus: Google DNS seems to support 'ANY/ALL' DNS queries whereas Cloudflare does not, meaning we have a way to gather most of the DNS record types for the domain.

👎 Actual Behavior

UK frequently detects the domain's DNS A record as DOWN with the message:

queryA ESERVFAIL domain.com

We have many A record DNS monitors in place for multiple domain names; we've experienced this across all of them.

🐻 Uptime-Kuma version

1.9.1

💻 Operating System

Ubuntu 20.04

🌐 Browser

Any

🐋 Docker

N/A

🏷️ Docker Image Tag

N/A

🟩 NodeJS Version

14.8.1

📝 Relevant log output

Up	2021-10-31 01:16:24	Records: 123.123.123.123
Down	2021-10-31 01:15:01	queryA ESERVFAIL domain.com
Up	2021-10-30 19:24:56	Records: 123.123.123.123
Down	2021-10-30 19:23:32	queryA ESERVFAIL domain.com
Up	2021-10-30 15:42:27	Records: 123.123.123.123
Down	2021-10-30 15:41:04	queryA ESERVFAIL domain.com
Up	2021-10-30 12:49:59	Records: 123.123.123.123
Down	2021-10-30 12:48:35	queryA ESERVFAIL domain.com

⚠️ Please verify that this bug has NOT been raised before.

  • [X] I checked and didn't find similar issue

🛡️ Security Policy

arch1v1st, Nov 02 '21 02:11

I cannot reproduce with 1.1.1.1

using Google DNS (8.8.8.8/8.8.4.4), and no UK incidents have been seen since!

Sounds like a network issue between your server and 1.1.1.1.

louislam, Nov 02 '21 02:11

@louislam - I appreciate your looking into this so quickly. I also found it strange that one of the world's largest DNS providers (Cloudflare) had these sorts of recurring issues (a simple A record lookup!), and I'm still scratching my head as to why the UK dns_resolver setting had such a positive impact after switching to Google DNS. I had both running as their own UK monitors every minute for days, and was getting random yet daily DOWN notifications only for the Cloudflare-based monitors. Another finer detail: I am running UK on a medium-sized AWS EC2 instance; maybe the fact that it's on Amazon plays a role.

ALL - if you have experienced similar issues, please chime in here!

arch1v1st, Nov 02 '21 02:11

If you are running a large number of DNS monitors, did you test what happens if you switch all of them to 8.8.8.8? In theory dns.resolve() should not be overloaded so easily because it's async, but there might be something in the networking stack that's reusing the connection, or maybe Cloudflare is implementing a rate limit.

chakflying, Nov 02 '21 03:11

@chakflying - I'm running only a handful of DNS monitors overall, and all have been reconfigured to use 8.8.8.8. I started to notice the resolution problems with only 2 at the time against 1.1.1.1.

arch1v1st, Nov 02 '21 08:11

I also have experienced the same issue. I also thought it was something with 1.1.1.1, so I switched all my DNS monitors (2 of them) to 8.8.8.8 as the resolver. The problem went away.

  • The Heartbeat Interval is 60 seconds, same as Heartbeat Retry Interval
  • Retries is 0

I'm not discounting potential networking issues; the Uptime Kuma server is hosted on a dedicated machine at DigitalOcean.

kingforaday, Nov 10 '21 14:11

Same issue. I checked and all my DNS servers are live.

SteveD70, Apr 19 '22 14:04

I started getting this after release 1.18. The only change in the monitor code was the DNS cache. I'm using an internal DNS server with a ton of monitors, but only three specific monitors for Apache Solr are failing. Other sites monitored on the same server resolve properly.

I wonder if it is because of the port or something? The failing URLs are like http://server:8983/solr/ and passing URLs are like http://dns-on-same-server.

I tried adding the server name to the hosts file but no luck.

Any other ideas?

https://github.com/louislam/uptime-kuma/commit/2073f0c28476bb46fb953ecefb9622273e8819d9

christopherpickering, Sep 05 '22 13:09

@louislam for my special case, I changed from the server name to the IP address and it works. I suppose it's because my server name is not the A record on the DNS. I wonder if something changed in the Node DNS resolve function to make this happen, because the changes in v18 do not seem to be related to how DNS is resolved.

christopherpickering, Sep 05 '22 13:09

@louislam for my special case, I changed from the server name to the IP address and it works. I suppose it's because my server name is not the A record on the DNS. I wonder if something changed in the Node DNS resolve function to make this happen, because the changes in v18 do not seem to be related to how DNS is resolved.

You mean Node.js v18 or Uptime Kuma 1.18.0?

My custom DNS with a port is working fine. I may need more info.

louislam, Sep 05 '22 15:09

Yeah, it's odd. I tried it on my server and it doesn't work, but from my laptop there's no problem. I tried adding the server name to the uptime server's hosts file, but still no luck. I was referring to Kuma 1.18, but I don't see how any changes in Kuma would have changed my server lookup... and only for the one server. I reference TCP pings by server name and they all work. Maybe it's a fluke.

christopherpickering, Sep 05 '22 16:09

I had a few other monitors like this one that started failing with queryA ESERVFAIL, and I rebooted the server. I left them alone and after a day the failures went away. There must be some other cache/matching happening elsewhere that causes it for me... I did reset the server's DNS cache (which is also probably what happened when the server rebooted).

christopherpickering, Sep 09 '22 08:09

I have the same issue. Starting with Kuma version 1.18, I get queryA ESERVFAIL for all hostnames that aren't on public DNS servers but only on our own Windows DNS server. I tried using the 1.17 image, and in that version it's working; Kuma can resolve all hostnames. The problem started a week ago and never healed itself.

ljurk, Sep 12 '22 11:09

Do you have a mix of public/non-public sites? I wonder if it is because the cached lookup key is based on the options (maxCachedSessions: 0) and could maybe be based on something more unique to the monitor. From the new cache code, it looks like the agent is shared among all the monitors now, whereas before it was unique to a monitor. Maybe the monitor ID can be added to the cache key?

Here's the code that changed in the last release. Not much changed, but I'm wondering if it is because the agent is now shared, whereas before it was not? I'm not a subject matter expert though. https://github.com/louislam/uptime-kuma/commit/2073f0c28476bb46fb953ecefb9622273e8819d9

What do you think @louislam ?

christopherpickering, Sep 12 '22 11:09

Yes, I have a mix of public and non-public sites. Public sites worked all the time; non-public sites didn't work in 1.18. But I just tested another thing in 1.18 with non-public hosts: I added the Windows domain name to the URL and now Kuma can resolve the hostname. So http://web1 is not working, but http://web1.mydomain.example.com is working. In my case it's enough to know this; I don't need to resolve the hostname without the domain, and I'm OK with adding the domain to all my hosts.

ljurk, Sep 12 '22 13:09

Do you have a mix of public/non-public sites? I wonder if it is because the cached lookup key is based on the options (maxCachedSessions: 0) and could maybe be based on something more unique to the monitor. From the new cache code, it looks like the agent is shared among all the monitors now, whereas before it was unique to a monitor. Maybe the monitor ID can be added to the cache key?

Here's the code that changed in the last release. Not much changed, but I'm wondering if it is because the agent is now shared, whereas before it was not? I'm not a subject matter expert though. 2073f0c

What do you think @louislam ?

I added cacheable-lookup into Uptime Kuma, so it caches DNS records.

Windows-DNS-server

@ljurk Do you mean the DNS Server that can be installed on Windows Server? I may need proper steps in order to reproduce the issue.

louislam, Sep 12 '22 14:09

I'm just wondering if the problem with the short names is that the cached DNS record is shared across every monitor using the same connection options? Should that key be more complex (include the ID of the monitor, for example)?

christopherpickering, Sep 12 '22 14:09

I'm just wondering if the problem with the short names is that the cached DNS record is shared across every monitor using the same connection options? Should that key be more complex (include the ID of the monitor, for example)?

I don't think so, because under the same agent options the HTTP agent is reusable. An HTTP agent is not tied to only one domain.

You can see the example in https://github.com/szmarczak/cacheable-lookup#attaching-cacheablelookup-to-an-agent

And so far, I have not received a large number of similar bug reports, so I assume it is a very specific issue; like @ljurk said, he is using a Windows DNS Server.

louislam, Sep 12 '22 14:09

@louislam Yeah, I'm inside a Windows domain. The domain controller is used for DNS and is running Windows Server. My Docker host is running Ubuntu; it gets the DNS IP via DHCP and I didn't change any DNS-related stuff.

ljurk, Sep 12 '22 14:09

I have a similar issue, if not the same one. My current setup has a Pi-hole operating as a DNS server where I have defined DNS entries; my Raspberry Pi has its DNS configured to go to the Pi for all DNS inquiries. This works fine in all cases to resolve a locally defined address.

I can ping the address in the uptime container and it resolves fine, but when using the name in uptime it gives me a "queryAaaa ESERVFAIL" error. But when I use the static IP address, the status monitoring of my HTTP site works fine.

dnldpavlik, Dec 09 '22 00:12

I'm seeing the same behavior.

Monitors resolving against public DNS (Cloudflare) are working fine. Monitors resolving against a private zone in my local network are all failing. The name is resolvable from inside the container via ping (using the same DNS server as configured in the monitor). Pi-hole is being used internally.

kevin7s-io, Jan 25 '23 13:01

I have noticed something: when my internal names have an even number of labels, i.e. aaa.bbb.ccc.ddd or aaa.bbb, they do not resolve, but names with an odd number of labels (aaa.bbb.ccc.ddd.eee or aaa.bbb.ccc) tend to resolve. This is not 100% consistent, but it has helped me get some items registered by DNS entry instead of IP, and I prefer the DNS entry.

dnldpavlik, Jan 25 '23 13:01

cacheable-lookup is not working properly in some cases. As of 1.19.x, the DNS cache can be disabled in Settings.

louislam, Jan 25 '23 15:01

I'm seeing the same behavior.

Monitors resolving against public DNS (Cloudflare) are working fine. Monitors resolving against a private zone in my local network are all failing. The name is resolvable from inside the container via ping (using the same DNS server as configured in the monitor). Pi-hole is being used internally.

Hi, I see that my Uptime Kuma can't monitor my Pi-hole. Both are containers on the same machine. I guess the same thing happens to me as to you, but I don't understand what you did to solve it. Can you explain it to me? Thank you.

PacmanForever, Sep 28 '23 12:09

I'm seeing the same behavior. Monitors resolving against public DNS (Cloudflare) are working fine. Monitors resolving against a private zone in my local network are all failing. The name is resolvable from inside the container via ping (using the same DNS server as configured in the monitor). Pi-hole is being used internally.

Hi, I see that my Uptime Kuma can't monitor my Pi-hole. Both are containers on the same machine. I guess the same thing happens to me as to you, but I don't understand what you did to solve it. Can you explain it to me? Thank you.

Are they in the same docker network?

louislam, Sep 28 '23 13:09

For me they are on different machines, and I am using Pi-hole for DNS. In order to get it to resolve, I originally had ui.app.domain.com; this didn't resolve for uptime, but when I changed the name to app.domain.com it worked, and logs.api.app.domain.com would also work because it has five parts: ["logs","api","app","domain","com"].

dnldpavlik, Sep 28 '23 13:09

I'm seeing the same behavior. Monitors resolving against public DNS (Cloudflare) are working fine. Monitors resolving against a private zone in my local network are all failing. The name is resolvable from inside the container via ping (using the same DNS server as configured in the monitor). Pi-hole is being used internally.

Hi, I see that my Uptime Kuma can't monitor my Pi-hole. Both are containers on the same machine. I guess the same thing happens to me as to you, but I don't understand what you did to solve it. Can you explain it to me? Thank you.

Are they in the same docker network?

No. Pi-hole: 172.18.0.7 (Docker IP); Uptime Kuma: 172.16.0.4 (Docker IP).

But I use the IP of the host (192.168.1.2), in the same way that monitors for other containers use that same IP for ping monitoring, etc.

PacmanForever, Sep 28 '23 13:09

For me they are on different machines, and I am using Pi-hole for DNS. In order to get it to resolve, I originally had ui.app.domain.com; this didn't resolve for uptime, but when I changed the name to app.domain.com it worked, and logs.api.app.domain.com would also work because it has five parts: ["logs","api","app","domain","com"].

I only use IP addresses.

PacmanForever, Sep 28 '23 13:09

I seem to have the same problem. They're also on the same Docker network (I also tried separating them, same problem). This is my configuration (I have kept 20s for the sake of testing).

burnthoney, Sep 29 '23 12:09


What we need is a CURRENT, publicly accessible (= reproducible) test case. The first part of this issue was resolved when we switched from cacheable-lookup to NSCD (Name Service Cache Daemon).

The other comments are likely unrelated to the first one. I think continuing this in smaller, less messy issues (= issues which allow reproduction) is more productive than piling onto a resolved issue => closing as resolved.

CommanderStorm, Mar 14 '24 12:03