DNS Timeout almost every 5 minutes.
Hello. I am running 2 servers, each with a locally installed instance (via apt on Ubuntu 22.04) of the DNS server. Everything works fine, except that every 5 minutes (almost to the dot) I get about 3 seconds of downtime.
I caught it because I am using dnsdist (from PowerDNS) in front of the DNS server instances. I tried to correlate the timeouts with the DNS server log, but nothing really matches. The weird thing is that it happens on both of my 2 instances, not at the same time, but the delay is about the same. The DNS servers each handle about 60M queries per day. I have the exact same setup in my lab, where there are a lot fewer queries, and I never saw that kind of timeout.
At first I thought it might be a dnsdist problem of some sort, but I was able to get the timeout when querying the DNS server every second: the queries would time out for 3 seconds, then simply resume normally for about 5 minutes, then time out again, and so on.
As you can see in my syslog (dns1 is the server and 10.24.24.24:538 is the ip:port I use for receiving queries from the DNS proxy):
Nov 14 13:33:02 dns1 dnsdist[1393585]: Marking downstream 10.24.24.24:538 as 'down'
Nov 14 13:33:05 dns1 dnsdist[1393585]: Marking downstream 10.24.24.24:538 as 'up'
Nov 14 13:38:08 dns1 dnsdist[1393585]: Marking downstream 10.24.24.24:538 as 'down'
Nov 14 13:38:12 dns1 dnsdist[1393585]: Marking downstream 10.24.24.24:538 as 'up'
Nov 14 13:43:15 dns1 dnsdist[1393585]: Marking downstream 10.24.24.24:538 as 'down'
Nov 14 13:43:18 dns1 dnsdist[1393585]: Marking downstream 10.24.24.24:538 as 'up'
Nov 14 13:48:21 dns1 dnsdist[1393585]: Marking downstream 10.24.24.24:538 as 'down'
Nov 14 13:48:24 dns1 dnsdist[1393585]: Marking downstream 10.24.24.24:538 as 'up'
Nov 14 13:53:27 dns1 dnsdist[1393585]: Marking downstream 10.24.24.24:538 as 'down'
Nov 14 13:53:30 dns1 dnsdist[1393585]: Marking downstream 10.24.24.24:538 as 'up'
Nov 14 13:58:34 dns1 dnsdist[1393585]: Marking downstream 10.24.24.24:538 as 'down'
Nov 14 13:58:37 dns1 dnsdist[1393585]: Marking downstream 10.24.24.24:538 as 'up'
Nov 14 14:03:40 dns1 dnsdist[1393585]: Marking downstream 10.24.24.24:538 as 'down'
Nov 14 14:03:43 dns1 dnsdist[1393585]: Marking downstream 10.24.24.24:538 as 'up'
Nov 14 14:08:46 dns1 dnsdist[1393585]: Marking downstream 10.24.24.24:538 as 'down'
Nov 14 14:08:49 dns1 dnsdist[1393585]: Marking downstream 10.24.24.24:538 as 'up'
Nov 14 14:13:52 dns1 dnsdist[1393585]: Marking downstream 10.24.24.24:538 as 'down'
Nov 14 14:13:55 dns1 dnsdist[1393585]: Marking downstream 10.24.24.24:538 as 'up'
Nov 14 14:18:58 dns1 dnsdist[1393585]: Marking downstream 10.24.24.24:538 as 'down'
Nov 14 14:19:01 dns1 dnsdist[1393585]: Marking downstream 10.24.24.24:538 as 'up'
Nov 14 14:24:04 dns1 dnsdist[1393585]: Marking downstream 10.24.24.24:538 as 'down'
Nov 14 14:24:07 dns1 dnsdist[1393585]: Marking downstream 10.24.24.24:538 as 'up'
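To put numbers on "every 5 minutes" and "3 seconds", the log entries above can be reduced to outage lengths and gaps between outages. The following is only a rough sketch: it embeds the first six log entries, assumes 'down' and 'up' events strictly alternate starting with 'down', and prepends an arbitrary year (2000, an assumption) because syslog timestamps carry none.

```python
import re
from datetime import datetime

# First six entries from the syslog excerpt in the post.
LOG = """\
Nov 14 13:33:02 dns1 dnsdist[1393585]: Marking downstream 10.24.24.24:538 as 'down'
Nov 14 13:33:05 dns1 dnsdist[1393585]: Marking downstream 10.24.24.24:538 as 'up'
Nov 14 13:38:08 dns1 dnsdist[1393585]: Marking downstream 10.24.24.24:538 as 'down'
Nov 14 13:38:12 dns1 dnsdist[1393585]: Marking downstream 10.24.24.24:538 as 'up'
Nov 14 13:43:15 dns1 dnsdist[1393585]: Marking downstream 10.24.24.24:538 as 'down'
Nov 14 13:43:18 dns1 dnsdist[1393585]: Marking downstream 10.24.24.24:538 as 'up'
"""

PATTERN = re.compile(
    r"^(\w{3} +\d+ \d\d:\d\d:\d\d) .*Marking downstream \S+ as '(up|down)'"
)


def parse_events(text, year=2000):
    """Return (timestamp, state) tuples; `year` is an arbitrary placeholder."""
    events = []
    for line in text.splitlines():
        match = PATTERN.match(line)
        if match:
            stamp = datetime.strptime(
                f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S"
            )
            events.append((stamp, match.group(2)))
    return events


def summarize(events):
    """Return (outage lengths, gaps between consecutive 'down' events) in seconds."""
    downs = [t for t, state in events if state == "down"]
    ups = [t for t, state in events if state == "up"]
    # Assumes each 'down' is followed by exactly one 'up'.
    outages = [int((u - d).total_seconds()) for d, u in zip(downs, ups)]
    gaps = [int((b - a).total_seconds()) for a, b in zip(downs, downs[1:])]
    return outages, gaps


outages, gaps = summarize(parse_events(LOG))
print("outage lengths (s):", outages)
print("seconds between 'down' events:", gaps)
```

For these six entries it reports outages of 3, 4 and 3 seconds, with 306 and 307 seconds between consecutive 'down' events.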
The service is not restarting. The only problem is that when a client query lands in the seconds during which a server is timing out, dnsdist might not have sent the query to the other, still-alive server, and at some random moment both of my servers could be timing out at the same time.
I have not tried the new version 14. I'm still on Version 13.6.
I know that Technitium DNS is made to handle a lot of queries, so I am pretty sure it's not a load issue, as the problem even happens during the night, when the number of queries is much lower.
Anyone seen this timeout issue (maybe not with dnsdist)?
Thanks!
Thanks for the post. Please share a screenshot of the DNS server's Dashboard and that of the Settings > Cache section either here or to [email protected].
Hello @ShreyasZare !
Thanks for the quick answer, and I'm sorry that it took me so long to reply.
This is today's dashboard
The cache settings page:
I tried changing the values that are set to 5 minutes just to see if it would affect that "downtime", but it did not change anything.
Thanks!
Hello,
is this issue fixed in 14.2?
@Z3r0Dayz404Qc Thanks for the details. The settings look fine and I don't see any issue with them. It's not clear what could be the issue here. I would suggest that you try the DNS Client tab on the admin panel when the issue triggers, to test whether the DNS server is at least responding to localhost ("This Server") queries.
If localhost queries are working, it could be some issue with the network. I have seen similar issues a couple of times that were caused by an IP conflict, where queries go to a different device for a while until the DNS server's OS issues an ARP lookup, which fixes it temporarily. Do check whether any device on the network has the same IP statically configured. Running Wireshark/tcpdump on the network should help reveal this by observing ARP broadcast packets.
Settings -> General -> Max Concurrent Resolutions
Check the value and do not use the default setting.
In my opinion, this value is too low for bulk processing.
I believe the default value should be set to a higher value.
@ywlee03 This value really is hardware dependent, so a value that works for one server won't necessarily work for another. Setting the number too high would cause issues if there are too many recursive resolution requests. Increasing the number of concurrent async tasks can cause a thread-pool-exhaustion kind of situation where the CPU is too busy with too many tasks at hand. In such a case, a bulk of tasks will fail with timeout errors despite a DNS response being available to be read from the underlying socket. The reason is that DNS queries have a timeout of 1.5 x 2 seconds, and if the task gets executed later than that, it triggers a timeout exception instead.
@ywlee03 I have 4 CPU cores on each of my 2 VMs, and I had set the value you are talking about to 300 instead of the default 100. What is very weird about my problem is that the 3-second timeout happens at almost exactly 5-minute intervals; sometimes it skips a few extra minutes, but it's very consistent. I am trying to lower this value to 200 just to see what happens, but just while writing these lines, I already saw my timeout happen again.
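The effect described above can be illustrated with a toy model. This is not how the DNS server actually schedules work: the FIFO queue, the per-task cost of 50 ms and the function name are all hypothetical, chosen only to show why a higher concurrency limit can turn available responses into timeouts. The 4 workers mirror the 4 CPU cores mentioned in this thread, and 3000 ms stands for the 1.5 x 2 seconds timeout.

```python
def count_late_tasks(num_tasks: int, workers: int, cost_ms: int, timeout_ms: int) -> int:
    """Count tasks whose turn comes only after their timeout has already passed.

    Toy model: all tasks are queued at t=0, each occupies a worker for cost_ms,
    and workers pick tasks up in FIFO order.
    """
    late = 0
    for i in range(num_tasks):
        start_ms = (i // workers) * cost_ms  # time at which task i first gets a worker
        if start_ms > timeout_ms:
            late += 1  # the response may be sitting in the socket, but the deadline is gone
    return late


TIMEOUT_MS = 3000  # 1.5 x 2 seconds, as stated in the post above

for limit in (100, 300, 1000):
    late = count_late_tasks(limit, workers=4, cost_ms=50, timeout_ms=TIMEOUT_MS)
    print(f"{limit} concurrent tasks -> {late} would time out")
```

Under these made-up numbers, 100 tasks all get a worker in time, while 56 of 300 and 756 of 1000 are picked up only after their deadline.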
To answer @ShreyasZare about the network issue and the test with the local client: the problem is that it only lasts 3 seconds, and my dnsdist monitoring polls a locally hosted zone every second, and that check fails, hence giving me the "dns is down" error seen in the syslog.
I would suggest that you run another monitoring tool which checks the server using ICMP ECHO (ping). This will tell you whether it is the DNS service or the server itself that is not responding.
When I did some troubleshooting on both of the DNS servers where this issue happens, I was logged in over SSH running tail -f /var/log/syslog, and I never lost the connection. Since I suspected dnsdist of being the problem, I also wrote a batch script that uses dig every 0.5 seconds to query the same locally hosted entry that dnsdist queries, and dig also detected the timeout.
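For anyone wanting to repeat this test without dig, a probe along the following lines should behave similarly. It is a minimal sketch using only the Python standard library, not the script from the post: it sends a plain A-record query over UDP, does not validate the reply (not even the query ID), and the function names are made up for illustration.

```python
import socket
import struct
import time


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (only RD set), 1 question, 0 answer/authority/additional.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def probe_once(server: str, port: int, name: str, timeout: float = 1.0):
    """Return the round-trip time in seconds, or None if no reply arrived in time."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(build_query(name), (server, port))
            sock.recvfrom(4096)  # any datagram counts as a reply in this sketch
        except OSError:  # covers timeouts and ICMP "port unreachable" errors
            return None
        return time.monotonic() - start


def run_probe(server: str, port: int, name: str, interval: float = 0.5, count: int = 20):
    """Probe `count` times, sleeping `interval` seconds between probes, and print each result."""
    for _ in range(count):
        rtt = probe_once(server, port, name)
        stamp = time.strftime("%H:%M:%S")
        print(stamp, "TIMEOUT" if rtt is None else f"{rtt * 1000:.1f} ms")
        time.sleep(interval)
```

Something like `run_probe("10.24.24.24", 538, "host.example.lan", count=1200)` would cover about ten minutes; the address and port are the ones from the syslog excerpt, and the record name is a placeholder for whatever locally hosted entry dnsdist checks.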
Thanks for the info. So it is not a network issue, since the SSH session keeps working. What uptime do you see for the DNS server in the About section on the web panel? If it's roughly 5 minutes, then the DNS service may be restarting.
Do you see any errors being logged in the Logs tab on the web panel?
I'm at 5 months and 3 months uptime now (I had restarted one of the 2 to test something at one point).
Thanks for the details. I am not sure what could be the issue here since it's not being reproduced anywhere else. You are seeing the same issue with both your servers, so it seems to be due to something that is common to both installations.
I would suggest that you get another instance deployed as a secondary just to see if it shows the same issue. But, make sure to have this instance configured to run only the DNS server and do not use the same server config you currently have. It would be best to have the latest Ubuntu Server release and install the DNS server using the official install script.