pdns Get rid of active idle loops

Program: dnsdist
Issue type: Feature request

Short description

In some threads, there are active idle loops, for example in health checks : https://github.com/PowerDNS/pdns/blob/fd93b111a1ec5791b892adf573572b28f46b2f79/pdns/dnsdist.cc#L1824 In TCP client, there is an active loop of 500 ms.

Usecase

I have deployed dnsdist successfully on low end computers, but a lot of CPU is used (wasted?) in just doing nothing every second, in the case of health checks. I didn't take the time to inspect if something is done within TCP client loop.

Description

https://github.com/PowerDNS/pdns/pull/7142 was implemented in a way not to change the sleep interval. This should have been done with a sleep adapted to the minimal needed sleep interval. In the same way, the run method of the multiplexer has fixed timeout of 500ms, which should be configurable.

Globally, the request is about thinking of doing things when they are required, instead of regularly, just in case. The active idle loop is somewhat simpler, but does not scale nor adapts to low end computers.

Mar 16 '22 12:03 Glandos

This should have been done with a sleep adapted to the minimal needed sleep interval.

The healthcheck thread has to wake up every second to update the query load and drop rate of downstream servers, and will go back to sleep right away if there is nothing else to do but that has to be done every second.

The TCP client loop handles timeouts, cleans up the downstream TCP connections cache, and check whether a dump of the current state has been requested.

the request is about thinking of doing things when they are required, instead of regularly, just in case

That sounds good. I'm afraid it's not clear to me how you suggest doing that, though. I would love to see some actual measurements, perhaps with perf, because it might very well be that we are missing some possible optimizations but without facts that's hard to track down.

Mar 16 '22 12:03 rgacogne

Thanks for the quick reply!

The healthcheck thread has to wake up every second to update the query load and drop rate of downstream servers, and will go back to sleep right away if there is nothing else to do but that has to be done every second. I can't handle all the consequences of a code modifications here, but I'm just guessing that those metrics can be computed less frequently, with an adaptation to delta variable? Of course, this setting can be set to 1 second, for backward compatibility.

The TCP client loop handles timeouts, cleans up the downstream TCP connections cache, and check whether a dump of the current state has been requested. The same thoughts can apply here. Of course, it's important to react quickly for many people, but my own servers are limited in CPU, and I tolerate that their response time is a bit longer in exchange of a lower CPU usage.

the request is about thinking of doing things when they are required, instead of regularly, just in case

That sounds good. I'm afraid it's not clear to me how you suggest doing that, though. I would love to see some actual measurements, perhaps with perf, because it might very well be that we are missing some possible optimizations but without facts that's hard to track down.

I only have some basic figures, reported by htop:

dnsdist started 1 week ago
dnsdist/healtC thread consumed 1h02 CPU time
carbon thread consumed 13m21 CPU time
each dnsdist/tcpClien and dnsdist/dohClien thread consumed 56s CPU time
main thread consumed 27s.
other threads are not significant.

So in fact, this is mainly the health check that consumed resources, but I have to check with perf (or something similar) if it's linked to the loop or to some rarely occurring operations, like a real health check. EDIT: all my downstreams are set with checkInterval=60.

Mar 16 '22 13:03 Glandos

dnsdist/healtC thread consumed 1h02 CPU time

That seems to be a lot indeed, depending on how many downstreams you have and whether they are Do53, DoT or DoH ones. On a fairly idle setup that has been running for 8 days I have 6 minutes 42 seconds of CPU time taken by the healtC thread for a single Do53 downstream, with the default 1s checkInterval.

Mar 16 '22 14:03 rgacogne

A small improvement is available in https://github.com/PowerDNS/pdns/pull/11437 but I suspect a much better one can be reached by setting setRandomizedIdsOverUDP(true) 1 if you are running master.

Mar 21 '22 09:03 rgacogne

No, I don't have time to switch to master, but I'll be patient and don't forget this.

Mar 27 '22 11:03 Glandos

We just released 1.7.2 which should improve things significantly for your use-case! We will welcome any feedback!

Jun 14 '22 09:06 rgacogne

Indeed, the healthCheck thread has a much much lower impact now. Thanks for the fix! I'll let you close this issue. In my opinion, active loops should be avoided, but if it means to re-architecture the whole software, it can be a non-goal.

Jun 25 '22 12:06 Glandos

I'm closing this for now since most, if not all, of the active loops have been removed. Please let us know if you see something weird!

Feb 02 '24 15:02 rgacogne