pdns metrics for downstream state changes and total downtime

Program: dnsdist
Issue type: Feature request

Short description

New Prometheus metrics: a counter of how often the status of a resolver changed, and a counter of its total downtime.

Use case

For some reason we have a flapping resolver. The logs show:

Oct 21 10:48:09 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'down'
Oct 21 10:48:10 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'up'
Oct 21 10:48:17 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'down'
Oct 21 10:48:19 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'up'
Oct 21 10:48:34 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'down'
Oct 21 10:48:36 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'up'
Oct 21 10:48:57 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'down'
Oct 21 10:48:58 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'up'

Since each outage usually lasts just 1-2 seconds, it remains largely invisible when monitoring dnsdist_server_status. We therefore propose adding two new counters to dnsdist's Prometheus metrics to make these issues visible to monitoring.
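To illustrate why a periodically scraped status gauge can miss these flaps, here is a rough Python sketch. It replays the down intervals from the log excerpt above and samples the status at fixed scrape times. The 15-second scrape interval and its alignment to 10:48:00 are assumptions for illustration, not our actual Prometheus configuration, and `status_at` / `scrape` are hypothetical helper names.

```python
# Down intervals taken from the log excerpt, in seconds after 10:48:00.
# Each tuple is (marked 'down', marked 'up').
DOWN_INTERVALS = [(9, 10), (17, 19), (34, 36), (57, 58)]


def status_at(t, down_intervals=DOWN_INTERVALS):
    """Return 0 if the server is marked down at second t, else 1 (up)."""
    return 0 if any(start <= t < end for start, end in down_intervals) else 1


def scrape(interval, offset=0, until=60):
    """Sample the status gauge every `interval` seconds (assumed scrape schedule)."""
    return [(t, status_at(t)) for t in range(offset, until + 1, interval)]


if __name__ == "__main__":
    # With an assumed 15s interval starting at 10:48:00, every sample reads 'up'.
    for t, status in scrape(15):
        print(f"10:48:00 + {t:2d}s -> dnsdist_server_status = {status}")
```

With a different alignment a scrape may happen to land inside one of the short down windows, but that is a matter of luck. A counter that only ever increases does not depend on scrape timing.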

Description

Given these events:

Oct 21 10:48:09 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'down'
Oct 21 10:48:10 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'up'
Oct 21 10:48:17 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'down'
Oct 21 10:48:18 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'up'

the new metrics would contain:

dnsdist_server_status_changes_total{server="109_70_100_136:53"} 4
dnsdist_server_status_down_seconds_total{server="109_70_100_136:53"} 2
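As a rough sketch of the intended semantics (not dnsdist code), the following Python snippet derives both values from the four events listed above. It counts every logged transition as one status change and assumes the server was 'up' before the first event. The `summarize` function and the `EVENTS` list are hypothetical names used only for this illustration.

```python
from datetime import datetime

# The four events from the description: (time of day, new status).
EVENTS = [
    ("10:48:09", "down"),
    ("10:48:10", "up"),
    ("10:48:17", "down"),
    ("10:48:18", "up"),
]


def summarize(events, initial_status="up"):
    """Return (number of status changes, total seconds spent down)."""
    changes = 0
    down_seconds = 0.0
    status = initial_status  # assumption: server was up before the first event
    down_since = None
    for stamp, new_status in events:
        when = datetime.strptime(stamp, "%H:%M:%S")
        if new_status == status:
            continue  # not a transition, nothing to count
        changes += 1
        if new_status == "down":
            down_since = when
        elif down_since is not None:
            down_seconds += (when - down_since).total_seconds()
            down_since = None
        status = new_status
    return changes, down_seconds


if __name__ == "__main__":
    changes, down_seconds = summarize(EVENTS)
    print("status changes:", changes)
    print("down seconds:", down_seconds)
```

For the events above, counting each transition this way gives 4 status changes and 2 seconds of downtime (1 second per outage).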

Oct 21 '22 09:10 appliedprivacy

That sounds like a very good idea, thanks! I have put this in the 1.9 milestone as we are (hopefully) near the first alpha release of 1.8, and I'm afraid I will not have time to actually implement that change before the first beta (after which we are in "bug fixes only" mode until the final release). I will gladly merge a pull request before the beta if someone else feels up to it :)

Oct 21 '22 09:10 rgacogne

https://github.com/PowerDNS/pdns/pull/13009 added a counter for the number of health-check failures, which should mostly cover the first need. I'll ponder the "total downtime" one.

Aug 14 '23 12:08 rgacogne