metrics for downstream state changes and total downtime
- Program: dnsdist
- Issue type: Feature request
Short description
A new Prometheus metric: a counter of how often the status of a resolver has changed.
Usecase
For some reason we have a flapping resolver. The logs show:
Oct 21 10:48:09 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'down'
Oct 21 10:48:10 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'up'
Oct 21 10:48:17 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'down'
Oct 21 10:48:19 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'up'
Oct 21 10:48:34 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'down'
Oct 21 10:48:36 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'up'
Oct 21 10:48:57 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'down'
Oct 21 10:48:58 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'up'
Since each outage usually lasts just 1-2 seconds, it remains largely invisible when monitoring dnsdist_server_status.
We therefore propose adding two new counters to dnsdist's Prometheus metrics to make these issues visible to monitoring.
Description
Given these events:
Oct 21 10:48:09 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'down'
Oct 21 10:48:10 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'up'
Oct 21 10:48:17 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'down'
Oct 21 10:48:18 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'up'
the new metrics would contain:
dnsdist_server_status_changes_total{server="109_70_100_136:53"} 3
dnsdist_server_status_down_seconds_total{server="109_70_100_136:53"} 2
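To make the intended semantics concrete, here is a minimal Python sketch that derives both values from log lines like the ones above. It is only an illustration of the arithmetic, not how dnsdist would implement the counters. Assumptions: the first logged event for a server only establishes its baseline state and is not counted as a change (which is how four events yield 3), downtime is the time between a 'down' marking and the next 'up' marking, and the `summarize` helper, the regular expression and the `year` parameter (syslog timestamps carry no year) are hypothetical names introduced for this example.

```python
import re
from datetime import datetime

# The four sample events from the description above.
LOG_LINES = [
    "Oct 21 10:48:09 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'down'",
    "Oct 21 10:48:10 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'up'",
    "Oct 21 10:48:17 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'down'",
    "Oct 21 10:48:18 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'up'",
]

# Hypothetical pattern for the log format shown in this issue.
PATTERN = re.compile(
    r"^(\w{3} +\d+ \d\d:\d\d:\d\d) .*Marking downstream (\S+) as '(up|down)'$"
)


def summarize(lines, year=2000):
    """Return (changes, down_seconds), two dicts keyed by server address.

    `year` is an arbitrary placeholder, needed only because syslog
    timestamps do not include one.
    """
    changes = {}
    down_seconds = {}
    last = {}  # server -> (state, timestamp) of the previous logged event
    for line in lines:
        match = PATTERN.match(line)
        if match is None:
            continue  # not a status-change line
        stamp, server, state = match.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        previous = last.get(server)
        # Assumption: the first event only sets the baseline state.
        if previous is not None and previous[0] != state:
            changes[server] = changes.get(server, 0) + 1
            if previous[0] == "down":
                # Leaving 'down': add the length of the outage that just ended.
                elapsed = (when - previous[1]).total_seconds()
                down_seconds[server] = down_seconds.get(server, 0.0) + elapsed
        last[server] = (state, when)
    return changes, down_seconds


if __name__ == "__main__":
    changes, down_seconds = summarize(LOG_LINES)
    for server, count in changes.items():
        label = server.replace(".", "_")  # label style used in the example above
        print(f'dnsdist_server_status_changes_total{{server="{label}"}} {count}')
        print(
            f'dnsdist_server_status_down_seconds_total{{server="{label}"}} '
            f"{down_seconds.get(server, 0.0):g}"
        )
```

If the initial 'down' marking were also counted as a change (assuming the server started out 'up'), the first counter would be 4 instead of 3 for the same events. The downtime total is unaffected by that choice.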
That sounds like a very good idea, thanks! I have put this in the 1.9 milestone as we are (hopefully) near the first alpha release of 1.8, and I'm afraid I will not have time to actually implement that change before the first beta (after which we are in "bug fixes only" until the final release), but I will gladly merge a pull request before the beta if someone else feels up to it :)
https://github.com/PowerDNS/pdns/pull/13009 added a counter for the number of health-check failures, which should mostly cover the first need. I'll ponder the "total downtime" one.