blackbox_exporter
SLI/SLO friendly metrics
Building and maintaining software nowadays requires companies to look after hundreds or thousands of VMs running across multiple regions, and a reliable network connection is vital. To maintain that scale with small SRE teams, it is critical to find a balance between reliability and feature velocity. SLIs and SLOs have become a popular way for SRE teams to strike that balance.
Blackbox-exporter is the most common tool for monitoring such things. However, the metrics it exposes are not friendly to SLIs/SLOs:
- SLIs for success rates are built with counters
- SLIs for latencies are built with histograms
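For illustration, this is roughly what SLI queries over counters and histograms look like elsewhere in the Prometheus ecosystem (the metric names here are hypothetical, since blackbox_exporter does not currently expose counters or histogram buckets):

```promql
# Success-rate SLI from counters (hypothetical metric names)
sum(rate(probe_success_total[5m])) / sum(rate(probe_attempts_total[5m]))

# 99th-percentile latency SLI from histogram buckets (hypothetical metric name)
histogram_quantile(0.99, sum by (le) (rate(probe_duration_seconds_bucket[5m])))
```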
SLOs/SLIs have been around for quite a while, so I would not be surprised if this discussion has already happened. I tried to find issues/PRs in this repository but failed to find a reason why counters and histograms are not exposed yet 🤔
I know smokeping_prober exists, but I'm wondering whether the Prometheus team would accept this change so we can keep using existing tools (e.g. the Prober CRD from Prometheus-Operator).
I'd be very happy with getting counters, and even support for histograms. While we could build some queries around alerting on gauges and uptime-based SLOs, having counters for request-based SLOs would be my preferred way going forward. This has been discussed previously in https://github.com/pyrra-dev/pyrra/issues/46.
I assume that the problem with histograms is the default buckets not fitting all use cases. Maybe it's possible to support configurable buckets like smokeping_prober does as a stopgap. The real solution seems to be sparse histograms.
Given that blackbox_exporter is stateless, it is hard for it to keep a count of anything. Could you expand a bit on what you had in mind? I think some of this could be better served by giving people example recording rules that can be plugged into the various SLO frameworks.
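As a sketch of what such example recording rules might look like (the rule names and the `job="blackbox"` label are assumptions for illustration, not anything shipped with blackbox_exporter):

```yaml
groups:
  - name: blackbox-sli
    rules:
      # Availability SLI: fraction of successful probes over the last 5m
      - record: job:probe_success:avg5m
        expr: avg_over_time(probe_success{job="blackbox"}[5m])
      # Latency SLI: 99th-percentile probe duration over the last 5m
      - record: job:probe_duration_seconds:p99_5m
        expr: quantile_over_time(0.99, probe_duration_seconds{job="blackbox"}[5m])
```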
> Given blackbox_exporter is stateless, it is hard for it to keep a count of anything
Ah, good point. You mean that blackbox-exporter doesn't have a metrics registry? If that's the case, are there any known problems that prevent us from adding a registry?
> Could you expand a bit on what you had in mind? I think some of this could be better served by giving people example recording rules which can be plugged in the various SLO frameworks.
The use case I have at the moment is that I'm providing ephemeral VMs for devs to do their daily work, and I need to make sure that the internet connection is stable enough for the large majority. I could alert individually per VM, but I'd really like to just follow the SLO approach instead 😅
> Ah, good point. You mean that blackbox-exporter doesn't have a metrics registry? If that's the case, are there any known problems that prevent us from adding a registry?
There is a registry, but the probing is done via the multi-target exporter pattern, so the problem is that exporting the metrics via anything other than the scrape being requested would break assumptions (e.g. imagine adding success and failure metrics on /metrics, but then you point two Prometheus instances at the prober, or someone runs a probe manually to debug things -- now the metrics are meaningless).
> The use case I have at the moment is that I'm providing ephemeral VMs for devs to do their daily work, and I need to make sure that the internet connection is stable enough for the large majority. I could alert individually per VM, but I'd really like to just follow the SLO approach instead 😅
I don't think that needs anything special from blackbox_exporter; you can make an SLI out of something like `sum_over_time(probe_success{job="probe-vm"}[1h]) / count_over_time(up{job="probe-vm"}[1h])`
> There is a registry, but the probing is done via the multi-target exporter pattern, so the problem is that exporting the metrics via anything other than the scrape being requested would break assumptions (e.g. imagine adding success and failure metrics on /metrics, but then you point two Prometheus instances at the prober, or someone runs a probe manually to debug things -- now the metrics are meaningless).
Aaaah, yes, of course. Indeed, it looks like the design is not appropriate for counters and histograms.
> I don't think that needs anything special from blackbox_exporter, you can make an SLI out of something like:
> `sum_over_time(probe_success{job="probe-vm"}[1h]) / count_over_time(up{job="probe-vm"}[1h])`
Thanks! Counters and histograms would definitely make the query more readable and performant, but your suggestion may do the trick 🙂.
I'll try using it for a couple of weeks and come back if I find any weird behavior.
Why not use `sum_over_time(probe_success{job="probe-vm"}[1h]) / count_over_time(probe_success{job="probe-vm"}[1h])`?
Counters are unnecessary for generating SLIs/SLOs with the blackbox exporter. These can be produced with PromQL.
- Availability (success rate) is easily produced with `avg_over_time(probe_success[<SLO window>])`.
- Latency is easily produced with `quantile_over_time(0.99, probe_duration_seconds[<SLO window>])`.
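For example, with a 30-day SLO window and a hypothetical `job="blackbox"` label, those expressions would read:

```promql
# 30-day availability SLI
avg_over_time(probe_success{job="blackbox"}[30d])

# 30-day 99th-percentile latency SLI
quantile_over_time(0.99, probe_duration_seconds{job="blackbox"}[30d])
```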
The reason the smokeping_prober produces its own counters/histograms is that it probes at a much higher frequency than you would typically use for a Prometheus scrape interval. It also allows for overlapping probes due to the way it's implemented.
We're discussing adding additional prober support to the underlying library prometheus-community/pro-bing. That way the smokeping_prober could also do HTTP tests.
For the blackbox_exporter, I think this request is out of scope.
Thanks! This is helpful for me.
The problem with quantile_over_time is that it performs very slowly compared to the other functions you mentioned.
Histogram support would probably help a ton with that.