
SLI/SLO friendly metrics

Open ArthurSens opened this issue 3 years ago • 10 comments

Building and maintaining software nowadays requires companies to look after hundreds or thousands of VMs running across multiple regions, where a reliable network connection is vital. To maintain that scale with small SRE teams, it is critical to find the balance between reliability and feature velocity, and SLIs and SLOs have become quite popular among SRE teams as a way to find that balance.

Blackbox-exporter is the most common tool for this kind of monitoring. However, the metrics it exposes are not SLI/SLO friendly (see the PromQL sketch after the list below):

  • SLIs for success rates are built with counters
  • SLIs for latencies are built with histograms
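
For illustration, this is roughly what counter/histogram-based SLIs look like in PromQL (the metric names below are generic examples, not metrics blackbox_exporter exposes today):

    # success-rate SLI built from a counter
    sum(rate(http_requests_total{code!~"5.."}[5m]))
      /
    sum(rate(http_requests_total[5m]))

    # latency SLI built from a histogram
    histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))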

ArthurSens avatar Jun 13 '22 12:06 ArthurSens

SLOs/SLIs have been a thing for quite a while, so I would not be surprised if this discussion has already happened. I tried to find issues/PRs in this repository but failed to find a reason why counters and histograms are not exposed yet 🤔

I know smokeping_prober exists, but I'm wondering whether the Prometheus team would accept this change so we can keep using existing tools (e.g. the Probe CRD from Prometheus-Operator).

ArthurSens avatar Jun 13 '22 12:06 ArthurSens

I'd be very happy to get counters and even support for histograms. While we could build some queries for alerting on gauges and uptime-based SLOs, having counters to do request-based SLOs would be my preferred way going forward. This has been discussed previously in https://github.com/pyrra-dev/pyrra/issues/46.

I assume the problem with histograms is that the default buckets don't fit all use cases. Maybe it's possible to support configurable buckets, like smokeping_prober does, as a stopgap. The real solution seems to be sparse histograms.
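
Purely as an illustration of that stopgap, a hypothetical blackbox.yml knob could look like the sketch below; duration_histogram_buckets does not exist today, while the other fields are regular module config:

    modules:
      http_2xx:
        prober: http
        timeout: 5s
        http:
          preferred_ip_protocol: ip4
        # hypothetical option, analogous to smokeping_prober's configurable buckets
        duration_histogram_buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5]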

metalmatze avatar Jun 13 '22 12:06 metalmatze

Given blackbox_exporter is stateless, it is hard for it to keep a count of anything. Could you expand a bit on what you had in mind? I think some of this could be better served by giving people example recording rules which can be plugged into the various SLO frameworks.
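
For example, a minimal recording-rule sketch along those lines (the rule names, job label, and 1h window are placeholders):

    groups:
      - name: blackbox-sli
        rules:
          - record: instance:probe_success:avg_1h
            expr: avg_over_time(probe_success{job="blackbox"}[1h])
          - record: instance:probe_duration_seconds:p99_1h
            expr: quantile_over_time(0.99, probe_duration_seconds{job="blackbox"}[1h])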

dgl avatar Jun 13 '22 13:06 dgl

Given blackbox_exporter is stateless, it is hard for it to keep a count of anything

Ah, good point. You mean that blackbox-exporter doesn't have a metrics registry? If that's the case, are there any known problems that would prevent us from adding one?

Could you expand a bit on what you had in mind? I think some of this could be better served by giving people example recording rules which can be plugged into the various SLO frameworks.

The use-case I have at the moment is that I'm providing ephemeral VMs for devs to do their daily work, and I need to make sure that the internet connection is stable enough for the large majority. I could alert individually by VM, but I'd really like to just follow the SLO approach instead 😅

ArthurSens avatar Jun 13 '22 13:06 ArthurSens

Ah, good point. You mean that blackbox-exporter doesn't have a metrics registry? If that's the case, are there any known problems that would prevent us from adding one?

There is a registry, but the probing is done via a multi-target exporter pattern -- so the problem is that exporting the metrics via anything other than the scrape being requested would break assumptions (e.g. imagine adding a success and failure metric on /metrics -- but then you point two Prometheus instances at the prober, or someone runs a probe manually to debug things -- now the metrics are meaningless).
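
For context, the usual wiring is the multi-target pattern sketched below, where each Prometheus scrape of /probe triggers a fresh probe and only sees that probe's metrics (target and port are illustrative):

    scrape_configs:
      - job_name: blackbox
        metrics_path: /probe
        params:
          module: [http_2xx]
        static_configs:
          - targets:
              - https://example.com
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: 127.0.0.1:9115  # blackbox_exporter itself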

The use-case I have at the moment is that I'm providing ephemeral VMs for devs to do their daily work, and I need to make sure that the internet connection is stable enough for the large majority. I could alert individually by VM, but I'd really like to just follow the SLO approach instead 😅

I don't think that needs anything special from blackbox_exporter, you can make an SLI out of something like: sum_over_time(probe_success{job="probe-vm"}[1h]) / count_over_time(up{job="probe-vm"}[1h])
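
As a sketch, that expression drops straight into an alerting rule (the 99% target, labels, and durations are placeholders):

    groups:
      - name: probe-vm-slo
        rules:
          - alert: ProbeAvailabilityBelowTarget
            expr: |
              sum_over_time(probe_success{job="probe-vm"}[1h])
                / count_over_time(up{job="probe-vm"}[1h]) < 0.99
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: Probe availability is below 99% over the last hour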

dgl avatar Jun 13 '22 14:06 dgl

There is a registry, but the probing is done via a multi-target exporter pattern -- so the problem is that exporting the metrics via anything other than the scrape being requested would break assumptions (e.g. imagine adding a success and failure metric on /metrics -- but then you point two Prometheus instances at the prober, or someone runs a probe manually to debug things -- now the metrics are meaningless).

Aaaah yes of course. Yep, indeed it looks like the design is not appropriate for counters and histograms.

I don't think that needs anything special from blackbox_exporter, you can make an SLI out of something like: sum_over_time(probe_success{job="probe-vm"}[1h]) / count_over_time(up{job="probe-vm"}[1h])

Thanks! Counters and histograms would definitely make the query more readable and performant, but your suggestion may do the trick 🙂.

I'll try using it for a couple of weeks and come back if I find any weird behavior.

ArthurSens avatar Jun 13 '22 15:06 ArthurSens

I don't think that needs anything special from blackbox_exporter, you can make an SLI out of something like: sum_over_time(probe_success{job="probe-vm"}[1h]) / count_over_time(up{job="probe-vm"}[1h])

Why not use: sum_over_time(probe_success{job="probe-vm"}[1h]) / count_over_time(probe_success{job="probe-vm"}[1h]) instead?

wangzhoufei111 avatar Nov 28 '22 06:11 wangzhoufei111

Counters are unnecessary for generating SLIs/SLOs with the blackbox exporter. These can be produced with PromQL.

  • Availability (success rate) is easily produced with avg_over_time(probe_success[<SLO window>]).
  • Latency is easily produced with quantile_over_time(0.99, probe_duration_seconds[<SLO window>]).

The reason smokeping_prober produces its own counters/histograms is that it probes at a frequency much higher than you would typically use for a Prometheus scrape interval. It also allows for overlapping probes, due to the way it's implemented.

We're discussing adding additional prober support to the underlying library prometheus-community/pro-bing. That way the smokeping_prober could also do HTTP tests.

For the blackbox_exporter, I think this request is out of scope.

SuperQ avatar Nov 28 '22 17:11 SuperQ

Thanks! This is helpful for me.

wangzhoufei111 avatar Nov 29 '22 03:11 wangzhoufei111

The problem with quantile_over_time is that it performs very slowly compared to the other functions you mentioned.

Histogram support would probably help a ton here.
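
For comparison, a sketch of the two shapes of the query; probe_duration_seconds_bucket is hypothetical, since blackbox_exporter does not expose such a histogram today:

    # today: quantile over raw gauge samples, expensive over long windows
    quantile_over_time(0.99, probe_duration_seconds{job="blackbox"}[30d])

    # with hypothetical histogram support: bucket counters could be aggregated
    # across targets and pre-recorded over short windows
    histogram_quantile(0.99, sum by (le) (rate(probe_duration_seconds_bucket{job="blackbox"}[30d])))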

wjarka avatar Feb 10 '24 12:02 wjarka