
Implement `probe_failure_info` Metric

Open slrtbtfs opened this pull request 11 months ago • 1 comment

Fixes #1077.

This makes information about probe failures, which is already logged, additionally available via the `probe_failure_info` metric.

The main change is a new type, ProbeResult, which replaces the boolean return type of the probe functions. For failed probes this type includes diagnostic information, which is then published in the probe_failure_info metric. If an error message contains information that is very large or may change frequently, leading to high cardinality, it is not included in this metric and is only logged instead.
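For illustration, here is a minimal sketch of what such a result type and probe signature could look like. The field names, the omission of the module config parameter, and the use of slog are assumptions made to keep the example self-contained; this is not the actual code from the diff.

```go
// Hypothetical sketch of the result type described above, not the PR's code.
package prober

import (
	"context"
	"log/slog"

	"github.com/prometheus/client_golang/prometheus"
)

// ProbeResult replaces the plain bool that probe functions used to return.
type ProbeResult struct {
	// Success mirrors the old boolean return value.
	Success bool

	// FailureReason is a short, low-cardinality description of why the probe
	// failed; large or frequently changing details stay in the logs only.
	FailureReason string
}

// ProbeFn shows roughly how a probe function signature changes from returning
// bool to returning ProbeResult (the module config parameter is omitted here
// to keep the sketch self-contained).
type ProbeFn func(ctx context.Context, target string, registry *prometheus.Registry, logger *slog.Logger) ProbeResult
```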

The underlying mechanisms are mainly implemented in prober/prober.go and prober/handler.go; the rest is mostly refactoring. Since this new type is used in almost every probe function and their tests, the diff got quite big. I hope that doesn't make the review too cumbersome.

As a side effect, most tests now check not only whether a probe succeeded, but also whether a failed probe failed with the correct error message. Having the error message in the return type also made it possible to remove some rather complex testing code that used to extract this information from the prober logs.
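As a rough illustration of that testing pattern, here is a self-contained sketch that reuses the hypothetical ProbeResult from above; the checkStatusCode helper and its error message are invented for the example and are not helpers from this PR.

```go
// Illustrative only: checkStatusCode is an invented stand-in for a real check.
package prober

import "testing"

// checkStatusCode simulates a small probe check so the test below is runnable.
func checkStatusCode(code int) ProbeResult {
	if code >= 200 && code < 300 {
		return ProbeResult{Success: true}
	}
	return ProbeResult{Success: false, FailureReason: "unexpected HTTP status code"}
}

// With the error information in the return value, the test asserts not only
// that the check failed, but that it failed for the expected reason.
func TestCheckStatusCodeFailureReason(t *testing.T) {
	result := checkStatusCode(503)

	if result.Success {
		t.Fatal("expected check to fail")
	}
	if got, want := result.FailureReason, "unexpected HTTP status code"; got != want {
		t.Errorf("unexpected failure reason: got %q, want %q", got, want)
	}
}
```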

~~Before merging, this should still get some improved tests and polishing; however, I'd like to get some early feedback on whether you agree with the implementation approach in general.~~

There is still room for improvement, as some error messages provide only limited detail (e.g. the generic "HTTP request failed" one). IMO that makes most sense as a follow-up once the basic implementation of this metric is merged.

slrtbtfs · Dec 12 '24 16:12

I've finally got around to finishing up on this PR :tada:

Apologies in advance to the reviewers for the size.

The main reason it got that big is that the new metric tries to capture error messages for every failure path in the code base. Additionally, all tests have been augmented with checks that they fail for the right reason. This might also make it easier to spot future regressions entirely unrelated to this PR.

slrtbtfs · Apr 28 '25 14:04

I see that the probe_failure_info metric is trying to cover something that should be covered by logs, and because of that I don't see a strong need to have it. Metrics and logs are suited to different things, and this feels like an area where logs are better suited than an info metric.

What is the use case this metric is trying to solve, and why can't that use case be solved by logs? If we are missing logs for some failures, I would be happy to accept a PR that improves failure logging.

electron0zero · Oct 29 '25 13:10

Hi, thanks for looking into this!

The use case we have for this is the following:

We use one Blackbox instance to monitor targets run by many different customers.

When a probe fails, we send out an alert using Prometheus + Alertmanager to the specific entity responsible for that particular Service.

Currently, the available metrics only offer scarce and incomplete information about why a probe failed, so there is AFAICT no easy way to include such information in a Prometheus generated alert.

Especially in cases where failures are not consistent, the recipient of the alert needs to create a ticket, and someone with access to the logs (which cover all monitoring targets and shouldn't be available to end users responsible for only a single target) needs to search for the logs correlated with that particular alert.

This is further complicated by the fact that failure reasons for probes are only visible with debug logging activated, which generates so many logs that longer retention isn't practical.

Overall, delivering the information contained in the logs to the recipient of a Prometheus alert is a complicated manual process, involving communication between multiple people across teams and organizations.

If there were a metric containing some basic failure reasons, all of this could be automated away trivially.

slrtbtfs · Oct 29 '25 13:10

> This is further complicated by the fact that failure reasons for probes are only visible with debug logging activated, which generates so many logs that longer retention isn't practical.

We made some changes in the way the logger works in PR https://github.com/prometheus/blackbox_exporter/pull/1461, and also updated the log levels of some log lines.

Please run a build from master and give it a try.

If you notice probe error logs that are logged at debug or info level but should be logged at warn or error level, please send a PR that fixes the log level or adds missing logs.
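To make that concrete, such a contribution usually boils down to a change like the following sketch; the message, attribute, and handler setup are invented and assume slog-style logging, they are not actual blackbox_exporter log lines.

```go
// Illustrative only: invented log line showing a debug-to-error level fix.
package main

import (
	"errors"
	"log/slog"
	"os"
)

func main() {
	// Default text handler: debug records are dropped, info and above are kept.
	logger := slog.New(slog.NewTextHandler(os.Stderr, nil))
	err := errors.New("context deadline exceeded")

	// Before such a fix: the failure is only visible with debug logging enabled.
	logger.Debug("Probe failed", "reason", err)

	// After the fix: the same information is visible at the default log level.
	logger.Error("Probe failed", "reason", err)
}
```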

> Overall, delivering the information contained in the logs to the recipient of a Prometheus alert is a complicated manual process, involving communication between multiple people across teams and organizations.

We allow getting logs, and even added support for fetching logs by target in PR https://github.com/prometheus/blackbox_exporter/pull/1063, so users can attach direct links to logs in alerts.

You can use that endpoint or another source to enrich the alert by adding more data to its annotations, for example links to view the logs, or by attaching the logs directly to the alert payload.
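As a sketch of that enrichment step, something along these lines could run in an alert webhook receiver. The /logs path with a target query parameter, the exporter address, and the target value are assumptions based on the description above, so please check them against your blackbox_exporter version.

```go
// Sketch of fetching probe logs for a target to enrich an alert notification.
// The endpoint path and query parameter are assumptions, not verified API.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

// fetchProbeLogs retrieves the most recent probe logs for the given target
// from a blackbox_exporter instance, assuming a /logs?target=... endpoint.
func fetchProbeLogs(exporterURL, target string) (string, error) {
	u := fmt.Sprintf("%s/logs?target=%s", exporterURL, url.QueryEscape(target))

	resp, err := http.Get(u)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("fetching probe logs: unexpected status %s", resp.Status)
	}

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	// Hypothetical values; replace with your exporter address and probed target.
	logs, err := fetchProbeLogs("http://blackbox-exporter:9115", "https://example.com")
	if err != nil {
		fmt.Println("could not enrich alert with probe logs:", err)
		return
	}
	fmt.Println(logs)
}
```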

electron0zero · Oct 29 '25 18:10

Thank you for these suggestions!

Using the logs endpoint seems promising.

However, this brings up the following two issues:

  1. (Major) We frequently have multiple prober modules configured for the same target. In that case the endpoint doesn't let us ensure that we receive the logs for the correct module. This issue would be solved by merging https://github.com/prometheus/blackbox_exporter/pull/1257. If you could give that PR a review, that would be greatly appreciated.
  2. (Minor) There is a potential race condition in this approach: it is theoretically possible that a new probe has completed by the time the alert enricher fetches the logs. I know neither how much of an actual problem this is, ~~nor what would be a good solution.~~ (maybe #1491 would help with both of these issues)

slrtbtfs · Nov 05 '25 14:11

> 1. This issue would be solved by merging "Fix retrieving probe logs by target name when probed by multiple modules" #1257.

@slrtbtfs https://github.com/prometheus/blackbox_exporter/pull/1257 is merged into main and will be included in the next release.

electron0zero · Nov 25 '25 11:11

> > 1. This issue would be solved by merging "Fix retrieving probe logs by target name when probed by multiple modules" #1257.
>
> @slrtbtfs #1257 is merged into main and will be included in the next release.

Thanks a lot, I'm closing this PR then. (Maybe #1077 can be closed, too?)

slrtbtfs · Nov 26 '25 09:11