blackbox_exporter
Keep failed result history per target
Currently the result history is stored globally across all probes. This means that if there is one target that is constantly failing, and one that only fails occasionally, the failing one will kick the rare one out of the result history.
So when we later try to understand why that rare failure occurred, it is likely already gone from the history.
If we were to track these separately per target, it'd be much easier to figure out what happened, without having to increase the history limit.
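To make the idea concrete, here is a minimal sketch of per-target tracking in Go. It is not how blackbox_exporter stores its history today; the names Result, FailureHistory and the perTarget limit are hypothetical and only illustrate the eviction behaviour being asked for.

```go
package main

import "sync"

// Result is a hypothetical record of one finished probe.
type Result struct {
	ID      int64
	Module  string
	Target  string
	Success bool
	Debug   string // captured debug output of the probe
}

// FailureHistory keeps the most recent failed results separately per target,
// so a constantly failing target cannot evict another target's failures.
type FailureHistory struct {
	mu        sync.Mutex
	perTarget int                 // max failures kept for each target
	failures  map[string][]Result // keyed by target
}

// NewFailureHistory returns a history that keeps up to perTarget failures per target.
func NewFailureHistory(perTarget int) *FailureHistory {
	return &FailureHistory{perTarget: perTarget, failures: make(map[string][]Result)}
}

// Add records r if it failed. When the target's list is full, only that
// target's oldest entry is dropped.
func (h *FailureHistory) Add(r Result) {
	if r.Success || h.perTarget <= 0 {
		return
	}
	h.mu.Lock()
	defer h.mu.Unlock()
	list := append(h.failures[r.Target], r)
	if len(list) > h.perTarget {
		list = list[len(list)-h.perTarget:]
	}
	h.failures[r.Target] = list
}

// Failures returns a copy of the stored failures for target, oldest first.
func (h *FailureHistory) Failures(target string) []Result {
	h.mu.Lock()
	defer h.mu.Unlock()
	return append([]Result(nil), h.failures[target]...)
}
```

Note that the map in this sketch grows with the number of distinct targets, so a real implementation would also need some cap on how many targets are tracked.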
If I understand correctly what you are saying, you want to do some relabeling in Prometheus. For example, as shown here: https://www.robustperception.io/what-percentage-of-time-is-my-service-down-for
With that particular configuration each target will get its own "instance" value, and each module will get its own "job", so you can query the job/instance combination.
Is that what you are trying to do?
I think this is more about the history shown in the UI.
I think this is really difficult, because we can have an unbounded number of targets; which targets get probed is up to whoever sends the requests.
Oh, I understand.
I think you want to capture and upload blackbox_exporter logs, so that you can see the failure (e.g. probe_success) and go to the corresponding logs to identify the issue. You can use e.g. Loki for that.
We keep some amount of debug logs in memory in the exporter, so it's visible in the UI.
The difficulty we have is that we have a medium number of blackbox targets, around a couple hundred, broken down into 5 or so modules.
We can enable longer history, but the UI isn't organized by module or target, so it's hard to follow.
The other issue is there's no option for the blackbox exporter to log failures only. So you can only run at debug level, which is too noisy.
Having an option like --probe.log-failures would make the logs sent to Loki or whatever more useful.
I developed a proxy to do this. The proxy takes the /metrics call, adds ?debug=true to the query, passes it to blackbox_exporter, saves the logs and metrics in a CSV file, and returns the metrics to Prometheus.
(This is YOLO quality so it's not on GitHub)
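For anyone who wants to try the same approach, a proxy like the one described could look roughly like this in Go. This is not the actual proxy from the comment above. The upstream URL is whatever you pass in, and the three section headers are an assumption about the layout of the debug=true output, so check them against your exporter version.

```go
package main

import (
	"encoding/csv"
	"io"
	"net/http"
	"net/url"
	"strings"
	"sync"
	"time"
)

// Section headers assumed to appear in the debug=true output.
const (
	logsHeader    = "Logs for the probe:"
	metricsHeader = "Metrics that would have been returned:"
	moduleHeader  = "Module configuration:"
)

// withDebug returns rawQuery with debug=true set.
func withDebug(rawQuery string) (string, error) {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return "", err
	}
	q.Set("debug", "true")
	return q.Encode(), nil
}

// splitDebug separates the logs and metrics sections of a debug response.
// If the logs header is missing, the whole body is treated as metrics.
func splitDebug(body string) (logs, metrics string) {
	_, rest, ok := strings.Cut(body, logsHeader)
	if !ok {
		return "", body
	}
	logs, rest, ok = strings.Cut(rest, metricsHeader)
	if !ok {
		return strings.TrimSpace(logs), ""
	}
	metrics, _, _ = strings.Cut(rest, moduleHeader)
	return strings.TrimSpace(logs), strings.TrimSpace(metrics) + "\n"
}

// proxyHandler forwards each scrape to upstream with debug=true, appends
// one CSV row (time, target, logs, metrics) and returns only the metrics.
func proxyHandler(upstream string, record *csv.Writer) http.HandlerFunc {
	var mu sync.Mutex // csv.Writer is not safe for concurrent use
	return func(w http.ResponseWriter, r *http.Request) {
		query, err := withDebug(r.URL.RawQuery)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		resp, err := http.Get(upstream + "?" + query)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()
		body, err := io.ReadAll(resp.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		logs, metrics := splitDebug(string(body))
		now := time.Now().UTC().Format(time.RFC3339)
		mu.Lock()
		record.Write([]string{now, r.URL.Query().Get("target"), logs, metrics})
		record.Flush()
		mu.Unlock()
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		io.WriteString(w, metrics)
	}
}
```

Prometheus would then scrape the proxy instead of the exporter, with the same target and module parameters as before.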
I think we can easily implement a flag for logging errors. Maybe @igorwwwwwwwwwwwwwwwwwwww would be interested in implementing this?
I'd merge that.