karma
No metrics for probeVersion failures (bug ?)
I sometimes get 503 errors in the logs:
karma-5666566999-n28pv karma level=error msg="Request failed" error="request to https://***redacted***alertmanager:9093/metrics failed with 503 Service Unavailable" alertmanager=***redacted*** uri=https://***redacted***alertmanager:9093/
It seems that the error message comes from /internal/alertmanager/models.go#L92
And probeVersion is called at /internal/alertmanager/models.go#L366.
Question 1: could you confirm this?
In this code, I also noticed that when an error occurs, probeVersion returns "" and logs the error, but:
- fetching the status (line 379) is not blocked. How can it work if you get a 503 error when trying to retrieve the Alertmanager version?
- there is no metric to show that probing the version failed.
Question 2 / Bug?: when probing the version fails but Karma goes on to retrieve silences and alerts, is this a bug?
Question 3 / Feature request: could you create a metric that shows when probing the Alertmanager version failed? Shouldn't it stop at line 370?
For this feature request, maybe you could create a metric named karma_alertmanager_probed_version with the version as a label and the value set to 1, or 0 if something failed?
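To make the proposal concrete, here is a minimal sketch of what one sample of such a gauge could look like in the Prometheus text exposition format. It uses only the Go standard library; the function name `probedVersionSample` and the choice of an empty `version` label on failure are assumptions for illustration, not karma's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// labelEscaper escapes backslash, double quote and newline, the characters
// that must be escaped inside label values in the text exposition format.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// probedVersionSample renders one sample of the proposed (hypothetical)
// karma_alertmanager_probed_version gauge: value 1 with the detected
// version when probing worked, value 0 with an empty version otherwise.
func probedVersionSample(version string, ok bool) string {
	value := 1
	if !ok {
		version = ""
		value = 0
	}
	return fmt.Sprintf("karma_alertmanager_probed_version{version=\"%s\"} %d",
		labelEscaper.Replace(version), value)
}

func main() {
	fmt.Println(probedVersionSample("0.21.0", true))
	// prints: karma_alertmanager_probed_version{version="0.21.0"} 1
	fmt.Println(probedVersionSample("", false))
	// prints: karma_alertmanager_probed_version{version=""} 0
}
```

With several Alertmanager instances, a label identifying the instance would probably be needed as well.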
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hello,
Any thoughts on my questions and the feature request?
It's not a bug: if karma cannot detect the Alertmanager version, it assumes the latest compatible version. What do you need a metric for? It sounds like your Alertmanager (or whatever it sits behind) is failing with 503.
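The fallback described above can be sketched roughly as follows. This is not karma's implementation: the function names and the `fallbackVersion` value are placeholders, and the sketch assumes the version is read from the `version` label of the `alertmanager_build_info` metric in the `/metrics` response.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// fallbackVersion is a placeholder for "latest compatible version";
// the real value used by karma is not stated in this thread.
const fallbackVersion = "0.22.0"

// labelValue extracts the value of an exact label name from one metric line.
// It requires '{' or ',' before the name so that "version" does not match
// inside "goversion". Escaped quotes in values are not handled.
func labelValue(line, name string) (string, bool) {
	key := name + `="`
	offset := 0
	for {
		i := strings.Index(line[offset:], key)
		if i < 0 {
			return "", false
		}
		start := offset + i
		if start > 0 && (line[start-1] == '{' || line[start-1] == ',') {
			rest := line[start+len(key):]
			end := strings.IndexByte(rest, '"')
			if end < 0 {
				return "", false
			}
			return rest[:end], true
		}
		offset = start + len(key)
	}
}

// parseBuildInfoVersion looks for alertmanager_build_info in a /metrics body.
func parseBuildInfoVersion(body string) (string, bool) {
	for _, line := range strings.Split(body, "\n") {
		if strings.HasPrefix(line, "alertmanager_build_info{") {
			return labelValue(line, "version")
		}
	}
	return "", false
}

// versionOrFallback returns the probed version, or fallbackVersion when the
// request failed (for example with a 503) or the metric was missing.
// probed reports whether the version really came from Alertmanager.
func versionOrFallback(status int, body string) (version string, probed bool) {
	if status != http.StatusOK {
		return fallbackVersion, false
	}
	if v, ok := parseBuildInfoVersion(body); ok {
		return v, true
	}
	return fallbackVersion, false
}

func main() {
	body := `alertmanager_build_info{branch="HEAD",goversion="go1.14.4",version="0.21.0"} 1`
	v, probed := versionOrFallback(http.StatusOK, body)
	fmt.Printf("%s probed=%t\n", v, probed)
	v, probed = versionOrFallback(http.StatusServiceUnavailable, "")
	fmt.Printf("%s probed=%t\n", v, probed)
}
```

Under this logic a 503 on the version probe only costs the detected version; later requests for alerts and silences still go ahead with the fallback value.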
I have not had 503 errors for a while.
When I wrote this issue, maybe there was a problem on Alertmanager that I could not reproduce at that moment, and that blocked Karma for the version but not for the alerts&silences.
As you say, Karma assumes the latest compatible version when it cannot retrieve the Alertmanager version. This is why Karma kept working and I noticed nothing except the log line with the 503 error.
I have had no problem for a while: should we close this issue?
About the feature request: the metric could be a counter that increments every time Karma fails to connect to Alertmanager. A native metric should be easier to use than a custom metric built with Promtail matching on 503. But since I have had no problem for a while, I am not sure I still need it.
The karma_alertmanager_errors_total and karma_alertmanager_up metrics are already exported.
Thanks, I'll give these metrics a try.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.