netdata-cloud
[Feat]: Marking metrics or collectors as required for alerting purposes
Problem
From: https://community.netdata.cloud/t/marking-metrics-or-collectors-as-required-for-alerting-purposes/4613/1
So I had an automatic upgrade on a node this morning which upgraded a bunch of stuff including Netdata. The result was that the web_log collector stopped working and I have a big gap in my charts for today. The issue was easily resolved via a restart of Netdata but this seems like an area for improvement: namely, some form of alerting for when we stop getting any metrics which we were already getting (maybe ML can be leveraged here too).
Description
We should have some way of alerting when collection of metrics stops but the agent is still running and those metrics had already been collected for some time. You probably want some sort of weighting or credit system so that a metric can build up weight or credit the longer it runs, until it can be considered long-lived. That way, if the metric stops being collected, some logic could check whether the metric was long-lived and raise an alert.
OTOH, this could also be useful for transient alerts that you might get when metrics have only just started being collected, where an alert could be given less weight or ignored because the metric is not yet considered long-lived and thus established. Obviously, this should probably be configurable so that ephemeral stuff gets handled differently, as it might never accumulate enough credit to be considered long-lived.
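To make the credit idea a bit more concrete, here is a rough Python sketch. It is only an illustration and not how Netdata works today: the `CreditTracker` class, its method names, the `long_lived_after` and `stale_after` thresholds, and the metric names are all made up for the example. A metric earns credit for the time between consecutive samples, and only metrics with enough credit are reported when they go quiet.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MetricState:
    credit: float = 0.0                # seconds of collection accumulated so far
    last_seen: Optional[float] = None  # timestamp of the last collected sample


class CreditTracker:
    """Hypothetical tracker: metrics earn credit while they keep reporting."""

    def __init__(self, long_lived_after: float = 3600.0, stale_after: float = 60.0):
        self.long_lived_after = long_lived_after  # credit needed to count as long-lived
        self.stale_after = stale_after            # silence longer than this counts as a gap
        self.metrics: Dict[str, MetricState] = {}

    def observe(self, name: str, now: float) -> None:
        """Record that a sample for `name` was collected at time `now`."""
        state = self.metrics.setdefault(name, MetricState())
        if state.last_seen is not None:
            gap = now - state.last_seen
            # Only time between closely spaced samples earns credit;
            # gaps longer than stale_after earn nothing.
            if 0 <= gap <= self.stale_after:
                state.credit += gap
        state.last_seen = now

    def is_long_lived(self, name: str) -> bool:
        """True once the metric has built up enough credit to be 'established'."""
        state = self.metrics.get(name)
        return state is not None and state.credit >= self.long_lived_after

    def missing_long_lived(self, now: float) -> List[str]:
        """Long-lived metrics that have been silent for more than stale_after."""
        return sorted(
            name
            for name, state in self.metrics.items()
            if state.last_seen is not None
            and now - state.last_seen > self.stale_after
            and state.credit >= self.long_lived_after
        )


# Example: one established metric and one short-lived metric both go quiet.
tracker = CreditTracker(long_lived_after=100, stale_after=10)
for t in range(0, 121, 5):
    tracker.observe("web_log.requests", t)
tracker.observe("ephemeral.job", 115)
tracker.observe("ephemeral.job", 120)
print(tracker.missing_long_lived(now=200))
```

The same `is_long_lived` check could cover the second use case: an alert on a metric that is not yet long-lived could be given less weight or ignored. Both thresholds would need to be configurable, and a real implementation would probably also want credit to decay or be capped.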
Importance
must have
Value proposition
- I'd say that this is a really big issue because, as in my case, you can think that you are monitoring correctly because the agent is live and all the other metrics are working fine, and yet you could be missing some major metrics and, possibly, critical alerts.
Proposed implementation
No response
This issue has been mentioned on the Netdata Community Forums. There might be relevant details there:
https://community.netdata.cloud/t/marking-metrics-or-collectors-as-required-for-alerting-purposes/4613/3
Any update on this?