[Feat]: Add anomaly rate into alerts templates

Open andrewm4894 opened this issue 2 years ago • 0 comments

Problem

Currently in alerts from Netdata we get some information like this:

image

I see the number 1.33%. How should I think about this? Is 1.33% normal or typical for this node?

Using the corresponding anomaly rate might be one way to get some additional context, or a feel for whether this 1.33% is something "strange" for this node or not.

Description

Somewhere in the alert template we should add the corresponding anomaly rate for the previous 1 minute (or maybe straddle the 60 seconds around the alert, if at all possible given how the data flows). Let's call this "1 min AR%".

That number alone might be useful in helping me as a user decide how to react. At the very least it provides some additional context to the alert.

For example, for the specific alert above I go look at the chart and see a "1 min AR%" of maybe 5%. That tells me the value may have been considered strange for this node, but the fact that the 1 min AR was only 5% means it's more of a spike than a persistent anomaly. Had the 1 min AR been 50% or more, I might react differently.

image

So the request here is to add the "1 min AR" to the alert templates - could even generalise it to "N min AR" and maybe let users configure it themselves.
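To make the "N min AR" idea concrete, here is a minimal sketch. It assumes the anomaly rate is the percentage of samples in a window whose anomaly bit is set, and that one anomaly bit per second is available for the dimension. The function names and the one-sample-per-second layout are illustrative assumptions, not the agent's actual API.

```python
from typing import Sequence


def anomaly_rate(bits: Sequence[int]) -> float:
    """Percentage of samples in the window whose anomaly bit is set (0 or 1)."""
    if not bits:
        return 0.0
    return 100.0 * sum(bits) / len(bits)


def n_min_ar(bits_per_second: Sequence[int], n_minutes: int = 1) -> float:
    """AR% over the most recent n_minutes of per-second anomaly bits.

    Hypothetical helper: assumes exactly one sample per second, newest last.
    """
    if n_minutes < 1:
        raise ValueError("n_minutes must be >= 1")
    window = bits_per_second[-60 * n_minutes:]
    return anomaly_rate(window)


# 3 anomalous seconds out of the last 60: a short spike, not a persistent anomaly.
history = [0] * 57 + [1] * 3
print(n_min_ar(history))  # 5.0
```

Making `n_minutes` a parameter is what would let users configure the window themselves, as suggested above.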

No idea where or how we might add it - somewhere prominent but obviously not too distracting. Maybe something like below.

image

We should also add the node anomaly rate itself too:

image

We can also link out to the docs to explain the new terms, e.g. node anomaly rate and anomaly rate.

Importance

must have

Value proposition

  1. Uses something we already have (chart/dim and node anomaly rates) to provide some additional context to an alert.

Proposed implementation

  • Add the 1 min AR for the dim in question to alert payloads.
  • Add 1 min node anomaly rate for the node in question to alert payloads.
  • Store and treat like we do any other alert template values.
  • Render appropriately on all notification templates.
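As a rough illustration of the steps above, the sketch below carries the two new values on an alert event and renders them into a notification line. The `AlertEvent` class, its field names, the alert name, and the template wording are all hypothetical; the real payload schema and templates are not described in this issue.

```python
from dataclasses import dataclass
from string import Template


@dataclass
class AlertEvent:
    """Hypothetical alert payload; field names are assumptions for illustration."""
    name: str
    value: float                   # the alert value, e.g. 1.33 (%)
    anomaly_rate_1min: float       # 1 min AR% for the dimension in question
    node_anomaly_rate_1min: float  # 1 min node anomaly rate %


# Hypothetical notification template; "$" marks placeholders, "%" is literal.
NOTIFICATION = Template("$name = $value% (1 min AR: $ar%, 1 min node AR: $node_ar%)")


def render(event: AlertEvent) -> str:
    """Fill the template, treating the AR fields like any other template value."""
    return NOTIFICATION.substitute(
        name=event.name,
        value=f"{event.value:.2f}",
        ar=f"{event.anomaly_rate_1min:.1f}",
        node_ar=f"{event.node_anomaly_rate_1min:.1f}",
    )


event = AlertEvent("example_alert", 1.33, 5.0, 0.8)
print(render(event))
# example_alert = 1.33% (1 min AR: 5.0%, 1 min node AR: 0.8%)
```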

Questions

  1. @stelfrag or @MrZammler would this require some agent work to add this info to the alert spec or to the protobuf the agent sends to NC?
  2. Would also need to discuss with @netdata/cloud-be how feasible it is to get this data as part of the existing data flows. Even better if we could straddle each side of the alert timestamp to see the AR "around" the alert in addition to just "previous" to it. But this might be too complex and if so just getting AR prior to alert timestamp would be fine.
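The difference between the "previous" and "around" windows in question 2 can be sketched as follows. This assumes timestamped per-second anomaly bits are available as `(unix_ts, anomaly_bit)` pairs; the helper names are hypothetical and say nothing about how the existing data flows actually expose this data.

```python
from typing import List, Tuple

Sample = Tuple[int, int]  # (unix timestamp in seconds, anomaly bit 0 or 1)


def window_ar(samples: List[Sample], start: int, end: int) -> float:
    """AR% over samples with start <= ts < end; 0.0 if the window is empty."""
    bits = [bit for ts, bit in samples if start <= ts < end]
    return 100.0 * sum(bits) / len(bits) if bits else 0.0


def ar_before(samples: List[Sample], alert_ts: int, seconds: int = 60) -> float:
    """AR% for the window ending at the alert timestamp ("previous")."""
    return window_ar(samples, alert_ts - seconds, alert_ts)


def ar_around(samples: List[Sample], alert_ts: int, seconds: int = 60) -> float:
    """AR% for a window centred on the alert timestamp ("straddle")."""
    half = seconds // 2
    return window_ar(samples, alert_ts - half, alert_ts + half)


# Anomaly bits are set for the 12 seconds starting exactly at the alert.
alert_ts = 1000
samples = [(t, 1 if 1000 <= t < 1012 else 0) for t in range(900, 1100)]
print(ar_before(samples, alert_ts))  # 0.0
print(ar_around(samples, alert_ts))  # 20.0
```

In this example the trailing window sees nothing while the centred window picks up the anomaly, which is the appeal of straddling. The cost is that a centred window needs samples from after the alert timestamp, so the value could only be computed once those samples exist.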

Depends on [Feat]: add 1min_anomaly_rate and 1min_node_anomaly_rate to alarm events. #14701

andrewm4894, Mar 08 '23 12:03