[Feat]: add `1min_anomaly_rate` and `1min_node_anomaly_rate` to alarm events.

Open andrewm4894 opened this issue 3 years ago • 6 comments

Problem

We need to expose anomaly rates as part of alerts.

This feature request aims to build the first piece of this by adding two new fields to alarms in the agent.

Description

  • 1min_anomaly_rate: the average anomaly rate across all dimensions involved in the alarm over the preceding 60 seconds.
  • 1min_node_anomaly_rate: the average overall node anomaly rate over the preceding 60 seconds, which can be taken from the anomaly_detection.anomaly_rate chart.
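
To make the two definitions concrete, here is a minimal sketch of the arithmetic. It assumes one 0/1 anomaly bit per dimension per second and one node anomaly rate sample per second; the function names and the in-memory data layout are hypothetical and are not the agent's actual implementation.

```python
from statistics import mean


def anomaly_rate(bits):
    """Percentage of samples flagged anomalous in a window of 0/1 anomaly bits."""
    if not bits:
        return 0.0
    return 100.0 * sum(bits) / len(bits)


def one_min_anomaly_rate(dim_bits, window=60):
    """Average anomaly rate over the dimensions involved in an alarm.

    dim_bits maps a dimension name to its list of 0/1 anomaly bits,
    oldest first (hypothetical layout). Only the last `window` samples
    of each dimension are considered.
    """
    rates = [anomaly_rate(bits[-window:]) for bits in dim_bits.values()]
    return mean(rates) if rates else 0.0


def one_min_node_anomaly_rate(node_rates, window=60):
    """Average of the node-level anomaly rate samples in the last `window` seconds.

    node_rates stands in for values read from the
    anomaly_detection.anomaly_rate chart, in percent, oldest first.
    """
    recent = node_rates[-window:]
    return mean(recent) if recent else 0.0
```

For example, an alarm over two dimensions where one had 6 anomalous seconds out of 60 and the other had none would report a 1min_anomaly_rate of 5%.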

Importance

must have

Value proposition

  1. First step in adding anomaly rate percentages (AR%) into alert templates etc. to provide more context.

...

Proposed implementation

TBD with agent team input. The main idea is to calculate these values either as part of the health engine itself or on state transition of alerts.
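
As a rough illustration of the state-transition option, the sketch below computes the two values only when an alarm changes status and attaches them to the resulting event. The `AlarmEvent` type, the `on_transition` function, and the two callback parameters are all hypothetical; the field names are Python-friendly stand-ins for 1min_anomaly_rate and 1min_node_anomaly_rate.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AlarmEvent:
    """Hypothetical alarm event record carrying the two proposed fields."""
    name: str
    old_status: str
    new_status: str
    anomaly_rate_1min: Optional[float] = None       # stands in for 1min_anomaly_rate
    node_anomaly_rate_1min: Optional[float] = None  # stands in for 1min_node_anomaly_rate


def on_transition(
    name: str,
    old_status: str,
    new_status: str,
    get_alarm_ar: Callable[[], float],
    get_node_ar: Callable[[], float],
) -> Optional[AlarmEvent]:
    """Build an event only when the status changes.

    The anomaly rate callbacks are invoked lazily, so the cost of
    computing them is paid once per transition, not on every health check.
    """
    if old_status == new_status:
        return None  # no transition, nothing to record
    return AlarmEvent(name, old_status, new_status, get_alarm_ar(), get_node_ar())
```

The trade-off is that values computed this way are only available on events, whereas calculating them inside the health engine would also make them usable in alarm expressions.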

Doing this will enable [Feat]: Add anomaly rate into alerts templates #757

andrewm4894 avatar Mar 10 '23 10:03 andrewm4894

@MrZammler can you have a think about this one? I might follow up with you next week, but I just wanted to give you a heads-up so we can start discussing how easy or feasible this may be.

Hopefully you will tell me it is super easy and simple :)

andrewm4894 avatar Mar 10 '23 13:03 andrewm4894

@MrZammler what would you think about getting together a minimal POC PR at some stage in the next week or two to do this:

  1. Make 1min_node_anomaly_rate available as a variable.
  2. Add the ability to reference it and include it in the info of an alert.

The idea being a sort of minimal POC to get going.

@ilyam8 @Ferroin @shyamvalsan FYI: the idea here is to start with the simplest POC we can think of.

andrewm4894 avatar Apr 18 '23 10:04 andrewm4894

While we're at it, can we capture this along with both of these:

  • triggeredValue
  • latestValue

shyamvalsan avatar Apr 18 '23 19:04 shyamvalsan

@MrZammler (Off topic) It might also be useful to get the node OS and node type (k8s vs container vs bare metal) in the alert info since troubleshooting steps can be customized accordingly.

(Feature request submitted --> https://github.com/netdata/netdata/issues/14923)

shyamvalsan avatar Apr 18 '23 19:04 shyamvalsan

@shyamvalsan you can make new feature requests for that stuff :)

andrewm4894 avatar Apr 18 '23 19:04 andrewm4894

Draft POC PR here: https://github.com/netdata/netdata/pull/15012

andrewm4894 avatar May 05 '23 13:05 andrewm4894