[Feat]: Cloud based Reachability Alerts - user configurable functionality improvements

Open dogsbody-josh opened this issue 1 year ago • 4 comments

Problem

Netdata Reachability Alerts are not configurable, which leads to excessive alerts during upgrades and from ephemeral nodes or nodes that are turned off and on on a schedule.

The lack of configurability is also a blocker to getting effective notifications that reach the right teams in the right way.

Finally, because Reachability Alerts are either on or off for all nodes in a room, some ephemeral nodes have to be organised differently just to accommodate this lack of functionality.

Description

Reachability Alerts need the following functionality:

  1. Configurable per individual node, room or custom/selectable set of nodes across rooms. All the points below should be configurable in this way too.
  2. Conditions for triggering the alert should be configurable. Configuration options should include (at least) a customisable delay before triggering. This will solve issues like #858 and is of particular importance to parent/child setups where a restart of the parent agent can cascade hundreds of Reachability Alerts.
  3. Control over the content of the notification, including subject and body.
  4. Control over the way alerts are delivered, including the method (not just email or mobile app) and preferably integrating similar functionality to the Agent Dispatched Notifications (Roles and alternative notification methods).
  5. Ability to silence alerts per node, room, or custom/selectable set of nodes.
  6. Silences should be schedulable and have a 'recurring' option, so that nodes that are switched off overnight, or during any other recurring time period, can be silenced appropriately.
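
As an illustration of point 6, here is a minimal sketch of how a recurring silence window could be evaluated, including a window that wraps past midnight for nodes switched off overnight. The `RecurringSilence` class and its fields are hypothetical names used only for this example; they are not part of Netdata Cloud.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class RecurringSilence:
    """Hypothetical recurring silence window (illustration only)."""
    start: time           # local time at which the silence begins
    end: time             # local time at which the silence ends
    weekdays: frozenset   # days on which a window STARTS (0 = Monday ... 6 = Sunday)

    def is_active(self, now: datetime) -> bool:
        t = now.time()
        if self.start <= self.end:
            # Window is contained within a single day.
            return now.weekday() in self.weekdays and self.start <= t < self.end
        # Window wraps past midnight (for example 22:00 to 06:00).
        if t >= self.start:
            # Evening part: the window started today.
            return now.weekday() in self.weekdays
        if t < self.end:
            # Morning part: the window started yesterday.
            return (now.weekday() - 1) % 7 in self.weekdays
        return False


# Silence nodes that are switched off overnight, Monday to Friday evenings.
overnight = RecurringSilence(time(22, 0), time(6, 0), frozenset({0, 1, 2, 3, 4}))
```

With this sketch, a node that goes unreachable at 23:00 on a Wednesday or at 05:00 the following morning would be silenced, while one that goes unreachable at midday would still notify.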

This functionality should be configured centrally by the 'main' account for the service. This is particularly important for Business customers, where a single main account invites team members. Team members shouldn't have to configure Reachability Alerts individually; the alerts should be managed in one secured account location, and point 4 above would be used to control where Reachability Alerts are delivered.

Importance

blocker

Value proposition

Reachability Alerts are a critical component of a monitoring solution and deserve first class status. Knowing that a node is no longer reporting to the monitoring solution is as important as the monitoring itself. A node that is not reporting in has the potential to lose metrics and, more importantly, to not trigger health alerts.

Because of this we believe it is absolutely essential to immediately detect if a non-ephemeral node is no longer reachable so that relevant teams can immediately investigate and rectify the issue. This is even more important in parent/child setups where the parent is responsible for health alerts. Should such a parent node go unreachable it's possible that any subsequent health alerts for all nodes would not trigger.

Our aim is to avoid a situation where a node silently loses monitoring/alerting, while still being able to configure the parameters and notification options for receiving Reachability Alerts.

Proposed implementation

We don't have any specific implementation proposals beyond what is alluded to above: for Business customers with multiple invited accounts under a main Space 'owner' account, the new functionality should be placed under the main owner account.

dogsbody-josh, Aug 14 '24 15:08

@dogsbody-josh : We are making improvements to the reachability notifications (from the cloud). You may have noticed that we are grouping the reachability notifications if an agent is streaming to a parent to reduce the number of notifications that are sent out. We are also introducing configurable timeouts for reachability notifications (for the space initially) where the user can define a timeout that can cater to upgrades / manual restart of the agents.
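
To illustrate the idea of a configurable timeout, here is a minimal sketch of a delay-before-notify check: a node only produces a reachability notification once it has been unreachable for longer than the configured delay, so short restarts during upgrades stay quiet. The names `ReachabilityState` and `should_notify` are hypothetical, and this is not a description of how Netdata Cloud implements the feature.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReachabilityState:
    """Hypothetical per-node state (illustration only)."""
    unreachable_since: Optional[float] = None  # timestamp of the first failed check
    notified: bool = False                     # whether a notification was already sent


def should_notify(state: ReachabilityState, reachable: bool,
                  now: float, delay_seconds: float) -> bool:
    """Return True exactly once, when the node has stayed unreachable
    for at least delay_seconds."""
    if reachable:
        # Node is back: reset so a future outage starts a fresh timer.
        state.unreachable_since = None
        state.notified = False
        return False
    if state.unreachable_since is None:
        state.unreachable_since = now
    if not state.notified and now - state.unreachable_since >= delay_seconds:
        state.notified = True
        return True
    return False
```

With a 300 second delay, an agent that restarts and reconnects within two minutes never triggers a notification, while one that stays down for five minutes triggers exactly one.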

  1. Configurable per individual node, room or custom/selectable set of nodes across rooms. All the points below should be configurable in this way too. --> We will make this configurable per space initially and will introduce additional configurations later on.
  2. Conditions for triggering the alert should be configurable. Configuration options should include (at least) a customisable delay before triggering. This will solve issues like [Bug]: Unreachable alerts during upgrades #858 and is of particular importance to parent/child setups where a restart of the parent agent can cascade hundreds of Reachability Alerts. --> I think this already exists for alerts and I assume you are referring to reachability notifications (which are not alerts in the Netdata terminology) and will be solved with the configurable timeouts mentioned in 1.
  3. Control over the content of the notification, including subject and body. --> This is not on our roadmap for now and we will try and look at this later on.
  4. Control over the way alerts are delivered, including the method (not just email or mobile app) and preferably integrating similar functionality to the Agent Dispatched Notifications (Roles and alternative notification methods). --> These notifications are already configurable for user-specific or organisation-specific integrations. Reachability notifications are not supported on the agent dispatched notifications.
  5. Ability to silence alerts per node/room/or custom/selectable set of nodes. --> This already exists as part of our silencing feature on Netdata Cloud (only available on paid plans though)
  6. Silences should be schedule-able, and have a 'recurring' functionality. This is so that nodes that are switched off over night or other particular recurring time period can be silenced appropriately. --> We have the scheduling feature already and the recurring functionality will be introduced soon.

cc: @car12o @juacker

sashwathn, Sep 11 '24 12:09

Hi @sashwathn this sounds like an awesome start. Just being able to have a delay on the reachability alerts will probably reduce ~90% of our alerts.

I think it's safe to say that when @dogsbody-josh mentions "Alerts" in this ticket he is talking about reachability alerts. As such, I am a little confused by your answers to 4, 5 & 6? We are not aware of any ability to make these changes to reachability alerts?

Thank you

dogsbody, Sep 11 '24 14:09

@dogsbody : You are right, I was talking about the silencing feature for Alerts in general. For the reachability notifications, we only have a toggle to turn them on / off, and maybe we can introduce something similar to the alerts to schedule a silencing rule etc. @car12o @juacker : Wdyt about the silencing rules for reachability notifications?

sashwathn, Sep 11 '24 15:09

  4. Control over the way alerts are delivered, including the method (not just email or mobile app) and preferably integrating similar functionality to the Agent Dispatched Notifications (Roles and alternative notification methods).
  5. Ability to silence alerts per node/room/or custom/selectable set of nodes.

This is already supported. Any notification integration configured on Cloud will deliver alerts & reachability notifications.

Regarding silencing rules (which already support schedule & recurring for alerts), we could extend this functionality for reachability as well. Sounds like a good idea to me.

car12o, Sep 23 '24 16:09

Configurable reachability delay is now released. You can find it in your Space settings, under the Alerts & Notifications menu, in the Reachability tab.

car12o, Oct 08 '24 07:10

@dogsbody-josh : We have now introduced configurable timeouts for reachability notifications for the space and per room. You can access this under Space Settings --> Alerts and Notifications --> Reachability.

Hope this helps.

sashwathn, Oct 18 '24 09:10

This doesn't seem to work on the free plan.

rome-legacy, Jul 13 '25 23:07