[Feat]: Provide node offline reasons to Cloud

Open ralphm opened this issue 11 months ago • 0 comments

Problem

Nodes can become unavailable for various reasons, some of which are expected. Even expected outages may trigger unwanted notifications.

Description

When a node becomes unavailable, Netdata Cloud sends out unreachability notifications regardless of the reason the node became unavailable. For Agent restarts (manual, or automatic because of an Agent upgrade), these notifications might not be needed, because there's nothing "wrong" with the monitored infrastructure. Unexpected disconnects (directly via ACLK, or indirectly through streaming), however, typically indicate a problem in the infrastructure: either a network issue, or a problem with the host where the (child) Agent is supposed to be running.

Additionally, it would be good to be able to explicitly highlight that nodes are unreachable in the UI.

Importance

nice to have

Value proposition

Reduced notification noise.
Better visibility of offline nodes in the UI.

Proposed implementation

### Agent
- [ ] Implement a way for the Agent to be told that a scheduled shutdown is impending
- [ ] Provide a _reason_ for the scheduled shutdown (manual, agent upgrade, uninstall, decommissioning of the host, etc.)
- [ ] Provide a TTL for the expected reconnect. If the Agent does not come back online within the TTL, this can be used to decide whether to send out a notification.
- [ ] Communicate node offline reason with TTL to Cloud.
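
To make the Agent items above more concrete, here is a minimal sketch of what an offline notice could carry. All names (`OfflineReason`, `OfflineNotice`, the field names, the JSON encoding) are hypothetical and only illustrate the idea; the actual ACLK message format is not defined in this issue.

```python
import json
from dataclasses import dataclass, asdict
from enum import Enum
from typing import Optional


class OfflineReason(str, Enum):
    # Hypothetical reason codes, based on the examples listed above.
    MANUAL = "manual"
    AGENT_UPGRADE = "agent-upgrade"
    UNINSTALL = "uninstall"
    HOST_DECOMMISSION = "host-decommission"


@dataclass(frozen=True)
class OfflineNotice:
    node_id: str
    reason: OfflineReason
    # Seconds within which the Agent expects to reconnect.
    # None means no reconnect is expected (assumption).
    ttl_seconds: Optional[int]

    def to_json(self) -> str:
        payload = asdict(self)
        payload["reason"] = self.reason.value  # serialize the plain string code
        return json.dumps(payload, sort_keys=True)


notice = OfflineNotice("node-1", OfflineReason.AGENT_UPGRADE, 300)
print(notice.to_json())
```

Keeping the TTL optional lets the same message cover both "back soon" cases (restart, upgrade) and "not coming back" cases (uninstall, decommissioning).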

### Cloud Backend
- [ ] Process incoming node offline reason.
- [ ] Use this in the `node-instance-offline` event in the `reason` field.
- [ ] When applicable, also include this in the `agent-disconnected` event, when it would normally show a clean disconnect. This is separate from the upcoming `agent-connection-dropped` event, which would have the EMQX view of the disconnect reason.
- [ ] Use the reason as an input to notification filtering.
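
As an illustration of the notification filtering item, the sketch below shows one possible decision rule. The function name, its parameters, and the policy itself (notify immediately without a reason, suppress until the TTL expires, never notify when no reconnect is expected) are assumptions for discussion, not existing Cloud behavior.

```python
from typing import Optional


def should_notify(
    reason: Optional[str],
    ttl_seconds: Optional[float],
    seconds_offline: float,
) -> bool:
    """Decide whether an unreachability notification should be sent.

    Hypothetical policy:
    - no reason given: the disconnect was unexpected, notify right away;
    - reason with a TTL: stay quiet until the TTL has passed;
    - reason without a TTL: no reconnect is expected, never notify.
    """
    if reason is None:
        return True
    if ttl_seconds is None:
        return False
    return seconds_offline > ttl_seconds


# An Agent upgrade with a 5 minute TTL, checked after 2 and after 6 minutes.
print(should_notify("agent-upgrade", 300, 120))
print(should_notify("agent-upgrade", 300, 360))
```

Whether a permanent shutdown should be fully silent or produce a different, informational notification is an open design choice.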

### Frontend
- [ ] Provide a way to highlight (unexpected) offline nodes.
- [ ] Inputs to this include: node ephemerality and the offline reason.
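
A rough sketch of how the two inputs could be combined into a display state. It is written in Python for consistency with the other sketches here, not as frontend code; the state names and the rule that ephemeral nodes are not flagged as unexpected are assumptions.

```python
from typing import Optional


def offline_badge(is_offline: bool, ephemeral: bool, reason: Optional[str]) -> str:
    """Map node state to a hypothetical UI badge."""
    if not is_offline:
        return "online"
    if reason is not None:
        # The Agent announced the shutdown, so the outage is expected.
        return "offline-expected"
    if ephemeral:
        # Ephemeral nodes are expected to disappear without notice (assumption).
        return "offline-ephemeral"
    return "offline-unexpected"


print(offline_badge(is_offline=True, ephemeral=False, reason=None))
```

Only the `offline-unexpected` state would be highlighted prominently in this scheme.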

Additionally, we could maybe use the offline reason to change the ephemerality of a node, or even automatically remove it from Cloud. For example, if a node is known to be decommissioned (e.g. through configuration management, or by scaling down a K8s node), we know that the node is never coming back.

Feb 10 '25 14:02 ralphm