Design a new heartbeat system for PANIC Alerter

Open dillu24 opened this issue 3 years ago • 0 comments

Rationale

PANIC has a heartbeat mechanism integrated through RabbitMQ. The Telegram Commands Handler (TCH) and the Slack Commands Handler (SCH) use this mechanism to report the live status of the tool. If a component does not send a heartbeat within X seconds, the TCH and the SCH declare that component as down, so whenever the user issues the /status command they are notified that the component is down.

The heartbeat mechanism works as follows:

  • The health-checker sends a ping request to the PING rabbit exchange
  • Every manager component subscribes to the PING rabbit exchange so that whenever a ping is received, they could respond with a heartbeat
  • Upon a ping, the manager checks whether each of its child processes is running and sends a heartbeat containing a list of the processes that are running and a list of the processes that are not
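
As a rough illustration of the last step, the sketch below builds the kind of heartbeat a manager could send in response to a ping. It is not PANIC's actual code: the function name `build_heartbeat`, the payload field names, and the component name in the usage note are assumptions made for this example, and the RabbitMQ publish/subscribe plumbing is left out.

```python
import time
from typing import Callable, Dict, List


def build_heartbeat(
    component_name: str,
    children: Dict[str, Callable[[], bool]],
) -> dict:
    """Build a heartbeat payload from per-child liveness checks.

    `children` maps a child process name to a zero-argument callable
    that returns True if the process is alive (for example, the bound
    `is_alive` method of a `multiprocessing.Process`).
    All field names below are hypothetical.
    """
    running: List[str] = []
    dead: List[str] = []
    for name, is_alive in children.items():
        # Only liveness is checked here, not whether the child is making progress.
        (running if is_alive() else dead).append(name)
    return {
        "component_name": component_name,
        "running_processes": running,
        "dead_processes": dead,
        "timestamp": time.time(),
    }
```

For instance, `build_heartbeat("example_manager", {"monitor_a": proc_a.is_alive, "monitor_b": proc_b.is_alive})` would report each child under either `running_processes` or `dead_processes`, where `proc_a` and `proc_b` are hypothetical `multiprocessing.Process` objects.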

From the above one can conclude that the current heartbeat mechanism does not truly check whether a component is executing, since a process might be running yet still be running into difficulties. It is important to note that the heartbeat mechanism was designed this way because there was no other way to unblock a process which is waiting to consume data on a rabbit blocking channel. Similarly, a monitor executes every X seconds, so while it is sleeping it could not send any heartbeats.
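
The limitation can be seen in a small standalone sketch. It uses a thread blocked on `queue.Queue.get()` as a stand-in for a component waiting on a blocking channel. This is only an analogy and not PANIC code: the names `blocked_consumer`, `inbox` and `processed` are made up for the example.

```python
import queue
import threading


def blocked_consumer(inbox: queue.Queue, processed: list) -> None:
    """Stand-in for a component consuming from a blocking channel."""
    while True:
        item = inbox.get()  # blocks until a message arrives
        if item is None:    # sentinel used to stop the worker
            return
        processed.append(item)


inbox: queue.Queue = queue.Queue()
processed: list = []
worker = threading.Thread(
    target=blocked_consumer, args=(inbox, processed), daemon=True
)
worker.start()

# An "is it alive?" check passes even though the worker has done no work
# and cannot respond to anything while it is blocked.
alive_while_blocked = worker.is_alive()
work_done_while_blocked = len(processed)

# Unblock and stop the worker so the example terminates cleanly.
inbox.put(None)
worker.join(timeout=5)
alive_after_stop = worker.is_alive()
```

Here the liveness check succeeds while zero items have been processed, which is the gap described above: being alive says nothing about whether the component is healthy or making progress.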

Therefore, the aim of this ticket is to re-think this design and possibly come up with a better one.

Notes:

  • #278 may affect what we design in this ticket

For ticket closure

Come up with a heartbeat mechanism design and do the following:

  • [ ] Present the mechanism to the team
  • [ ] Document this design on Confluence
  • [ ] Create tickets that implement this design

dillu24 avatar Jun 09 '22 10:06 dillu24