Design a new heartbeat system for PANIC Alerter
Rationale
PANIC has a heartbeat mechanism integrated through RabbitMQ. This heartbeat mechanism is used by the Telegram Commands Handler (TCH) and the Slack Commands Handler (SCH) to give a live status of the tool. If a component does not send a heartbeat within X seconds, the SCH and the TCH declare that component as down, so whenever the user types the /status command they are notified that that particular component is down.
The heartbeat mechanism works as follows:
- The health-checker sends a `ping` request to the `PING` rabbit exchange
- Every manager component subscribes to the `PING` rabbit exchange so that whenever a ping is received, it can respond with a heartbeat
- Upon a ping, the manager checks whether each child process is running and sends a heartbeat with a list of processes which are running and a list of processes which are not running
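
As a rough illustration of the last step, the manager's side of the exchange could look something like the sketch below. This is not PANIC's actual code: the function name `build_heartbeat`, the `SupportsIsAlive` protocol and the heartbeat field names are all assumptions made for the example, and the RabbitMQ publishing step is left out. It only assumes that each child exposes an `is_alive()` method, as `multiprocessing.Process` does.

```python
from datetime import datetime, timezone
from typing import Dict, List, Mapping, Protocol


class SupportsIsAlive(Protocol):
    """Anything exposing is_alive(), e.g. multiprocessing.Process."""

    def is_alive(self) -> bool: ...


def build_heartbeat(component_name: str,
                    children: Mapping[str, SupportsIsAlive]
                    ) -> Dict[str, object]:
    """Build the heartbeat a manager would send in reply to a ping.

    Field names are hypothetical; they are not taken from PANIC.
    """
    running: List[str] = []
    dead: List[str] = []
    for name, process in children.items():
        # is_alive() only tells us that the OS process exists,
        # not that it is making progress.
        if process.is_alive():
            running.append(name)
        else:
            dead.append(name)
    return {
        "component_name": component_name,
        "running_processes": running,
        "dead_processes": dead,
        "timestamp": datetime.now(timezone.utc).timestamp(),
    }
```

Note that a child which is alive but blocked or failing internally would still end up in the running list here, because the check is made by the manager on the child's behalf.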
From the above, one can conclude that the current heartbeat mechanism doesn't truly check whether a component is executing or not, as a process might be running but running into difficulties. It is important to note that the heartbeat mechanism was designed this way because there was no other way to unblock a process which is waiting to consume data on a rabbit blocking channel. Similarly, a monitor executes every X seconds, so while it is sleeping it cannot send any heartbeats.
Therefore, the aim of this ticket is to re-think this design and possibly come up with a better one.
Notes:
- #278 may affect what we design in this ticket
For ticket closure
Come up with a heartbeat mechanism design and do the following:
- [ ] Present the mechanism to the team
- [ ] Document this design on Confluence
- [ ] Create tickets that implement this design