Notification/alerting system in Modernisation Platform

Open ewastempel opened this issue 10 months ago • 2 comments

User Story

As a Modernisation Platform member user, I want to be notified about the health of resources and about upcoming events that need action (e.g. expiring certificates), so that I can react and fix or prevent an issue.

Value / Purpose

Healthy application/system means no outages.

Additional Information

This came in as a request on the ask channel. The aim is to see whether we can implement a new alerting system, or reuse our existing one, in a form that members can easily consume.

The current MP alerting workflow is CloudWatch -> SNS -> PagerDuty -> Slack, and using this solution is documented here.

This ticket is to remove the need for PagerDuty to act as a middleman and to integrate with a variety of sources (CloudWatch, EventBridge, SNS) rather than being limited to one (although it could start with one and build from there).

The user who requested this suggested EventBridge → SNS → e-mail → Slack, an approach described here, which could be considered.

Definition of Done

  • [ ] New alerting system is implemented (ideally as a module) and it does not use PagerDuty
  • [ ] If applicable, tests are implemented
  • [ ] User docs have been updated
  • [ ] Another team member has reviewed
  • [ ] Pipeline runs green

ewastempel avatar Apr 17 '24 17:04 ewastempel

Is this potentially too broad? Is this ticket meant to cover the creation of a new alerting/notification module that we can use, or a one-off to cover alerting when certificates are reaching their expiration date that could later be extended to replace PagerDuty as a middleman?

Is this something that customers are presently empowered to do without us being involved?

dms1981 avatar May 09 '24 09:05 dms1981

As you noted in Slack, @ewastempel , maybe this is a better fit for enrolling with Observability Platform and getting the information through there?

dms1981 avatar May 09 '24 09:05 dms1981

🤔 For this ticket I'm thinking of creating a generic module (perhaps called modernisation-platform-aws-health-events) that creates an EventBridge rule to monitor AWS Health events and post them to an SNS topic, which can then be hooked into by email or perhaps even Slack with AWS ChatBot, as described here.

This would capture the needs of the user as certificate renewals are posted as health events but would also serve as a more generic tool for users to configure alerts for other important health events.

If possible perhaps the module could be configurable to point at particular services rather than all. We'll see
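
For illustration, a minimal Terraform sketch of the EventBridge → SNS wiring described above. It is not the module's actual code: the resource names, the topic name and the `health_services` variable are hypothetical, and the event pattern assumes AWS Health events arrive with source `aws.health` and detail type `AWS Health Event`.

```hcl
# Hypothetical input: AWS Health service codes to alert on, e.g. ["ACM"].
variable "health_services" {
  type    = list(string)
  default = ["ACM"]
}

# Topic that email or Slack subscribers can hook into (illustrative name).
resource "aws_sns_topic" "health_events" {
  name = "aws-health-events"
}

# Rule matching AWS Health events for the selected services only.
resource "aws_cloudwatch_event_rule" "health_events" {
  name        = "aws-health-events"
  description = "Forward AWS Health events for selected services"

  event_pattern = jsonencode({
    source        = ["aws.health"]
    "detail-type" = ["AWS Health Event"]
    detail = {
      service = var.health_services
    }
  })
}

# Send matched events to the SNS topic.
resource "aws_cloudwatch_event_target" "to_sns" {
  rule = aws_cloudwatch_event_rule.health_events.name
  arn  = aws_sns_topic.health_events.arn
}

# EventBridge needs explicit permission to publish to the topic.
data "aws_iam_policy_document" "allow_eventbridge" {
  statement {
    actions   = ["sns:Publish"]
    resources = [aws_sns_topic.health_events.arn]

    principals {
      type        = "Service"
      identifiers = ["events.amazonaws.com"]
    }
  }
}

resource "aws_sns_topic_policy" "health_events" {
  arn    = aws_sns_topic.health_events.arn
  policy = data.aws_iam_policy_document.allow_eventbridge.json
}
```

In this sketch, narrowing alerts to particular services is a matter of changing `health_services`. Matching every service would probably mean dropping the `detail` filter altogether rather than passing an empty list.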

richgreen-moj avatar Jul 09 '24 12:07 richgreen-moj

To answer @dms1981's question, this is to:

  1. Remove the need for PD to act as a middleman, as stated in the description and the DoD. In the absence of Observability Platform functionality, this can be implemented as a tf module.
  2. Fix the user's problem

@richgreen-moj I am not fully aware of AWS health check capabilities and limitations, so your plan sounds fine in theory, but I would like you to implement it in the above order (1, then 2). This means it should integrate with a variety of AWS services: networking, IAM, Lambda, certificates. So rather than resolving the problem for the user first, make sure it resolves the problem for us (we want to replace our existing alerting that uses PD with this new solution).

ewastempel avatar Jul 09 '24 16:07 ewastempel

Further to my previous comment: if 1 is not achievable, 2 shouldn't be implemented. However, 1 is still worth doing even if 2 cannot be achieved.

ewastempel avatar Jul 09 '24 16:07 ewastempel

So the current process is...

  1. Create SNS topic
  2. Create Alarm (which notifies topic)
  3. Integrate to PagerDuty (fun and games)

If the aim is to bypass PD then we could use AWS ChatBot, and I could use this TF resource in a module that can be called with a list of existing topics etc. that you want to get notified on via Slack.

I think there would be a manual element to setting up a chatbot slack client per account but once that's done you can do the rest in code.
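
As a rough sketch of what such a module might wrap, assuming the AWS provider's `aws_chatbot_slack_channel_configuration` resource (the thread does not confirm which resource was meant). All variable names, the role name and the configuration name are hypothetical, and this is not the code from the module itself.

```hcl
# Hypothetical inputs: existing topics to relay, plus the Slack identifiers
# obtained after the Slack client has been authorised manually in the account.
variable "sns_topic_arns" {
  type = list(string)
}

variable "slack_team_id" {
  type = string
}

variable "slack_channel_id" {
  type = string
}

# Role that AWS Chatbot assumes when delivering notifications.
data "aws_iam_policy_document" "chatbot_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["chatbot.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "chatbot" {
  name               = "chatbot-slack-notifications"
  assume_role_policy = data.aws_iam_policy_document.chatbot_assume.json
}

# Relay messages from the listed SNS topics to one Slack channel.
resource "aws_chatbot_slack_channel_configuration" "alerts" {
  configuration_name = "alerts-to-slack"
  iam_role_arn       = aws_iam_role.chatbot.arn
  slack_team_id      = var.slack_team_id
  slack_channel_id   = var.slack_channel_id
  sns_topic_arns     = var.sns_topic_arns
}
```

The existing SNS topics and alarms stay as they are. Only the PagerDuty integration step is replaced by the channel configuration.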

richgreen-moj avatar Jul 10 '24 07:07 richgreen-moj

This PR https://github.com/ministryofjustice/modernisation-platform-terraform-aws-chatbot/pull/1 provides the detail on the new AWS Chatbot module, which I have tested in Sprinkler. I have also written some unit tests for the module.

richgreen-moj avatar Jul 18 '24 15:07 richgreen-moj