modernisation-platform

Alerts of Instance Scheduler Failures

Open julialawrence opened this issue 2 years ago • 3 comments

User Story

As a Modernisation Platform Engineer, I would like to be notified when the instance scheduler fails to shut down or start up instances in a member account, so that I can investigate and resolve issues proactively without requiring customers to raise support queries in the ask channel. Additionally, ensure that the Instance Scheduler provides comprehensible failure messages.

User Type(s)

Modernisation Platform Engineer, Modernisation Platform Customer

Value

This will allow us to see the effectiveness of our services and proactively identify and correct issues.

Proposal

Potentially integrate with PagerDuty via SNS

Definition of done

  • [ ] Approach decided
  • [ ] Alerts defined
  • [ ] readme has been updated
  • [ ] user docs have been updated
  • [ ] another team member has reviewed
  • [ ] tests are green
  • [ ] UR test OR added to continual research plan

Reference

How to write good user stories

julialawrence avatar Nov 24 '22 11:11 julialawrence

"Add the SNS arn via terraform to the DLQ" I cannot find a dead letter queue, I have the arn though (pro).

ep-93 avatar Dec 12 '22 09:12 ep-93

Parking until next year.

ep-93 avatar Dec 14 '22 08:12 ep-93

This issue is stale because it has been open 90 days with no activity.

github-actions[bot] avatar Mar 15 '23 01:03 github-actions[bot]

After reading the AWS documentation on Lambda asynchronous invocation and SNS topic subscription, it appears that the lambda Terraform module needs to accept SNS topic ARN(s) as input in order to link the instance scheduler to an SNS topic. It also looks like a dead-letter queue is a less informative approach to alerting on failure than configuring SNS topics as the Lambda's destination config. The implementation plan is therefore:

  1. Create the SNS topics (one for success and one for failure).
  2. Update the instance scheduler IAM policy so that it can publish to the SNS topics.
  3. Subscribe PagerDuty/Slack alerts to the SNS topics (more configuration may be needed here if success and failure alerts are to be posted to different Slack channels).
  4. Update the lambda Terraform module to consume the SNS topics (one for failure and one for success) and create a release.
  5. Update the instance scheduler to use the newly released lambda module.
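
As a rough illustration of steps 2 and 4, the sketch below builds the two documents involved as plain Python dictionaries: an IAM policy that allows `sns:Publish` on both topics, and a destination config that routes successful and failed asynchronous invocations to them. The ARNs, account ID, region, and helper names are placeholders invented for this example; the actual change would live in Terraform rather than Python.

```python
import json


def sns_publish_policy(topic_arns):
    """Build an IAM policy document allowing sns:Publish on the given topics."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sns:Publish",
                "Resource": list(topic_arns),
            }
        ],
    }


def destination_config(success_topic_arn, failure_topic_arn):
    """Build the destination settings for asynchronous Lambda invocations."""
    return {
        "OnSuccess": {"Destination": success_topic_arn},
        "OnFailure": {"Destination": failure_topic_arn},
    }


# Placeholder ARNs (assumed for illustration, not real resources).
SUCCESS_ARN = "arn:aws:sns:eu-west-2:111111111111:instance-scheduler-success"
FAILURE_ARN = "arn:aws:sns:eu-west-2:111111111111:instance-scheduler-failure"

policy = sns_publish_policy([SUCCESS_ARN, FAILURE_ARN])
config = destination_config(SUCCESS_ARN, FAILURE_ARN)

if __name__ == "__main__":
    print(json.dumps(policy, indent=2))
    print(json.dumps(config, indent=2))
```

The `config` dictionary has the same shape as the `DestinationConfig` argument of the Lambda `PutFunctionEventInvokeConfig` API, so it could be passed to that call for a quick manual test before the Terraform module is released.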

ewastempel avatar Dec 04 '23 18:12 ewastempel

This is implemented, but the alerting is not working as expected.

I manually added an email subscription for test purposes. No email arrived when the lambda was triggered, so I then manually published a message to the SNS topic. That message reached the inbox, but it did not appear in any Slack channel (I would expect it in low-priority-alarms). Therefore:

  • lambda -> SNS is not configured correctly
  • SNS -> PD/Slack is not configured correctly

The next step is to figure out why and to fix the configuration.

ewastempel avatar Dec 11 '23 17:12 ewastempel

The link between lambda and SNS appears to be working after all, but the SNS -> PD link still needs to be fixed.

ewastempel avatar Dec 18 '23 14:12 ewastempel

PD was unable to read the lambda's original return message that was directed to the SNS topics, as it has a format unsupported by CloudWatch. To overcome that, CloudWatch alarms were added to monitor the instance scheduler lambda function: if it fails (runs with errors) or gets throttled, an alarm posts a message to the SNS topics that were implemented earlier. The PagerDuty API can interpret that message, and it then alerts in the #modernisation-platform-low-priority-alarms channel.

This has been tested with the successful instance scheduler lambda function invocations.
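
For readers who want to see roughly what such alarms look like, here is a minimal sketch that builds the parameters for two CloudWatch alarms on the standard `AWS/Lambda` metrics `Errors` and `Throttles`, each notifying an SNS topic. The function name, topic ARN, period, threshold, and alarm naming are all assumptions made for this example and are not taken from the actual implementation.

```python
def lambda_alarm_params(function_name, metric_name, topic_arn):
    """Build CloudWatch alarm parameters for one Lambda metric."""
    return {
        "AlarmName": f"{function_name}-{metric_name.lower()}",
        "Namespace": "AWS/Lambda",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 300,  # assumed: evaluate in 5-minute windows
        "EvaluationPeriods": 1,
        "Threshold": 0,  # assumed: alarm on any error or throttle
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no data means no invocations
        "AlarmActions": [topic_arn],  # SNS topic that PagerDuty subscribes to
    }


# Hypothetical names, used only for illustration.
FUNCTION_NAME = "instance-scheduler"
TOPIC_ARN = "arn:aws:sns:eu-west-2:111111111111:instance-scheduler-failure"

alarms = [
    lambda_alarm_params(FUNCTION_NAME, metric, TOPIC_ARN)
    for metric in ("Errors", "Throttles")
]

if __name__ == "__main__":
    for alarm in alarms:
        print(alarm["AlarmName"], "->", alarm["AlarmActions"][0])
```

Each dictionary uses the parameter names of the CloudWatch `PutMetricAlarm` API, so the same values map directly onto a Terraform `aws_cloudwatch_metric_alarm` resource. Because the message reaching SNS is then a CloudWatch alarm notification, it is in a format that PagerDuty can parse.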

ewastempel avatar Dec 19 '23 16:12 ewastempel