modernisation-platform
Alerts of Instance Scheduler Failures
User Story
As a Modernisation Platform Engineer, I would like to be notified if the instance scheduler has failed to shut down or start up instances in a member account, so that I can investigate and resolve issues proactively without requiring customers to raise support queries in the ask channel. Additionally, ensure that the Instance Scheduler provides comprehensible failure messages.
User Type(s)
Modernisation Platform Engineer, Modernisation Platform Customer
Value
This will allow us to see the effectiveness of our services and proactively identify and correct issues.
Proposal
Potentially integrate with PagerDuty via SNS
Definition of done
- [ ] Approach decided
- [ ] Alerts defined
- [ ] readme has been updated
- [ ] user docs have been updated
- [ ] another team member has reviewed
- [ ] tests are green
- [ ] UR test OR added to continual research plan
Reference
"Add the SNS ARN via Terraform to the DLQ": I cannot find a dead letter queue, although I do have the ARN (pro).
Parking until next year.
This issue is stale because it has been open 90 days with no activity.
After some reading of the AWS documentation on Lambda asynchronous invocation and SNS topic subscription, it appears that the Lambda Terraform module needs to accept SNS topic ARN(s) as input in order to link the instance scheduler to an SNS topic. It also looks like a dead letter queue is a less informative approach to alerting on failure than configuring SNS topics in the Lambda's destination config. The implementation plan is therefore:
- Create SNS topics (one for success and one for failure).
- Update the instance scheduler IAM policy so that it can publish to the SNS topics.
- Subscribe PagerDuty/Slack alerts to the SNS topics (more configuration may be needed here if success and failure alerts are to post to different Slack channels).
- Update the Lambda Terraform module to consume SNS topics (one for failure and one for success) and create a release.
- Update the instance scheduler to use the newly released lambda module.
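The plan above was carried out in Terraform; the following Python sketch only illustrates the shape of the two pieces of configuration involved, using boto3 request parameters. The function name and topic ARNs are hypothetical placeholders, and the helper names are invented for this example. It assumes the Lambda is invoked asynchronously, since destinations only apply to asynchronous invocations.

```python
from typing import Any, Dict, Iterable


def build_destination_config(
    function_name: str, success_topic_arn: str, failure_topic_arn: str
) -> Dict[str, Any]:
    """Build kwargs for the Lambda client's put_function_event_invoke_config.

    Routes the result of each asynchronous invocation to one of two SNS topics.
    """
    return {
        "FunctionName": function_name,
        "DestinationConfig": {
            "OnSuccess": {"Destination": success_topic_arn},
            "OnFailure": {"Destination": failure_topic_arn},
        },
    }


def build_publish_policy(topic_arns: Iterable[str]) -> Dict[str, Any]:
    """Build an IAM policy document letting the Lambda role publish to the topics."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sns:Publish",
                "Resource": list(topic_arns),
            }
        ],
    }


def apply_destination_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Send the configuration to AWS (needs boto3 and valid credentials)."""
    import boto3  # imported lazily so the builders above work without boto3

    return boto3.client("lambda").put_function_event_invoke_config(**config)


# Hypothetical placeholder values, not taken from the real accounts.
SUCCESS_ARN = "arn:aws:sns:eu-west-2:111111111111:scheduler-success"
FAILURE_ARN = "arn:aws:sns:eu-west-2:111111111111:scheduler-failure"
config = build_destination_config("instance-scheduler", SUCCESS_ARN, FAILURE_ARN)
policy = build_publish_policy([SUCCESS_ARN, FAILURE_ARN])
```

Keeping the two topics separate is what later allows success and failure notifications to be routed to different subscribers.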
This is implemented, but the alerting is not working as expected.
I manually added an email subscription for test purposes. No email arrived when the Lambda was triggered, so I then manually published a message to the SNS topic. That message did end up in the inbox, but it did not end up in any Slack channel (I would expect it to be in low-priority-alarms). Therefore:
- lambda -> SNS is not configured right
- SNS -> PD/slack is not configured right
The next step is to figure out why and to fix the configuration.
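The manual publish used to isolate the SNS hop could be scripted roughly as below. This is a sketch, not the procedure actually used: the topic ARN is a hypothetical placeholder, the helper names are invented for this example, and the marker string is just one way to make the test message easy to spot in an inbox or channel.

```python
import uuid
from typing import Any, Dict


def build_test_publish(topic_arn: str, marker: str) -> Dict[str, Any]:
    """Build kwargs for the SNS client's publish call with a recognisable marker."""
    return {
        "TopicArn": topic_arn,
        "Subject": f"Instance scheduler alert test {marker}",
        "Message": (
            f"Manual test message {marker}. If this reaches email but not "
            "Slack, the SNS to PagerDuty/Slack hop needs attention."
        ),
    }


def send_test_publish(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Publish the test message (needs boto3 and valid credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    return boto3.client("sns").publish(**kwargs)


# Hypothetical placeholder ARN; a short random marker identifies this test run.
marker = uuid.uuid4().hex[:8]
kwargs = build_test_publish(
    "arn:aws:sns:eu-west-2:111111111111:scheduler-failure", marker
)
```

If the marked message arrives by email, the topic and its subscriptions are delivering, which narrows the fault to whichever subscriber did not receive it.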
The link between lambda and SNS appears to be working after all, but the SNS -> PD is still to be fixed.
PD was unable to read the Lambda's original return message that was sent to the SNS topics, as it has a format unsupported by CloudWatch. To overcome that, CloudWatch alarms were added to monitor the instance scheduler Lambda function: if the function fails (runs with errors) or gets throttled, an alarm posts a message to the SNS topics that were implemented earlier. The PagerDuty API can interpret this message, and it in turn alerts in the #modernisation-platform-low-priority-alarms channel.
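As an illustration of the alarm approach described above, the sketch below builds the parameters for two CloudWatch alarms on the standard `AWS/Lambda` metrics `Errors` and `Throttles`, each notifying an SNS topic. The real alarms were defined in Terraform; the function name, topic ARN, period, threshold, alarm naming, and helper names here are all assumptions chosen for the example.

```python
from typing import Any, Dict, List


def build_lambda_alarm(
    function_name: str, metric_name: str, topic_arn: str
) -> Dict[str, Any]:
    """Build kwargs for the CloudWatch client's put_metric_alarm.

    The alarm fires when the metric's sum over one period exceeds zero.
    """
    return {
        "AlarmName": f"{function_name}-{metric_name.lower()}",
        "Namespace": "AWS/Lambda",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 300,  # seconds; assumed value
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no datapoints means no failure
        "AlarmActions": [topic_arn],
    }


def create_alarms(alarms: List[Dict[str, Any]]) -> None:
    """Create the alarms in AWS (needs boto3 and valid credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    client = boto3.client("cloudwatch")
    for alarm in alarms:
        client.put_metric_alarm(**alarm)


# Hypothetical placeholder ARN and function name.
TOPIC_ARN = "arn:aws:sns:eu-west-2:111111111111:scheduler-failure"
alarms = [
    build_lambda_alarm("instance-scheduler", metric, TOPIC_ARN)
    for metric in ("Errors", "Throttles")
]
```

Because the notification is then produced by CloudWatch rather than by the Lambda itself, the message arrives in the alarm format that the downstream integration is reported above to interpret.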
This has been tested with successful instance scheduler Lambda function invocations.