riff-raff
Scheduler status visibility
A scheduled job (see #476) can fail to kick off for a few reasons, and it should be really obvious through the interface when this has happened. Similarly, if a scheduled job fails, someone probably wants to hear about it.
We might:
- Add a status dashboard that shows all failed scheduled jobs.
- Add a topic or other notification mechanism for letting people know about failed jobs.
Could adding an email address field when scheduling the deploy be a simple way to get feedback in case of an issue?
@nicl Now that we've had some use of this, shall we pick it up again on Monday to figure out what we need?
From talking with @sihil: @adamnfish has also suggested sending emails on any failed deploy, which would also solve this issue (in a basic sense).
To implement this, we need some way of getting an email address to notify. I suggest that we use Prism Owners for this (https://github.com/guardian/prism/blob/master/app/data/Owners.scala).
To use Prism owners we'd need to gather the set of SSAs being deployed, look them up in Prism and then actually fire off the emails. This means finding a place in the code where we can detect failure and also have access to the set of SSAs (or data that allows us to derive it). I suspect that this place is the `DeployGroupRunner`, which has the `DeployContext` (which contains the parameters and the task graph) and also sees the failure events. Having said that, the task graph has lost easy access to the app and stack data, so this will likely be non-trivial.
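To make the three steps above concrete (gather the SSAs, look up owners, send the emails), here is a rough sketch of the shape this could take. Everything in it is hypothetical: `SSA` as a stack/stage/app triple, `OwnerLookup`, `Mailer` and `DeployFailed` are stand-ins invented for illustration, not the real Riff-Raff or Prism types, and it assumes the SSAs have already been derived by the time the failure is seen.

```scala
object FailedDeployNotifier {
  // Assumption: an SSA identifies what was deployed by stack, stage and app.
  final case class SSA(stack: String, stage: String, app: String)

  // Hypothetical stand-in for an owner lookup backed by Prism.
  trait OwnerLookup {
    def ownersFor(ssa: SSA): Set[String] // owner email addresses
  }

  // Hypothetical stand-in for whatever actually sends the mail.
  trait Mailer {
    def send(to: String, subject: String, body: String): Unit
  }

  // Hypothetical failure event that already carries the SSAs for the deploy.
  final case class DeployFailed(deployId: String, ssas: Set[SSA], reason: String)

  // Collect the distinct owner addresses across all SSAs in the deploy.
  def recipients(ssas: Set[SSA], lookup: OwnerLookup): Set[String] =
    ssas.flatMap(lookup.ownersFor)

  // Send one email per distinct owner and return who was notified.
  def notifyFailure(event: DeployFailed, lookup: OwnerLookup, mailer: Mailer): Set[String] = {
    val to = recipients(event.ssas, lookup)
    to.foreach { address =>
      mailer.send(address, s"Deploy ${event.deployId} failed", event.reason)
    }
    to
  }
}
```

Returning the set of notified addresses keeps the function easy to test and would let the caller log who was told. The hard part noted above, deriving the SSAs from the task graph, is deliberately left outside this sketch.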