
Scheduler status visibility

Open sihil opened this issue 7 years ago • 4 comments

A scheduled job (see #476) can fail to kick off for a few reasons and it should be really obvious through the interface when this has happened. Similarly, if a scheduled job fails someone probably wants to hear about it.

We might:

  • Add a status dashboard that shows all failed scheduled jobs
  • Add a topic or other notification mechanism for letting people know about failed jobs.

sihil avatar Jan 18 '18 14:01 sihil

Could adding an email address field when scheduling the deploy be a simple way to get feedback in case of an issue?

alexduf avatar Jan 22 '18 10:01 alexduf

@nicl Now that we've had some use of this, shall we pick it up again on Monday to figure out what we need?

sihil avatar Feb 09 '18 19:02 sihil

From talking with @sihil: @adamnfish has also suggested sending e-mails on any failed deploy, which would also solve this issue (in a basic sense).

nicl avatar Feb 12 '18 12:02 nicl

In order to implement this we need some way of getting an e-mail address to notify. I suggest that we use Prism Owners for this (https://github.com/guardian/prism/blob/master/app/data/Owners.scala).

To use Prism owners we'd need to gather the set of SSAs being deployed, look them up in Prism and then actually fire off the e-mails. This means finding a place in the code where we can detect failure and also have access to the set of SSAs (or data that allows us to derive it). I suspect that this place is the DeployGroupRunner, which has the DeployContext (containing the parameters and the task graph) and also sees the failure events. Having said that, the task graph has lost easy access to the app and stack data, so this will likely be non-trivial.
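
A rough sketch of those three steps (gather the SSAs, look up owners, send the e-mails), written in Python only to keep it short. Nothing here is Riff-Raff or Prism code: the `SSA` shape (stack, stage, app), `lookup_owner`, `send_email` and the subject line are all assumptions made for illustration, and the real Prism lookup would be whatever its owners data actually exposes.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class SSA:
    """Hypothetical stack/stage/app triple identifying what is being deployed."""
    stack: Optional[str] = None
    stage: Optional[str] = None
    app: Optional[str] = None


def owners_for(ssas: Iterable[SSA],
               lookup_owner: Callable[[SSA], Optional[str]]) -> List[str]:
    """Return the distinct owner addresses for the given SSAs, sorted.

    lookup_owner stands in for a Prism owners query and returns None
    when no owner is recorded for an SSA.
    """
    owners = {lookup_owner(ssa) for ssa in ssas}
    return sorted(owner for owner in owners if owner)


def notify_failure(deploy_id: str,
                   ssas: Iterable[SSA],
                   lookup_owner: Callable[[SSA], Optional[str]],
                   send_email: Callable[[str, str], None]) -> List[str]:
    """Send one e-mail per distinct owner and return the recipients."""
    recipients = owners_for(ssas, lookup_owner)
    for recipient in recipients:
        send_email(recipient, f"Deploy {deploy_id} failed")
    return recipients
```

De-duplicating owners before sending means a deploy that touches several apps with the same owner produces one e-mail rather than one per app. SSAs with no recorded owner are silently skipped here; whether that should instead fall back to a default address is an open question.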

sihil avatar Feb 12 '18 13:02 sihil