
Alert authors of downstream jobs when a scheduled job has failed

Open Rob-Johnson opened this issue 7 years ago • 3 comments

I'm proposing that when a job fails (either because the run itself fails or because the scheduler doesn't run it as expected), we should also notify the owners of all downstream jobs. I think we'll want to go all the way down the DAG. I see this as an extension of the monitoring we recently added for jobs that didn't run on time; the only difference is that the cause is an upstream job failure rather than the schedule.

This is more of an RFC: @oktopuz @keymone @solarkennedy what do you think? If you agree, I'll work on a PR.

Rob-Johnson avatar Mar 08 '17 09:03 Rob-Johnson

What if we traversed the DAG and included the result in the alert output, to remind people which other jobs this blocks? (Instead of alerting on every downstream job and flooding people's inboxes.)
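
A rough sketch of that idea, assuming the job dependencies are available as a plain mapping from each job name to its direct children. The names `downstream_jobs`, `format_alert`, and the `children` mapping are hypothetical and not part of paasta; the real scheduler config would need to be adapted into this shape first.

```python
from collections import deque
from typing import Dict, List


def downstream_jobs(children: Dict[str, List[str]], failed_job: str) -> List[str]:
    """Return every job transitively downstream of failed_job, in BFS order.

    `children` is a hypothetical adjacency map: job name -> direct child jobs.
    """
    seen = {failed_job}
    order: List[str] = []
    queue = deque([failed_job])
    while queue:
        job = queue.popleft()
        for child in children.get(job, []):
            # A job reachable through several parents is listed only once.
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


def format_alert(children: Dict[str, List[str]], failed_job: str) -> str:
    """Build one alert message that lists the jobs blocked by the failure."""
    blocked = downstream_jobs(children, failed_job)
    if not blocked:
        return f"Job {failed_job} failed. No downstream jobs are blocked."
    return f"Job {failed_job} failed. Blocked downstream jobs: " + ", ".join(blocked)
```

This sends a single alert for the failed job and lists what it blocks, rather than sending one alert per downstream job.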

solarkennedy avatar Mar 08 '17 17:03 solarkennedy

How often do downstream jobs belong to somebody other than the owners of the root set? Also, I think it's sufficient to notify once on failure of the complete DAG rather than individually per DAG node.

mks-m avatar Mar 12 '17 00:03 mks-m

@keymone it was mentioned by one of the ads teams that they often had this issue.

@solarkennedy I'm fine with that as a less noisy solution.

Rob-Johnson avatar Mar 13 '17 14:03 Rob-Johnson