
Clean Up telemetry-alerts notification

Open fbertsch opened this issue 7 years ago • 1 comment

I think there are too many notifications coming out of ATMO. For example, failing-job should probably be removed; it's just an expected failure. Lots of jobs fail every day, which makes it difficult to tell which failures are important and which aren't.

Maybe we can have some sort of tiered alerts, e.g.:

  1. We alert on the first failure after a success
  2. We alert on each Nth failure after the first failure (N=7 would mean once a week)
  3. We ensure follow-up on failure, and require job removal/deactivation after some Mth failure.
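
The three tiers above could be sketched roughly as follows. This is only an illustration, not ATMO code: `AlertPolicy`, `Action`, `consecutive_failures`, and the default values for N and M are hypothetical names and numbers, and the sketch assumes we can count how many times in a row a job has failed since its last success.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Action(Enum):
    NONE = "none"              # stay quiet
    ALERT = "alert"            # send a telemetry-alerts notification
    DEACTIVATE = "deactivate"  # require removal/deactivation of the job


@dataclass
class AlertPolicy:
    # Hypothetical defaults; N=7 means roughly once a week for a daily job.
    repeat_every: int = 7       # N
    deactivate_after: int = 30  # M (assumed value, not from the issue)

    def decide(self, failures_in_a_row: int) -> Action:
        """Map the current failure streak to an action."""
        if failures_in_a_row <= 0:
            return Action.NONE
        # Tier 3: the Mth failure (or later) forces deactivation.
        if failures_in_a_row >= self.deactivate_after:
            return Action.DEACTIVATE
        # Tier 1 and 2: the first failure after a success, then every
        # Nth failure after that one (streaks 1, 1+N, 1+2N, ...).
        if (failures_in_a_row - 1) % self.repeat_every == 0:
            return Action.ALERT
        return Action.NONE


def consecutive_failures(history: Sequence[bool]) -> int:
    """Count trailing failures in a run history (True means success)."""
    count = 0
    for succeeded in reversed(history):
        if succeeded:
            break
        count += 1
    return count
```

For instance, with `AlertPolicy(repeat_every=7, deactivate_after=30)` a daily job would alert on its 1st, 8th, 15th, ... consecutive failure and be flagged for deactivation on the 30th. Whether deactivation should replace or accompany the alert at that point is a design choice this sketch leaves open.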

fbertsch avatar Jul 14 '17 14:07 fbertsch

@fbertsch Ugh, lemme remove "failing-job" first, since that's silly to track after we found solutions to the issues with Celery.

The other ideas seem interesting, although they vary in complexity. We're effectively catering to a very small number of people (the job owner and telemetry-alerts subscribers) with this feature request, so I'd err on the side of simplicity rather than spend time implementing complex alerting schemes.

Also, since jobs run at different intervals (daily, weekly, monthly), I wonder if this really only matters for the daily jobs. For weekly or monthly jobs, only ever alerting once (or similar) could be a problem if people miss the mail.

Would having an "alert dashboard" on the site for admins (~telemetry-alerts subscribers) be useful? Basically an overview of all the recent job failures that would better fit into the maintenance work of the telemetry team?

jezdez avatar Sep 20 '17 15:09 jezdez