ecs-watchbot
Alarm on percentage of failing workers
Background: For some services, you don’t need careful monitoring of worker errors if you have careful monitoring of the dead letter queue (DLQ). If worker errors don’t result in a message landing in the DLQ, that means they were retried successfully. For these services, the only worker error monitoring we’d want is for widespread failure across all workers. However, if the number of workers running at a given time is variable, this isn’t achievable with the current watchbot error alerting, which requires a static threshold.
Feature request: The ability to configure the error alarm with a percentage of failures would be great, e.g. “alarm when more than 75% of running jobs are failing.”
/cc @mapbox/platform
Is there an ideal period for such an alarm that'd be appropriate across systems with different avg. job durations?
In the statement

> alarm when more than 75% of running jobs are failing

... would you want "failing" to include or exclude jobs that were later retried successfully?
For reference, here is the current alarm that would be replaced by a percentage-based one:
https://github.com/mapbox/ecs-watchbot/blob/7aa0e89b7f2d00037dfc3d2452c3dc0e151da366/lib/template.js#L73-L76
https://github.com/mapbox/ecs-watchbot/blob/7aa0e89b7f2d00037dfc3d2452c3dc0e151da366/lib/template.js#L673-L688
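For readers without the links handy, here is a simplified sketch of the shape of that alarm, written as the kind of resource object template.js builds. The namespace, metric name, and the `ErrorThreshold` parameter name are assumptions for illustration, not values copied from the linked lines.

```js
// Simplified sketch only: namespace, metric name, and the ErrorThreshold
// parameter name are assumptions, not copied from template.js.
module.exports = {
  WorkerErrors: {
    Type: 'AWS::CloudWatch::Alarm',
    Properties: {
      AlarmDescription: 'Worker errors exceeded the configured static threshold',
      Namespace: 'Mapbox/ecs-watchbot',
      MetricName: 'WorkerErrors',
      Statistic: 'Sum',
      Period: 60, // 1 minute
      EvaluationPeriods: 1,
      Threshold: { Ref: 'ErrorThreshold' }, // static, user-configured error count
      ComparisonOperator: 'GreaterThanThreshold'
    }
  }
};
```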
> Is there an ideal period for such an alarm that'd be appropriate across systems with different avg. job durations?
I guess this depends on how many jobs finish per period. As seen above, the current alarm uses a period of 1 minute. We could use the same period here.
> alarm when more than 75% of running jobs are failing
I actually think we'd want 100% of running jobs failing as the detection metric. This would show that there is a systemic error that makes all jobs fail.
> ... would you want "failing" to include or exclude jobs that were later retried successfully?
For jobs that continuously fail and never succeed through retries, there is the dead letter queue (and an alarm on it). The idea of this alarm is to alert faster when there is a systemic problem.
One thing to note: we might have to tune this alarm not to trigger when very few tasks ran in the period (e.g. one task finished, it failed, the alarm fires). Not sure if this is something to protect against.
So pragmatically, which metrics do we compose to build this ratio? I think it would be:
- SQS.NumberOfMessagesReceived = counts messages that have been handed to a worker
- Watchbot.WorkerErrors = counts errors in watchbot worker scripts
So roughly worker errors / messages received, I guess? (There's a rough metric math sketch of this below.)
Note that the worker errors metric wouldn't cover watcher failures, though my hunch is that those are quite rare and maybe not what this alarm is trying to capture.
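To make the ratio concrete, here is a rough sketch of how the composed alarm could be expressed with CloudWatch metric math (the `Metrics` property on `AWS::CloudWatch::Alarm`). The namespace, the `WorkerErrors` metric name, and the `Queue` logical id are assumptions, not the exact identifiers from template.js; the 100% threshold follows the discussion above, and the `IF()` guard is one illustrative way to handle the low-task-count concern.

```js
// Rough sketch of a percentage-based alarm using CloudWatch metric math.
// Namespace, metric name, and the "Queue" logical id are assumptions.
module.exports = {
  WorkerErrorRatio: {
    Type: 'AWS::CloudWatch::Alarm',
    Properties: {
      AlarmDescription: 'All jobs handed to workers in the last minute failed',
      ComparisonOperator: 'GreaterThanOrEqualToThreshold',
      Threshold: 100, // "100% of running jobs are failing", per the discussion above
      EvaluationPeriods: 1,
      TreatMissingData: 'notBreaching',
      Metrics: [
        {
          Id: 'errors',
          ReturnData: false,
          MetricStat: {
            Metric: {
              Namespace: 'Mapbox/ecs-watchbot', // assumed custom namespace
              MetricName: 'WorkerErrors'        // assumed metric name
            },
            Period: 60, // same 1 minute period as the current alarm
            Stat: 'Sum'
          }
        },
        {
          Id: 'received',
          ReturnData: false,
          MetricStat: {
            Metric: {
              Namespace: 'AWS/SQS',
              MetricName: 'NumberOfMessagesReceived',
              Dimensions: [
                { Name: 'QueueName', Value: { 'Fn::GetAtt': ['Queue', 'QueueName'] } }
              ]
            },
            Period: 60,
            Stat: 'Sum'
          }
        },
        {
          // Guard against the "one task finished, it failed" case by requiring a
          // minimum number of received messages before the ratio counts.
          Id: 'failureRate',
          Expression: 'IF(received >= 5, 100 * errors / received, 0)',
          Label: 'Percent of received messages that errored',
          ReturnData: true
        }
      ]
    }
  }
};
```

The minimum of 5 received messages is illustrative; whether to protect against tiny sample sizes at all is the open question above.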