ecs-watchbot
Alarm on percentage of failing workers
Background: For some services, you don’t need careful monitoring of worker errors if you have careful monitoring of the dead letter queue (DLQ). If worker errors don’t result in a message landing in the DLQ, that means they were retried successfully. For these services, the only worker error monitoring we’d want is for widespread failure across all workers. However, if the number of workers running at a given time is variable, this isn’t achievable with the current watchbot error alerting, which requires a static threshold.
Feature request: The ability to configure the error alarm with a percentage of failures would be great, e.g. “alarm when more than 75% of running jobs are failing.”
/cc @mapbox/platform
Is there an ideal period for such an alarm that'd be appropriate across systems with different avg. job durations?
In the statement

> alarm when more than 75% of running jobs are failing

... would you want "failing" to include or exclude jobs that were later retried successfully?
For reference, here is the current alarm that would be replaced by a percentage-based one:
https://github.com/mapbox/ecs-watchbot/blob/7aa0e89b7f2d00037dfc3d2452c3dc0e151da366/lib/template.js#L73-L76
https://github.com/mapbox/ecs-watchbot/blob/7aa0e89b7f2d00037dfc3d2452c3dc0e151da366/lib/template.js#L673-L688
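For readers without the links handy, here is a simplified sketch of the shape of that alarm, written as the kind of resource object template.js builds. The namespace, metric name, and the `ErrorThreshold` parameter name are assumptions for illustration, not values copied from the linked lines.

```js
// Simplified sketch only: namespace, metric name, and the ErrorThreshold
// parameter name are assumptions, not copied from template.js.
module.exports = {
  WorkerErrors: {
    Type: 'AWS::CloudWatch::Alarm',
    Properties: {
      AlarmDescription: 'Worker errors exceeded the configured static threshold',
      Namespace: 'Mapbox/ecs-watchbot',
      MetricName: 'WorkerErrors',
      Statistic: 'Sum',
      Period: 60, // 1 minute
      EvaluationPeriods: 1,
      Threshold: { Ref: 'ErrorThreshold' }, // static, user-configured error count
      ComparisonOperator: 'GreaterThanThreshold'
    }
  }
};
```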
> Is there an ideal period for such an alarm that'd be appropriate across systems with different avg. job durations?
I guess this depends on how many jobs finish per period. As seen above, the current alarm uses a period of 1 minute. We could use the same period here.
> alarm when more than 75% of running jobs are failing
I actually think we'd want 100% of running jobs failing as the detection metric. This would show that there is a systemic error that makes all jobs fail.
> ... would you want "failing" to include or exclude jobs that were later retried successfully?
For jobs that continuously fail and never succeed through retries, there is the dead letter queue (and an alarm on it). The idea of this alarm is to alert faster when there is a systemic problem.
One thing to note: we might have to tune this alarm not to trigger when very few tasks ran in the period (e.g. one task finished, it failed, the alarm fires). Not sure if this is something to protect against.
So pragmatically, which metrics do we compose to build this ratio? I think it would be:
- SQS.NumberOfMessagesReceived = counts messages that have been handed to a worker
- Watchbot.WorkerErrors = counts errors in watchbot worker scripts
So roughly worker errors / messages received, I guess? (There's a rough metric math sketch of this below.)
Note that the worker errors metric wouldn't cover watcher failures, though my hunch is that those are quite rare and maybe not what this alarm is trying to capture.
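To make the ratio concrete, here is a rough sketch of how the composed alarm could be expressed with CloudWatch metric math (the `Metrics` property on `AWS::CloudWatch::Alarm`). The namespace, the `WorkerErrors` metric name, and the `Queue` logical id are assumptions, not the exact identifiers from template.js; the 100% threshold follows the discussion above, and the `IF()` guard is one illustrative way to handle the low-task-count concern.

```js
// Rough sketch of a percentage-based alarm using CloudWatch metric math.
// Namespace, metric name, and the "Queue" logical id are assumptions.
module.exports = {
  WorkerErrorRatio: {
    Type: 'AWS::CloudWatch::Alarm',
    Properties: {
      AlarmDescription: 'All jobs handed to workers in the last minute failed',
      ComparisonOperator: 'GreaterThanOrEqualToThreshold',
      Threshold: 100, // "100% of running jobs are failing", per the discussion above
      EvaluationPeriods: 1,
      TreatMissingData: 'notBreaching',
      Metrics: [
        {
          Id: 'errors',
          ReturnData: false,
          MetricStat: {
            Metric: {
              Namespace: 'Mapbox/ecs-watchbot', // assumed custom namespace
              MetricName: 'WorkerErrors'        // assumed metric name
            },
            Period: 60, // same 1 minute period as the current alarm
            Stat: 'Sum'
          }
        },
        {
          Id: 'received',
          ReturnData: false,
          MetricStat: {
            Metric: {
              Namespace: 'AWS/SQS',
              MetricName: 'NumberOfMessagesReceived',
              Dimensions: [
                { Name: 'QueueName', Value: { 'Fn::GetAtt': ['Queue', 'QueueName'] } }
              ]
            },
            Period: 60,
            Stat: 'Sum'
          }
        },
        {
          // Guard against the "one task finished, it failed" case by requiring a
          // minimum number of received messages before the ratio counts.
          Id: 'failureRate',
          Expression: 'IF(received >= 5, 100 * errors / received, 0)',
          Label: 'Percent of received messages that errored',
          ReturnData: true
        }
      ]
    }
  }
};
```

The minimum of 5 received messages is illustrative; whether to protect against tiny sample sizes at all is the open question above.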