Retry failed task placements before giving up

Open jakepruitt opened this issue 7 years ago • 3 comments

We should consider retrying on failed task placements a few seconds after the initial failure, just to confirm that the failed task placement is in fact a product of limited resources and not a result of sub-second race conditions in the scheduler.
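A minimal sketch of the idea, not Watchbot's actual code: the names `PlacementFailure`, `run_with_placement_retry`, `attempts`, and `delay_seconds` are all hypothetical, and the real scheduler may detect placement failures differently. It assumes the scheduler call raises a dedicated error when a task cannot be placed, waits a few seconds, and tries again before reporting the failure.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class PlacementFailure(Exception):
    """Hypothetical error: the scheduler could not place the task."""


def run_with_placement_retry(
    run_task: Callable[[], T],
    attempts: int = 3,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call run_task, retrying only on PlacementFailure.

    Any other exception propagates immediately, so genuine job
    failures still go through the usual retry and backoff path.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    last_error: Optional[PlacementFailure] = None
    for attempt in range(1, attempts + 1):
        try:
            return run_task()
        except PlacementFailure as err:
            last_error = err
            # Wait before the next try, but not after the final one.
            if attempt < attempts:
                sleep(delay_seconds)

    # Every attempt failed: most likely a real resource limit, not a race.
    assert last_error is not None
    raise last_error
```

If all attempts fail, the caller sees the original `PlacementFailure` and can decide how to hand the message back to the queue. The `sleep` parameter exists only so the delay can be replaced in tests.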

cc/ @brendanmcfarland @rclark

jakepruitt • Aug 31 '17 20:08

Watchbot's SQS-based try and retry system kinda sorta does this already. Is there an advantage to making a failed placement a special case and not just letting the usual retry + backoff routines handle it?

rclark • Aug 31 '17 23:08

@rclark I don't think we want failed task placements to wind things up in the dead letter queue. Failed task placements represent a structural limitation of the scheduler, and should be retried as close to the scheduler as possible (ideally inside of the scheduler, per chat with David Myers). These failures don't represent chronic failures of a particular payload, which is what the dead letter queue should be signaling.

jakepruitt • Oct 03 '17 21:10

The dead letter queue isn't supposed to represent chronically malformed or rejected payloads -- the idea is that SQS should never ever drop your job until it has been completed successfully. If the scheduler can't place a task for some number of attempts, then yeah -- there's some other limitation at play, but we definitely don't want the application to lose track of the work that it was supposed to get done.

rclark • Oct 03 '17 21:10