ecs-watchbot
Expose SQS visibility timeout to CloudFormation template.
In the event that a task is externally terminated, a fixed 180 seconds must pass before the work item is visible again, which may be too long for some applications. This permits watchbot apps to define their own visibility timeout for work deadlines.
/cc @rclark @jakepruitt /fyi @danpat
yo, @rclark @jakepruitt I have a question about how visibility timeout interacts with our backoff strategy here. Specifically this block:
https://github.com/mapbox/ecs-watchbot/blob/master/lib/message.js#L79..L86
- When doing backoff, should we multiply the backoff interval by the original visibility timeout? i.e., the new visibility timeout on retries should instead be `Math.min(MAX_VISIBILITY_TIMEOUT, defaultVisibilityTimeout * Math.pow(2, num_receives))`. Otherwise, starting with the current 180-second default visibility timeout, the new visibility timeout after a retry (with `num_receives = 1`) will be less than the visibility timeout from the first watchbot execution, which could trigger a duplicate task (see the sketch after this list).
- There is a comment in that block about the visibility timeout being `< 2^15` because of AWS rate limits, but that appears not to be the case: `2^15 sec / 3600 sec/hour ≈ 9 hours`, while the actual maximum is 12 hours. Regardless, we should check that `defaultVisibilityTimeout * Math.pow(2, num_receives) < MAX_VISIBILITY_TIMEOUT`.
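For concreteness, a minimal sketch of that capped backoff (the helper name `retryVisibilityTimeout` is hypothetical, and `MAX_VISIBILITY_TIMEOUT` here stands in for SQS's 12-hour ceiling rather than whatever constant the code actually uses):

```js
// Hypothetical sketch, not the actual ecs-watchbot implementation.
const MAX_VISIBILITY_TIMEOUT = 12 * 60 * 60; // SQS maximum: 12 hours, in seconds

function retryVisibilityTimeout(defaultVisibilityTimeout, numReceives) {
  // Scale the configured timeout by 2^numReceives, capped at the SQS maximum,
  // so a retry is never hidden for less time than the original attempt was.
  return Math.min(
    MAX_VISIBILITY_TIMEOUT,
    defaultVisibilityTimeout * Math.pow(2, numReceives)
  );
}

// With the current 180 s default: 360 s on the first retry, 720 s on the second, ...
console.log(retryVisibilityTimeout(180, 1)); // 360
console.log(retryVisibilityTimeout(180, 2)); // 720
```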
If those points make sense, let me know and I can add them to this PR (or make another one).
Or maybe derive the heartbeat intervals from the `maxJobDuration` option?
@rclark oh yeah, I agree that `visibilityTimeout` is actually an implementation detail, and what we really should expose here is the configuration around heartbeats. I'll have a think about it.
Did you have any thoughts on how the backoff interval is computed, and whether that should take the visibility timeout into account?
The interval doesn't need to be multiplied, because the `retry()` function is only ever called after a worker process has errored. Since the first one failed, there's no risk of overlapping processing.
As for the 2^14: as the code comment suggests, the reason for it never quite got elucidated here. However, I think this is actually a heartbeat/retry mixup and a lurking bug.
In this cryptic AWS documentation:

> The new timeout period takes effect from the time you call the `ChangeMessageVisibility` action.
So say you receive a message with a 3-minute timeout. 2 minutes later, you call the API to set the message visibility to 2 minutes, like we do with the heartbeat. The message is now invisible for 4 minutes in total. Since the invisibility accumulates, you can't get to 15 heartbeats (2^14 + 2^15 seconds > 12 hours). But we don't check this at all when we're doing our heartbeats.
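Rough arithmetic behind that claim, assuming the n-th heartbeat sets the visibility timeout to 2^n seconds measured from the moment of the call (an illustration of the accumulation only, not the actual heartbeat code):

```js
// Illustration only: running total of invisibility if heartbeat n extends the
// timeout to 2^n seconds from the time ChangeMessageVisibility is called.
const SQS_MAX_TOTAL = 12 * 60 * 60; // 43200 s cumulative ceiling per received message

let totalInvisible = 180; // initial 180 s visibility timeout
for (let n = 1; n <= 15; n++) {
  totalInvisible += Math.pow(2, n);
  if (totalInvisible > SQS_MAX_TOTAL) {
    console.log(`heartbeat ${n} pushes total invisibility to ${totalInvisible}s > ${SQS_MAX_TOTAL}s`);
    break; // trips at n = 15: 2^14 + 2^15 alone already exceed 12 hours
  }
}
```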
We do check it on message `retry()`, which only happens after an explicit worker error. In that case we should be able to get to 2^15. That said, should we? The default number of receives before a drop to the dead letter queue is 10, and I might advocate we cap it there. If it failed 10 times, something else is probably wrong. And we don't know how many times we've heartbeated the thing anyways.
As for bugs: when we call `retry()` we may actually have already heartbeated the timeout to its maximum value, or even errored in an unhandled fashion on several `setInterval` ticks. In that case the call to `ChangeMessageVisibility` would fail even on receive count = 1.
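If that is what's happening, one possible mitigation (just a sketch against the plain aws-sdk v2 SQS client, not the actual `lib/message.js` code, and the helper name is hypothetical) would be to treat a rejected visibility change as non-fatal:

```js
const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

// Hypothetical helper: attempt to change the message's visibility, but don't let
// a rejection (e.g. because the cumulative 12-hour limit was already reached via
// heartbeats) blow up the retry path.
function tryChangeVisibility(queueUrl, receiptHandle, seconds, callback) {
  sqs.changeMessageVisibility({
    QueueUrl: queueUrl,
    ReceiptHandle: receiptHandle,
    VisibilityTimeout: seconds
  }, (err) => {
    if (err) return callback(null, false); // swallow the error, report failure
    callback(null, true);
  });
}
```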
Stale - going to close this one out.