ecs-watchbot
Expose SQS visibility timeout to CloudFormation template.
In the event that a task is externally terminated, a fixed 180 seconds must pass before the work item is visible again, which may be too long for some applications. This permits watchbot apps to define their own visibility timeout for work deadlines.
/cc @rclark @jakepruitt /fyi @danpat
yo, @rclark @jakepruitt I have a question about how visibility timeout interacts with our backoff strategy here. Specifically this block:
https://github.com/mapbox/ecs-watchbot/blob/master/lib/message.js#L79..L86
- When doing backoff, should we multiply the backoff interval by the original visibility timeout? i.e., the new visibility timeout on retries should instead be `Math.min(MAX_VISIBILITY_TIMEOUT, defaultVisibilityTimeout * Math.pow(2, num_receives))`. Otherwise, starting with the current 180-second default visibility timeout, the new visibility timeout after a retry (with `num_receives = 1`) will be less than the visibility timeout from the first watchbot execution, which could trigger a duplicate task (see the sketch after this list).
- There is a comment in that block about the visibility timeout being `< 2^15` because of AWS rate limits, but that appears not to be the case: `2^15 sec / 3600 sec/hour ≈ 9 hours`, while the actual maximum is 12 hours. Regardless, we should check that `defaultVisibilityTimeout * Math.pow(2, num_receives) < MAX_VISIBILITY_TIMEOUT`.
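For concreteness, a minimal sketch of that capped backoff (the helper name `retryVisibilityTimeout` is hypothetical, and `MAX_VISIBILITY_TIMEOUT` here stands in for SQS's 12-hour ceiling rather than whatever constant the code actually uses):

```js
// Hypothetical sketch, not the actual ecs-watchbot implementation.
const MAX_VISIBILITY_TIMEOUT = 12 * 60 * 60; // SQS maximum: 12 hours, in seconds

function retryVisibilityTimeout(defaultVisibilityTimeout, numReceives) {
  // Scale the configured timeout by 2^numReceives, capped at the SQS maximum,
  // so a retry is never hidden for less time than the original attempt was.
  return Math.min(
    MAX_VISIBILITY_TIMEOUT,
    defaultVisibilityTimeout * Math.pow(2, numReceives)
  );
}

// With the current 180 s default: 360 s on the first retry, 720 s on the second, ...
console.log(retryVisibilityTimeout(180, 1)); // 360
console.log(retryVisibilityTimeout(180, 2)); // 720
```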
If those points make sense, let me know and I can add them to this PR (or make another one).
Or maybe derive the heartbeat intervals from the `maxJobDuration` option?
@rclark oh yeah, I agree that `visibilityTimeout` is actually an implementation detail, and what we really should expose here is the configuration around heartbeats. I'll have a think about it.
Did you have any thoughts on how the backoff interval is computed, and whether that should take the visibility timeout into account?
The interval doesn't need to be multiplied, because the `retry()` function is only ever called after a worker process has errored. Since the first one failed, there's no risk of overlapping processing.
As for the 2^14: as the code comment suggests, the reason for it never quite got elucidated here. However, I think this is actually a heartbeat/retry mixup and a lurking bug.
In this cryptic AWS documentation:

> The new timeout period takes effect from the time you call the `ChangeMessageVisibility` action.
So say you receive a message with a 3-minute timeout. 2 minutes later, you call the API to set the message visibility to 2 minutes, like we do with the heartbeat. The message is now invisible for 4 minutes in total. Since the invisibility accumulates, you can't get to 15 heartbeats (2^14 + 2^15 seconds > 12 hours). But we don't check this at all when we're doing our heartbeats.
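Rough arithmetic behind that claim, assuming the n-th heartbeat sets the visibility timeout to 2^n seconds measured from the moment of the call (an illustration of the accumulation only, not the actual heartbeat code):

```js
// Illustration only: running total of invisibility if heartbeat n extends the
// timeout to 2^n seconds from the time ChangeMessageVisibility is called.
const SQS_MAX_TOTAL = 12 * 60 * 60; // 43200 s cumulative ceiling per received message

let totalInvisible = 180; // initial 180 s visibility timeout
for (let n = 1; n <= 15; n++) {
  totalInvisible += Math.pow(2, n);
  if (totalInvisible > SQS_MAX_TOTAL) {
    console.log(`heartbeat ${n} pushes total invisibility to ${totalInvisible}s > ${SQS_MAX_TOTAL}s`);
    break; // trips at n = 15: 2^14 + 2^15 alone already exceed 12 hours
  }
}
```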
We do check it on message `retry()`, which only happens after an explicit worker error. In that case we should be able to get to 2^15. That said, should we? The default number of receives before a drop to the dead letter queue is 10, and I might advocate we cap it there. If it failed 10 times, something else is probably wrong. And we don't know how many times we've heartbeated the thing anyways.
As for bugs: when we call `retry()` we may actually have already heartbeated the timeout to its maximum value, or even errored in an unhandled fashion on several `setInterval` ticks. In that case the call to `ChangeMessageVisibility` would fail even on receive count = 1.
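If that is what's happening, one possible mitigation (just a sketch against the plain aws-sdk v2 SQS client, not the actual `lib/message.js` code, and the helper name is hypothetical) would be to treat a rejected visibility change as non-fatal:

```js
const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

// Hypothetical helper: attempt to change the message's visibility, but don't let
// a rejection (e.g. because the cumulative 12-hour limit was already reached via
// heartbeats) blow up the retry path.
function tryChangeVisibility(queueUrl, receiptHandle, seconds, callback) {
  sqs.changeMessageVisibility({
    QueueUrl: queueUrl,
    ReceiptHandle: receiptHandle,
    VisibilityTimeout: seconds
  }, (err) => {
    if (err) return callback(null, false); // swallow the error, report failure
    callback(null, true);
  });
}
```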
Stale - going to close this one out.