Ryan Clark

61 comments by Ryan Clark

Hi @KeithYJohnson -- that is something we entirely rely on SQS to manage for us. [Check out their docs here](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-visibility-timeout.html).
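For readers who haven't used the visibility timeout, here is a toy in-memory model of the behavior the linked docs describe: a received message is hidden for a fixed period and becomes receivable again unless it is deleted first. This is a sketch only. `ToyQueue` and its methods are made-up names for illustration, not the AWS SDK and not Watchbot code.

```javascript
// Toy in-memory model of a visibility timeout. NOT the AWS SDK:
// the class and method names here are hypothetical.
class ToyQueue {
  constructor(visibilityTimeout) {
    this.visibilityTimeout = visibilityTimeout; // seconds
    this.messages = []; // entries look like { id, body, visibleAt }
  }

  send(id, body) {
    // New messages are visible immediately.
    this.messages.push({ id, body, visibleAt: 0 });
  }

  // Hand out the first visible message and hide it until now + timeout.
  receive(now) {
    const message = this.messages.find((m) => m.visibleAt <= now);
    if (!message) return null;
    message.visibleAt = now + this.visibilityTimeout;
    return { id: message.id, body: message.body };
  }

  // A worker that finishes its job removes the message for good.
  remove(id) {
    this.messages = this.messages.filter((m) => m.id !== id);
  }
}

const queue = new ToyQueue(30);
queue.send('a', 'job');
const first = queue.receive(0);   // handed to a worker, now hidden
const hidden = queue.receive(10); // null: still inside the timeout
const retry = queue.receive(30);  // timeout elapsed, delivered again
queue.remove('a');
const gone = queue.receive(100);  // null: the job was completed and deleted
```

The point of the model is that a retry needs no extra bookkeeping by the worker: if a worker never deletes the message, the queue redelivers it once the timeout passes.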

This would be really cool to explore -- but might be worth waiting until ECS rolls out their upcoming service discovery system. From the sound of it, that system will...

Is there an ideal period for such an alarm that'd be appropriate across systems with different avg. job durations? In the statement

> alarm when more than 75% of running...

So pragmatically, what metrics do we compose to make this metric? I think it would be

- SQS.NumberOfMessagesReceived = counts messages that have been handed to a worker
- Watchbot.WorkerErrors...

Watchbot's SQS-based try and retry system kinda sorta does this already. Is there an advantage to making a failed placement a special case and not just letting the usual retry...

The dead letter queue isn't supposed to represent chronically malformed or rejected payloads -- the idea is that SQS should never ever drop your job until it has been completed...

We've run up against similar challenges before when running things as Node.js child processes. However, in this case it sounds like we're probably doing that piece alright, but there's a...

Or maybe derive the heartbeat intervals from the `maxJobDuration` option?
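One way that derivation could look, as a rough sketch: take a fixed fraction of `maxJobDuration` and clamp it between a floor and a ceiling. Everything here other than the `maxJobDuration` name is an assumption made for illustration: the helper name `heartbeatInterval`, the `fraction`, `min`, and `max` defaults, and the choice of seconds as the unit are hypothetical, not part of Watchbot.

```javascript
// Hypothetical helper, not part of ecs-watchbot's API.
// Picks a heartbeat interval (assumed to be in seconds) as a fraction of
// maxJobDuration, clamped so very short or very long jobs still get a
// sensible interval.
function heartbeatInterval(maxJobDuration, options = {}) {
  const { fraction = 0.1, min = 10, max = 300 } = options; // assumed defaults
  if (!Number.isFinite(maxJobDuration) || maxJobDuration <= 0) {
    return max; // no usable duration configured: fall back to the ceiling
  }
  const candidate = Math.floor(maxJobDuration * fraction);
  return Math.min(max, Math.max(min, candidate));
}

const tenMinuteJob = heartbeatInterval(600);   // 10% of 600
const shortJob = heartbeatInterval(30);        // clamped up to the floor
const longJob = heartbeatInterval(36000);      // clamped down to the ceiling
```

The clamp matters more than the exact fraction: without a floor, short jobs would heartbeat almost continuously, and without a ceiling, long jobs could go silent for too long to be useful.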

The interval doesn't need to be multiplied because [the retry() function is only ever called after a worker process has errored](https://github.com/mapbox/ecs-watchbot/blob/3c5fd1730178560c745fd792dc9fcf5862876d3f/lib/worker.js#L106). Since the first one failed, there's no risk of overlapping...

That should be prefixed. Think you could put together a PR?