Ryan Clark comments

Results: 61 comments by Ryan Clark

The timeout problem in watchbot 4

Hi @KeithYJohnson -- we rely entirely on SQS to manage that for us. [Check out their docs here](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-visibility-timeout.html).
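
For readers unfamiliar with the behavior the linked docs describe, here is a minimal in-memory sketch of visibility-timeout semantics: a received message is hidden from other consumers until the timeout elapses, then becomes receivable again unless it was deleted. `ToyQueue` and its methods are hypothetical names for illustration only; this is not the SQS API and not watchbot code.

```python
from dataclasses import dataclass, field


@dataclass
class ToyQueue:
    """In-memory stand-in for a queue with a visibility timeout (not real SQS)."""

    visibility_timeout: float
    # Maps message body -> time at which the message is visible again.
    _visible_at: dict = field(default_factory=dict)

    def send(self, body, now=0.0):
        # A newly sent message is visible immediately.
        self._visible_at[body] = now

    def receive(self, now):
        # Hand out the first visible message and hide it for the timeout.
        for body, visible_at in self._visible_at.items():
            if visible_at <= now:
                self._visible_at[body] = now + self.visibility_timeout
                return body
        return None

    def delete(self, body):
        # A worker that finishes its job deletes the message for good.
        self._visible_at.pop(body, None)


queue = ToyQueue(visibility_timeout=30)
queue.send("job-1")
first = queue.receive(now=0)    # handed to a worker
hidden = queue.receive(now=10)  # still invisible to other consumers
again = queue.receive(now=30)   # timeout elapsed without a delete
```

In this toy model, a job that outlives the timeout is handed out a second time, which is the kind of timeout problem the visibility setting controls.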

Try machine learning distribution on ecs-watchbot

This would be really cool to explore -- but might be worth waiting until ECS rolls out their upcoming service discovery system. From the sound of it, that system will...

Alarm on percentage of failing workers

Is there an ideal period for such an alarm that'd be appropriate across systems with different avg. job durations? In the statement > alarm when more than 75% of running...

Alarm on percentage of failing workers

So pragmatically, what metrics do we compose to make this metric? I think it would be

- SQS.NumberOfMessagesReceived = counts messages that have been handed to a worker
- Watchbot.WorkerErrors...
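
One way to read the composition suggested above is a simple ratio of worker errors to messages received over the same period. The sketch below assumes exactly that; the comment is truncated, so the actual proposal may involve other metrics. The function names are hypothetical, and the 75% default is taken from the threshold quoted in the issue.

```python
def failure_percentage(messages_received: int, worker_errors: int) -> float:
    """Percentage of received messages that ended in a worker error.

    Assumes both counts cover the same period. Returns 0.0 when nothing
    was received, so an idle queue never triggers the alarm.
    """
    if messages_received <= 0:
        return 0.0
    return 100.0 * worker_errors / messages_received


def should_alarm(messages_received: int, worker_errors: int,
                 threshold: float = 75.0) -> bool:
    """True when the failure percentage is strictly above the threshold."""
    return failure_percentage(messages_received, worker_errors) > threshold


# 8 errors out of 10 received messages is 80%, above a 75% threshold.
alarm = should_alarm(messages_received=10, worker_errors=8)
```

The open question from the earlier comment still applies: the period over which the two counts are summed has to be long enough to cover typical job durations, or the ratio will be noisy.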

Retry failed task placements before giving up

Watchbot's SQS-based try and retry system kinda sorta does this already. Is there an advantage to making a failed placement a special case and not just letting the usual retry...

Retry failed task placements before giving up

The dead letter queue isn't supposed to represent chronically malformed or rejected payloads -- the idea is that SQS should never ever drop your job until it has been completed...

Overwhelming the watchbot Logger

We've run up against similar challenges before when running things as Node.js child processes. However, in this case it sounds like we're probably doing that piece alright, but there's a...

Expose SQS visibility timeout to Cloudformation template.

Or maybe derive the heartbeat intervals from the `maxJobDuration` option?
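
A rough sketch of what deriving heartbeats from `maxJobDuration` could look like: extend the message's visibility at half the visibility timeout, and stop once the maximum job duration is reached. The helper name, the halving rule, and the use of whole seconds are all assumptions made for illustration; none of this is taken from watchbot's code.

```python
from typing import List


def heartbeat_times(max_job_duration: int, visibility_timeout: int) -> List[int]:
    """Seconds after job start at which to extend the message's visibility.

    Heartbeats fire at half the visibility timeout so each extension lands
    well before the message would reappear, and none are scheduled at or
    after max_job_duration.
    """
    interval = max(1, visibility_timeout // 2)
    return list(range(interval, max_job_duration, interval))


# A 10-minute job limit with a 3-minute visibility timeout.
schedule = heartbeat_times(max_job_duration=600, visibility_timeout=180)
# schedule == [90, 180, 270, 360, 450, 540]
```

A job shorter than the first interval gets an empty schedule, so short-lived workers would never need a heartbeat under this scheme.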

Expose SQS visibility timeout to Cloudformation template.

The interval doesn't need to be multiplied because [the retry() function is only ever called after a worker process errored](https://github.com/mapbox/ecs-watchbot/blob/3c5fd1730178560c745fd792dc9fcf5862876d3f/lib/worker.js#L106). Since the first one failed, there's no risk of overlapping...

Cluster parameter does not respect prefixes

That should be prefixed. Think you could put together a PR?