Refactor doc generation processes to eliminate exponential backoff in step functions

Open Chriscbr opened this issue 4 years ago • 2 comments

Background

When the doc generation process has issues, its DLQ can grow very large (for example, the DLQ in the Gamma account currently holds 20,000+ messages). When a Lambda is invoked to redrive these messages, it simply starts thousands of step function executions, each of which launches ECS tasks that generate documentation for individual languages. Because this spike in workload is far greater than what ECS can handle at once, the step function frequently gets throttled, so it has been configured with a very slow, but still exponential, backoff:

https://github.com/cdklabs/construct-hub/blob/02d5ff3a01768b9203ac061be7ed90200f714e92/src/backend/orchestration/index.ts#L40

Problem

This backoff-and-retry solution clutters the history of individual executions with dozens of failed ECS task requests. Moreover, the more tasks that are running concurrently, the longer it takes for state machine executions to reach the point where individual tasks make progress.
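To make the cost of retrying concrete, here is a rough sketch of how Step Functions retry delays accumulate. A retrier waits `IntervalSeconds * BackoffRate^(n-1)` before the n-th retry. The `RetryPolicy` type, the helper names, and the numeric values are illustrative assumptions, not the settings used in the linked `index.ts`.

```typescript
// Hypothetical shape mirroring the Step Functions retrier fields
// IntervalSeconds, BackoffRate and MaxAttempts.
interface RetryPolicy {
  intervalSeconds: number; // delay before the first retry
  backoffRate: number; // multiplier applied to the delay after each retry
  maxAttempts: number; // maximum number of retries
}

// Delay before the n-th retry (1-indexed).
function retryDelaySeconds(policy: RetryPolicy, attempt: number): number {
  return policy.intervalSeconds * Math.pow(policy.backoffRate, attempt - 1);
}

// Total time spent waiting after `retries` throttled attempts,
// capped at the policy's maxAttempts.
function totalWaitSeconds(policy: RetryPolicy, retries: number): number {
  const count = Math.min(retries, policy.maxAttempts);
  let total = 0;
  for (let n = 1; n <= count; n++) {
    total += retryDelaySeconds(policy, n);
  }
  return total;
}

// Assumed example values only.
const examplePolicy: RetryPolicy = { intervalSeconds: 60, backoffRate: 2, maxAttempts: 5 };
console.log(totalWaitSeconds(examplePolicy, 3)); // 60 + 120 + 240 = 420
```

Each throttled attempt also adds failure events to the execution history, which is the clutter described above.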

Proposed solution

We can make this architecture more stable by introducing a queue, plus a Lambda function that polls the queue and starts tasks only when we are not close to hitting the task limit (e.g. below 80% capacity). The state machine that spawns the individual tasks would no longer invoke the ECS task directly. Instead, it would add the task to an SQS queue and wait for a response before continuing, following the pattern described here: https://docs.aws.amazon.com/step-functions/latest/dg/callback-task-sample-sqs.html If the execution does not receive a response within 6 hours, we can assume the task has failed and send it to the DLQ.
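As a minimal sketch of the capacity check the polling Lambda could perform, the functions below decide how many queued tasks to start on each poll. All names here (`QueuedDocGenTask`, `startableTaskCount`, `selectTasksToStart`) are hypothetical, the 0.8 threshold comes from the example figure above, and the AWS calls that would actually read the queue, count running tasks, and start ECS tasks are deliberately left out.

```typescript
// Hypothetical message shape: in the callback pattern, the state machine
// places a task token in the message so the worker can report back later.
interface QueuedDocGenTask {
  taskToken: string;
  input: unknown;
}

// Number of new tasks that can start without crossing the capacity threshold.
function startableTaskCount(
  runningTasks: number,
  taskLimit: number,
  capacityThreshold = 0.8,
): number {
  const ceiling = Math.floor(taskLimit * capacityThreshold);
  return Math.max(0, ceiling - runningTasks);
}

// Pick the oldest queued tasks that fit; the rest stay queued for the next poll.
function selectTasksToStart<T>(
  queued: readonly T[],
  runningTasks: number,
  taskLimit: number,
  capacityThreshold = 0.8,
): T[] {
  return queued.slice(0, startableTaskCount(runningTasks, taskLimit, capacityThreshold));
}

// Assumed example: limit of 10 tasks, 5 already running, 4 waiting.
const waiting: QueuedDocGenTask[] = ["a", "b", "c", "d"].map((id) => ({
  taskToken: `token-${id}`,
  input: { id },
}));
console.log(selectTasksToStart(waiting, 5, 10).length); // 3
```

Under this design, excess work waits in the queue instead of being submitted and throttled, so an execution's history would show a single wait state rather than a series of failed attempts.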

This would eliminate all errors caused by trying to invoke too many ECS tasks at once, while also giving us greater flexibility over how much compute we allocate to the ECS cluster (we could likely set up autoscaling based on the number of items in the queue).

Open Questions

The proposed solution is a somewhat common serverless architecture pattern - is there an existing CDK L3 construct we can use to abstract this behavior? If not, could we create one?

Chriscbr, Oct 26 '21 01:10

This issue is now marked as stale because it hasn't seen activity for a while. Add a comment or it will be closed soon.

github-actions[bot], Dec 26 '21 01:12

Closing this issue as it hasn't seen activity for a while. Please add a comment @mentioning a maintainer to reopen.

github-actions[bot], Jan 02 '22 01:01