
[ECS] [request]: Capacity to configure the timeout for PROVISIONING Task state

Open henriquesantanati opened this issue 2 weeks ago • 0 comments

Tell us about your request What do you want us to build?

Currently, with Amazon Elastic Container Service (ECS), there is no option to configure the maximum duration that a task can remain in the PROVISIONING state before it transitions to the STOPPED state due to a lack of available capacity. The default behavior, as mentioned in the Deep Dive on ECS Cluster Auto Scaling blog post, is as follows:

At the present time, a maximum of 100 tasks can be in the provisioning state for any cluster, and provisioning tasks will wait for capacity for between 10 and 30 minutes before transitioning to “stopped.”

This default behavior poses challenges for use cases where timely notification of capacity issues is crucial. For example, if you have an EventBridge rule that notifies you about capacity constraints when a task stops, the notification is only triggered after the task has waited in the PROVISIONING state for the full timeout, which can be as long as 30 minutes.

In scenarios where a shorter waiting period is desired, or where more granular control over the task provisioning timeout is required, the current ECS implementation does not provide any configuration options. Users have to work within the existing 10-30 minute timeframe for tasks in the PROVISIONING state.

Which service(s) is this request for? ECS on EC2 using Capacity Provider

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

I am trying to receive alerts when there is insufficient capacity to place a new task or workload. The goal is to receive near real-time notifications whenever the existing Capacity Provider lacks resources, so that appropriate action can be taken to address the capacity constraints before they become a bottleneck or cause disruptions.

Are you currently working around this issue?

One potential workaround is a Lambda function that is triggered when a task enters the PROVISIONING state. Instead of acting immediately, the function uses an SQS delay queue to introduce a deliberate delay, such as 5 minutes, before the task is checked. This delay gives the task time to transition out of the PROVISIONING state on its own.
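The first half of this workaround could be sketched roughly as follows. This is a minimal illustration, not a tested deployment: it assumes an EventBridge rule forwards "ECS Task State Change" events to the function, that ECS emits such an event while the task is in PROVISIONING, and that the queue URL is supplied through a hypothetical `CHECK_QUEUE_URL` environment variable. The message layout (`clusterArn`, `taskArn`) is also just a convention chosen for this example.

```python
import json

# 5 minutes, as in the description above. SQS caps per-message delay at 900 seconds.
DELAY_SECONDS = 300


def build_message(event):
    """Pick out the fields a later status check needs from the task state change event."""
    detail = event["detail"]
    return {"clusterArn": detail["clusterArn"], "taskArn": detail["taskArn"]}


def enqueue_delayed_check(event, sqs_client, queue_url, delay_seconds=DELAY_SECONDS):
    """Send a delayed SQS message only if the task is currently PROVISIONING."""
    if event["detail"].get("lastStatus") != "PROVISIONING":
        return None
    return sqs_client.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps(build_message(event)),
        DelaySeconds=delay_seconds,
    )


def handler(event, context):
    """Lambda entry point; CHECK_QUEUE_URL is a hypothetical environment variable."""
    import os
    import boto3  # imported lazily so the helpers above can be exercised without AWS

    return enqueue_delayed_check(
        event, boto3.client("sqs"), os.environ["CHECK_QUEUE_URL"]
    )
```

Passing the SQS client in as a parameter keeps the logic testable with a stub and leaves the choice of delay to the caller.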

After the specified delay, the message becomes available and a Lambda function is invoked to process it. The function fetches the current state of the task using the DescribeTasks API, passing the task ID from the original event. If the task is still in the PROVISIONING state after the delay, the function can take appropriate action, such as logging the issue or triggering additional remediation steps.
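A rough sketch of this delayed check, under some assumptions: the function is invoked by an SQS event source mapping, each message body is JSON carrying `clusterArn` and `taskArn` (a layout chosen for this example, not something ECS defines), and `notify` is a hypothetical callback standing in for whatever alerting or remediation is wanted.

```python
import json


def still_provisioning(ecs_client, cluster_arn, task_arn):
    """Return True if DescribeTasks reports the task as still PROVISIONING."""
    response = ecs_client.describe_tasks(cluster=cluster_arn, tasks=[task_arn])
    tasks = response.get("tasks", [])
    # A task that is no longer returned is treated as not provisioning.
    return bool(tasks) and tasks[0].get("lastStatus") == "PROVISIONING"


def process_records(event, ecs_client, notify):
    """Check every delayed message and call notify for tasks that are still stuck."""
    stuck = []
    for record in event.get("Records", []):
        message = json.loads(record["body"])
        if still_provisioning(ecs_client, message["clusterArn"], message["taskArn"]):
            notify(message["taskArn"])
            stuck.append(message["taskArn"])
    return stuck


def handler(event, context):
    """Lambda entry point; printing is a placeholder for a real alert."""
    import boto3  # imported lazily so the helpers above can be exercised without AWS

    return process_records(
        event,
        boto3.client("ecs"),
        lambda arn: print(f"Task still PROVISIONING after delay: {arn}"),
    )
```

In practice the `notify` callback might publish to an SNS topic or emit a custom metric; that choice is outside what the workaround above specifies.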

This is similar to the mechanism used in this blog post about monitoring the ECS agent.

henriquesantanati, Jun 20 '24 13:06