ECS Deployment Fails Due to Premature Resource Availability Reporting
Summary
During ECS deployments with EC2 capacity providers, tasks can become stuck in the Pending state because the ECS Agent reports resource availability before old containers have fully stopped. This misreporting causes deployment issues, especially when containers run long-term jobs: new tasks attempt to start on already busy instances instead of triggering the launch of new EC2 instances as expected.
Description
We are using an ECS cluster with an EC2 capacity provider. The capacity provider fully controls the ASG size. Given our instance size (r5.large) and task definition parameters (1024 CPU units and a 7372 MiB memory reservation), we always keep two tasks on each container instance.
Problems usually happen during deployments. The most recent scenario was as follows: we had two EC2 instances and four containers (two on each). A deployment started and sent a stop signal to all four containers. Three containers stopped successfully and were replaced almost immediately, but the fourth was running a long-term job that usually takes a few hours, so it kept running (this is expected behavior). That left us with the following situation:
- the first EC2 instance with two new tasks
- the second EC2 instance with one new and one old task
- one new task stuck in the Pending state because it's trying to start on the second EC2 instance
In the end, we have one task stuck in the Pending state for up to 8 hours. And what happens if all four old tasks are running long-term jobs and cannot be stopped immediately? We have hit that scenario too, and in the end, zero new containers started.
I discovered that the ECS Agent frees up container instance resources immediately after sending a stop signal to the old container. For example, when I monitor available CPU and memory for container instances in the ECS cluster Infrastructure tab, I can see that the ECS Agent reports the second EC2 instance as having 1024 CPU units and 8359 MiB of memory available, even though two containers are still active there (one new, and one old in the stopping state). That's why ECS tries to place a new container on the same EC2 instance where two active containers are already running.
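The same numbers shown in the Infrastructure tab can also be pulled from the API, which makes it easy to watch the reported capacity change the moment the stop signal is sent. A minimal sketch with the AWS CLI — the cluster name and container instance ARN below are placeholders, not the redacted values from this report:

```shell
# Placeholder identifiers -- substitute your own cluster and instance ARN.
CLUSTER=main-production
INSTANCE_ARN=arn:aws:ecs:eu-central-1:123456789012:container-instance/main-production/0123456789abcdef0123456789abcdef

# remainingResources is the capacity the scheduler uses for placement;
# during the stop window it may already exclude the still-running old task.
aws ecs describe-container-instances \
  --cluster "$CLUSTER" \
  --container-instances "$INSTANCE_ARN" \
  --query 'containerInstances[].{cpu: remainingResources[?name==`CPU`].integerValue | [0], memory: remainingResources[?name==`MEMORY`].integerValue | [0], runningTasks: runningTasksCount}'
```

Comparing `remainingResources` against `runningTasksCount` while a long-term job is draining shows the mismatch described above.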
Expected Behavior
The ECS Agent should free up resources only once the container has actually stopped. This behavior would solve all of the problems above: the capacity provider would deploy a new EC2 instance and place the new task there instead of trying to use an already busy container instance.
Observed Behavior
The ECS Agent incorrectly reports that the EC2 instance has the resources of the stopping container available, even though that container is still active and running in reality.
Environment Details
- ECS Service task definition with binpack memory placement strategy:
{
  "taskDefinitionArn": "arn:aws:ecs:eu-central-1:***:task-definition/ec2_sidekiq_production:414",
  "containerDefinitions": [
    {
      "name": "sidekiq",
      "image": "***",
      "cpu": 1024,
      "memory": 11059,
      "memoryReservation": 7372,
      "portMappings": [],
      "essential": true,
      "environment": [],
      "mountPoints": [],
      "volumesFrom": [],
      "stopTimeout": 28800,
      "readonlyRootFilesystem": false,
      "ulimits": [
        {
          "name": "nofile",
          "softLimit": 75000,
          "hardLimit": 100000
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "***",
          "awslogs-region": "eu-central-1",
          "awslogs-stream-prefix": "prefix"
        }
      },
      "systemControls": []
    }
  ],
  "family": "ec2_sidekiq_production",
  "taskRoleArn": "arn:aws:iam::***:role/backend-production-ecs-task-role",
  "executionRoleArn": "arn:aws:iam::***:role/backend-production-ecs-task-execution-role",
  "networkMode": "host",
  "revision": 414,
  "volumes": [],
  "status": "ACTIVE",
  "requiresAttributes": [
    { "name": "com.amazonaws.ecs.capability.logging-driver.awslogs" },
    { "name": "ecs.capability.execution-role-awslogs" },
    { "name": "com.amazonaws.ecs.capability.task-iam-role-network-host" },
    { "name": "com.amazonaws.ecs.capability.ecr-auth" },
    { "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19" },
    { "name": "com.amazonaws.ecs.capability.docker-remote-api.1.21" },
    { "name": "com.amazonaws.ecs.capability.task-iam-role" },
    { "name": "ecs.capability.container-ordering" },
    { "name": "ecs.capability.execution-role-ecr-pull" },
    { "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18" }
  ],
  "placementConstraints": [],
  "compatibilities": [
    "EXTERNAL",
    "EC2"
  ],
  "requiresCompatibilities": [
    "EC2"
  ]
}
- docker info:
Client:
 Context: default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc., v0.0.0+unknown)

Server:
 Containers: 4
  Running: 3
  Paused: 0
  Stopped: 1
 Images: 6
 Server Version: 20.10.25
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 64b8a811b07ba6288238eefc14d898ee0b5b99ba
 runc version: 4bccb38cc9cf198d52bebf2b3a90cd14e7af8c06
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.14.336-256.559.amzn2.x86_64
 Operating System: Amazon Linux 2
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 15.36GiB
 Name: ip-10-5-95-24.eu-central-1.compute.internal
 ID: UY3D:MFD5:SRYJ:ELAF:2B2T:Z6BT:ZKD6:KRN2:MOX6:CRHI:I4IT:FOMA
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
- curl http://localhost:51678/v1/metadata:
{"Cluster":"main-production","ContainerInstanceArn":"arn:aws:ecs:eu-central-1:xxx:container-instance/main-production/fed4162115cd4f3e98d8313e1c5726f2","Version":"Amazon ECS Agent - v1.81.1 (*a4101a6e)"}
- df -h:
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        7.7G     0  7.7G   0% /dev
tmpfs           7.7G     0  7.7G   0% /dev/shm
tmpfs           7.7G  492K  7.7G   1% /run
tmpfs           7.7G     0  7.7G   0% /sys/fs/cgroup
/dev/nvme0n1p1   30G  5.5G   25G  19% /
- free -m:
              total        used        free      shared  buff/cache   available
Mem:          15731        2201        8419           0        5110       13390
Swap:          2047           0        2047
Supporting Log Snippets
I collected logs using ecs-logs-collector, but I am uncomfortable sharing them here. Can you provide me with another method for sending logs?
Hello @zahorniak,
Thank you for the detailed explanation. Would you be able to send the logs for further investigation to this email address: [email protected]?
Thank you
Hey @hozkaya2000,
I sent you the logs as soon as you provided your email. Sorry for forgetting to mention it here.
Hi @zahorniak, thanks for raising this issue.
I discovered that ECS Agent frees up container instance resources immediately after sending a stop signal to the old container.
Just to add a bit more context: this is the expected behavior for the agent, where reported host resources are only marked as free after the agent sends a stopped task state change. The agent will/should only send a stopped task state change, and clean up the task's resources, once the known status of the container is terminal (i.e., stopped).
For example, when I monitor available CPU and Memory for Container instances in the ECS cluster Infrastructure tab, I can see that ECS Agent says that the second EC2 has 1024 CPUs and 8359 Memory available even though there are still two active containers (one new and the one old in stopping state)
Hm, may I ask which agent and AMI versions you're currently using? There was a change to our task launch behavior sometime last year (2023). Please check out the following public AWS document for more information -> https://aws.amazon.com/blogs/containers/improvements-to-amazon-ecs-task-launch-behavior-when-tasks-have-prolonged-shutdown/
Hi @zahorniak. In addition to the information shared by @mye956 above, the following two configuration options might be useful for your use case. Have you tried these already?
Service deployment configuration option 'maximumPercent' -
maximumPercent
Type: Integer
Required: No
If a service is using the rolling update (ECS) deployment type, the maximumPercent parameter represents an upper limit on the number of your service's tasks that are allowed in the RUNNING, STOPPING, or PENDING state during a deployment. It is expressed as a percentage of the desiredCount that is rounded down to the nearest integer. You can use this parameter to define the deployment batch size. For example, if your service is using the REPLICA service scheduler and has a desiredCount of four tasks and a maximumPercent value of 200%, the scheduler might start four new tasks before stopping the four older tasks. This is provided that the cluster resources required to do this are available. The default maximumPercent value for a service using the REPLICA service scheduler is 200%. If your service is using the DAEMON service scheduler type, the maximumPercent should remain at 100%, which is the default value.
The maximum number of tasks during a deployment is the desiredCount multiplied by maximumPercent/100, rounded down to the nearest integer value.
If a service is using either the blue/green (CODE_DEPLOY) or EXTERNAL deployment types and tasks that use the EC2 launch type, the maximum percent value is set to the default value and is used to define the upper limit on the number of the tasks in the service that remain in the RUNNING state while the container instances are in the DRAINING state. If the tasks in the service use the Fargate launch type, the maximum percent value isn't used, although it's returned when describing your service.
If you set it to 100%, I think the scheduler won't start a replacement task until a STOPPED task has been identified for it.
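To make the batch-size math from the quoted docs concrete, here is the rounding rule sketched with the example numbers (desiredCount of 4, maximumPercent of 200%):

```shell
desired_count=4
maximum_percent=200

# Upper limit on tasks in RUNNING/STOPPING/PENDING during a deployment:
# desiredCount * maximumPercent / 100, rounded down (shell integer division).
max_tasks=$(( desired_count * maximum_percent / 100 ))
echo "$max_tasks"   # prints 8: the scheduler may start 4 new tasks before the 4 old ones stop

# With maximumPercent=100 the ceiling equals desiredCount, so a replacement
# task cannot start until an old task has actually been reported STOPPED.
echo $(( desired_count * 100 / 100 ))   # prints 4
```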
ECS Agent's ECS_CONTAINER_STOP_TIMEOUT configuration option.
ECS_CONTAINER_STOP_TIMEOUT Instance scoped configuration for time to wait for the container to exit normally before being forcibly killed.
What's the value of this option on your container instances? Default is 10 minutes, so I don't follow why a container running a long job doesn't stop for several hours in your case.
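For reference, this instance-scoped option is set in /etc/ecs/ecs.config on the container instance. A sketch — the 8h value here is an assumption mirroring the 28800-second stopTimeout in the task definition above, not a recommendation:

```shell
# /etc/ecs/ecs.config -- instance-scoped ECS Agent configuration.
# Time to wait after a stop signal before the container is forcibly killed.
# A per-container stopTimeout in the task definition takes precedence over this.
ECS_CONTAINER_STOP_TIMEOUT=8h
```

The agent must be restarted for changes to this file to take effect.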
ECS Agent wrongly tells that EC2 has available resources from the stopped container, but it's still active and running in real life.
Also, on this point: the ECS Agent does not report available resources to the ECS backend at all. The ECS backend has its own resource-accounting logic, which is independent of the ECS Agent's.
Hi @amogh09,
Thanks for your input and questions. I'll try to answer if that's okay with you.
Yes, we are using this parameter, set to 200% for our service. The problem is not that ECS doesn't start new containers; the problem is that ECS tries to start a container on an EC2 instance that is already at full capacity.
ECS Agent's ECS_CONTAINER_STOP_TIMEOUT configuration option.
We do not set this option for the ECS Agent. Instead, we set the stopTimeout option in the task definition to a few hours to make sure that all our jobs finish successfully.
Right now, we're implementing a task scale-in protection mechanism for our ECS services, as @mye956 suggested. We will begin testing it this week and hopefully see positive results within a few days or weeks.
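For readers landing here later: task scale-in protection can be toggled from inside the running task via the agent's task protection endpoint. A minimal sketch — the ExpiresInMinutes value is an assumption sized to the longest expected job, not a value from this thread:

```shell
# Run from inside the task; ECS_AGENT_URI is injected into the container by the agent.
# Protect this task while a long-term job is running (here: up to 8 hours).
curl -s -X PUT "$ECS_AGENT_URI/task-protection/v1/state" \
  -H 'Content-Type: application/json' \
  -d '{"ProtectionEnabled": true, "ExpiresInMinutes": 480}'

# When the job finishes, drop the protection so the deployment can replace the task.
curl -s -X PUT "$ECS_AGENT_URI/task-protection/v1/state" \
  -H 'Content-Type: application/json' \
  -d '{"ProtectionEnabled": false}'
```

The same state can be managed externally with the UpdateTaskProtection API.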