ECS Deployment Fails Due to Premature Resource Availability Reporting
Summary
During ECS deployments with EC2 capacity providers, tasks can become stuck in the Pending state because the ECS Agent reports resource availability before old containers have fully stopped. This misreporting causes deployment issues, especially when containers run long-term jobs: new tasks attempt to start on already busy instances instead of triggering the launch of new EC2 instances as expected.
Description
We are using an ECS cluster with an EC2 capacity provider. The capacity provider fully controls the ASG size. Given our instance size (r5.large) and task definition parameters (1024 CPU units and a 7372 MiB memory reservation), we always keep two tasks on each container instance.
Problems usually happen during deployments. The most recent scenario was as follows: we had two EC2 instances and four containers (two on each). A deployment started and sent a stop signal to all four containers. Three containers stopped successfully and were replaced almost immediately, but the fourth was running a long-term job that usually takes a few hours, so it kept running (this is expected behavior). That left us with the following situation:
- the first EC2 instance with two new tasks
- the second EC2 instance with one new and one old task
- one new task stuck in the Pending state because it's trying to start on the second EC2 instance
In the end, we have one task stuck in the Pending state for up to 8 hours. And what happens if all four old tasks are running long-term jobs and cannot be stopped immediately? We have hit that scenario too, and in the end, zero new containers started.
I discovered that the ECS Agent frees up container instance resources immediately after sending a stop signal to the old container. For example, when I monitor available CPU and memory for container instances in the ECS cluster Infrastructure tab, I can see that the ECS Agent reports the second EC2 instance as having 1024 CPU units and 8359 MiB of memory available, even though two containers are still active there (one new, and one old in the stopping state). That's why ECS tries to place a new container on the same EC2 instance where two active containers are already running.
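The same numbers shown in the Infrastructure tab can also be pulled from the API, which makes it easy to watch the reported capacity change the moment the stop signal is sent. A minimal sketch with the AWS CLI — the cluster name and container instance ARN below are placeholders, not the redacted values from this report:

```shell
# Placeholder identifiers -- substitute your own cluster and instance ARN.
CLUSTER=main-production
INSTANCE_ARN=arn:aws:ecs:eu-central-1:123456789012:container-instance/main-production/0123456789abcdef0123456789abcdef

# remainingResources is the capacity the scheduler uses for placement;
# during the stop window it may already exclude the still-running old task.
aws ecs describe-container-instances \
  --cluster "$CLUSTER" \
  --container-instances "$INSTANCE_ARN" \
  --query 'containerInstances[].{cpu: remainingResources[?name==`CPU`].integerValue | [0], memory: remainingResources[?name==`MEMORY`].integerValue | [0], runningTasks: runningTasksCount}'
```

Comparing `remainingResources` against `runningTasksCount` while a long-term job is draining shows the mismatch described above.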
Expected Behavior
The ECS Agent should free up resources only once the container has actually stopped. This behavior would solve all of the problems above: the capacity provider would deploy a new EC2 instance and place the new task there instead of trying to use an already busy container instance.
Observed Behavior
The ECS Agent incorrectly reports that the EC2 instance has the resources of the stopping container available, even though that container is still active and running in reality.
Environment Details
- ECS Service task definition with binpack memory placement strategy:
{
  "taskDefinitionArn": "arn:aws:ecs:eu-central-1:***:task-definition/ec2_sidekiq_production:414",
  "containerDefinitions": [
    {
      "name": "sidekiq",
      "image": "***",
      "cpu": 1024,
      "memory": 11059,
      "memoryReservation": 7372,
      "portMappings": [],
      "essential": true,
      "environment": [],
      "mountPoints": [],
      "volumesFrom": [],
      "stopTimeout": 28800,
      "readonlyRootFilesystem": false,
      "ulimits": [
        {
          "name": "nofile",
          "softLimit": 75000,
          "hardLimit": 100000
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "***",
          "awslogs-region": "eu-central-1",
          "awslogs-stream-prefix": "prefix"
        }
      },
      "systemControls": []
    }
  ],
  "family": "ec2_sidekiq_production",
  "taskRoleArn": "arn:aws:iam::***:role/backend-production-ecs-task-role",
  "executionRoleArn": "arn:aws:iam::***:role/backend-production-ecs-task-execution-role",
  "networkMode": "host",
  "revision": 414,
  "volumes": [],
  "status": "ACTIVE",
  "requiresAttributes": [
    { "name": "com.amazonaws.ecs.capability.logging-driver.awslogs" },
    { "name": "ecs.capability.execution-role-awslogs" },
    { "name": "com.amazonaws.ecs.capability.task-iam-role-network-host" },
    { "name": "com.amazonaws.ecs.capability.ecr-auth" },
    { "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19" },
    { "name": "com.amazonaws.ecs.capability.docker-remote-api.1.21" },
    { "name": "com.amazonaws.ecs.capability.task-iam-role" },
    { "name": "ecs.capability.container-ordering" },
    { "name": "ecs.capability.execution-role-ecr-pull" },
    { "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18" }
  ],
  "placementConstraints": [],
  "compatibilities": [
    "EXTERNAL",
    "EC2"
  ],
  "requiresCompatibilities": [
    "EC2"
  ]
}
- docker info:
Client:
 Context: default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc., v0.0.0+unknown)

Server:
 Containers: 4
  Running: 3
  Paused: 0
  Stopped: 1
 Images: 6
 Server Version: 20.10.25
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 64b8a811b07ba6288238eefc14d898ee0b5b99ba
 runc version: 4bccb38cc9cf198d52bebf2b3a90cd14e7af8c06
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.14.336-256.559.amzn2.x86_64
 Operating System: Amazon Linux 2
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 15.36GiB
 Name: ip-10-5-95-24.eu-central-1.compute.internal
 ID: UY3D:MFD5:SRYJ:ELAF:2B2T:Z6BT:ZKD6:KRN2:MOX6:CRHI:I4IT:FOMA
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
- curl http://localhost:51678/v1/metadata:
{"Cluster":"main-production","ContainerInstanceArn":"arn:aws:ecs:eu-central-1:xxx:container-instance/main-production/fed4162115cd4f3e98d8313e1c5726f2","Version":"Amazon ECS Agent - v1.81.1 (*a4101a6e)"}
- df -h:
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        7.7G     0  7.7G   0% /dev
tmpfs           7.7G     0  7.7G   0% /dev/shm
tmpfs           7.7G  492K  7.7G   1% /run
tmpfs           7.7G     0  7.7G   0% /sys/fs/cgroup
/dev/nvme0n1p1   30G  5.5G   25G  19% /
- free -m:
              total        used        free      shared  buff/cache   available
Mem:          15731        2201        8419           0        5110       13390
Swap:          2047           0        2047
Supporting Log Snippets
I collected logs using ecs-logs-collector, but I am uncomfortable sharing them here. Can you provide me with another method for sending logs?
Hello @zahorniak,
Thank you for the detailed explanation. Would you be able to send the logs for further investigation to this email address: [email protected]?
Thank you
Hey @hozkaya2000,
I sent you the logs as soon as you provided your email. Sorry for forgetting to mention it here.
Hi @zahorniak, thanks for raising this issue.
I discovered that ECS Agent frees up container instance resources immediately after sending a stop signal to the old container.
Just to add a bit more context: this is the expected behavior for the agent, where reported host resources are only marked as free after the agent sends a stopped task state change. The agent will/should only send a stopped task state change, and clean up the task's resources, once the known status of the container is terminal (i.e., stopped).
For example, when I monitor available CPU and Memory for Container instances in the ECS cluster Infrastructure tab, I can see that ECS Agent says that the second EC2 has 1024 CPUs and 8359 Memory available even though there are still two active containers (one new and the one old in stopping state)
Hm, may I ask which agent and AMI versions you're currently using? There was a change to our task launch behavior sometime last year (2023). Please check out the following public AWS document for more information -> https://aws.amazon.com/blogs/containers/improvements-to-amazon-ecs-task-launch-behavior-when-tasks-have-prolonged-shutdown/
Hi @zahorniak. In addition to the information shared by @mye956 above, the following two configuration options might be useful for your use case. Have you tried these already?
Service deployment configuration option 'maximumPercent' -
maximumPercent
Type: Integer
Required: No
If a service is using the rolling update (ECS) deployment type, the maximumPercent parameter represents an upper limit on the number of your service's tasks that are allowed in the RUNNING, STOPPING, or PENDING state during a deployment. It is expressed as a percentage of the desiredCount that is rounded down to the nearest integer. You can use this parameter to define the deployment batch size. For example, if your service is using the REPLICA service scheduler and has a desiredCount of four tasks and a maximumPercent value of 200%, the scheduler might start four new tasks before stopping the four older tasks. This is provided that the cluster resources required to do this are available. The default maximumPercent value for a service using the REPLICA service scheduler is 200%. If your service is using the DAEMON service scheduler type, the maximumPercent should remain at 100%, which is the default value.
The maximum number of tasks during a deployment is the desiredCount multiplied by maximumPercent/100, rounded down to the nearest integer value.
If a service is using either the blue/green (CODE_DEPLOY) or EXTERNAL deployment types and tasks that use the EC2 launch type, the maximum percent value is set to the default value and is used to define the upper limit on the number of the tasks in the service that remain in the RUNNING state while the container instances are in the DRAINING state. If the tasks in the service use the Fargate launch type, the maximum percent value isn't used, although it's returned when describing your service.
If you set it to 100%, I think the scheduler won't start a replacement task until a STOPPED task has been identified for it.
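To make the batch-size math from the quoted docs concrete, here is the rounding rule sketched with the example numbers (desiredCount of 4, maximumPercent of 200%):

```shell
desired_count=4
maximum_percent=200

# Upper limit on tasks in RUNNING/STOPPING/PENDING during a deployment:
# desiredCount * maximumPercent / 100, rounded down (shell integer division).
max_tasks=$(( desired_count * maximum_percent / 100 ))
echo "$max_tasks"   # prints 8: the scheduler may start 4 new tasks before the 4 old ones stop

# With maximumPercent=100 the ceiling equals desiredCount, so a replacement
# task cannot start until an old task has actually been reported STOPPED.
echo $(( desired_count * 100 / 100 ))   # prints 4
```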
ECS Agent's ECS_CONTAINER_STOP_TIMEOUT configuration option.
ECS_CONTAINER_STOP_TIMEOUT Instance scoped configuration for time to wait for the container to exit normally before being forcibly killed.
What's the value of this option on your container instances? Default is 10 minutes, so I don't follow why a container running a long job doesn't stop for several hours in your case.
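For reference, this instance-scoped option is set in /etc/ecs/ecs.config on the container instance. A sketch — the 8h value here is an assumption mirroring the 28800-second stopTimeout in the task definition above, not a recommendation:

```shell
# /etc/ecs/ecs.config -- instance-scoped ECS Agent configuration.
# Time to wait after a stop signal before the container is forcibly killed.
# A per-container stopTimeout in the task definition takes precedence over this.
ECS_CONTAINER_STOP_TIMEOUT=8h
```

The agent must be restarted for changes to this file to take effect.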
ECS Agent wrongly tells that EC2 has available resources from the stopped container, but it's still active and running in real life.
Also, on this point: the ECS Agent does not report available resources to the ECS backend at all. The ECS backend has its own resource-accounting logic, which is independent of the ECS Agent's.
Hi @amogh09,
Thanks for your input and questions. I'll try to answer if that's okay with you.
Yes, we are using this parameter, set to 200% for our service. The problem is not that ECS doesn't start new containers; the problem is that ECS tries to start a container on an EC2 instance that is already at full capacity.
ECS Agent's ECS_CONTAINER_STOP_TIMEOUT configuration option.
We do not set this option for the ECS Agent. Instead, we set the stopTimeout option in the task definition to a few hours to make sure that all our jobs finish successfully.
Right now, we're implementing a task scale-in protection mechanism for our ECS services, as @mye956 suggested. We will begin testing it this week and hopefully see positive results within a few days or weeks.
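For readers landing here later: task scale-in protection can be toggled from inside the running task via the agent's task protection endpoint. A minimal sketch — the ExpiresInMinutes value is an assumption sized to the longest expected job, not a value from this thread:

```shell
# Run from inside the task; ECS_AGENT_URI is injected into the container by the agent.
# Protect this task while a long-term job is running (here: up to 8 hours).
curl -s -X PUT "$ECS_AGENT_URI/task-protection/v1/state" \
  -H 'Content-Type: application/json' \
  -d '{"ProtectionEnabled": true, "ExpiresInMinutes": 480}'

# When the job finishes, drop the protection so the deployment can replace the task.
curl -s -X PUT "$ECS_AGENT_URI/task-protection/v1/state" \
  -H 'Content-Type: application/json' \
  -d '{"ProtectionEnabled": false}'
```

The same state can be managed externally with the UpdateTaskProtection API.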