AWS ECS task stuck in pending state
Summary
AWS ECS task stuck in pending state
Description
I am using Rails and have deployed my server on AWS ECS with two tasks: an app server and a Sidekiq server. Sometimes, once or twice a week, my app server tasks drop to 0 and all the tasks get stuck in a pending state. I have to kill a task manually or run a new deployment to get the tasks running again. Please advise how to fix this. I have added some logs here:
Expected Behavior
No tasks should be in a pending state.
Observed Behavior
A lot of tasks remain in a pending state for a very long time.
Environment Details
Examples:
- docker info
[ec2-user@ip-10-180-25-9 ~]$ docker info
Client:
Context: default
Debug Mode: false
Server:
Containers: 14
Running: 4
Paused: 0
Stopped: 10
Images: 6
Server Version: 20.10.13
Storage Driver: devicemapper
Pool Name: docker-docker--pool
Pool Blocksize: 524.3kB
Base Device Size: 10.74GB
Backing Filesystem: ext4
Udev Sync Supported: true
Data Space Used: 19.75GB
Data Space Total: 23.33GB
Data Space Available: 3.576GB
Metadata Space Used: 3.351MB
Metadata Space Total: 25.17MB
Metadata Space Available: 21.82MB
Thin Pool Minimum Free Space: 2.333GB
Deferred Removal Enabled: true
Deferred Deletion Enabled: true
Deferred Deleted Device Count: 0
Library Version: 1.02.135-RHEL7 (2016-11-16)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
Default Runtime: runc
Init Binary: docker-init
containerd version: 9cc61520f4cd876b86e77edfeb88fbcd536d1f9d
runc version: f46b6ba2c9314cfc8caae24a32ec5fe9ef1059fe
init version: de40ad0
Security Options:
seccomp
Profile: default
Kernel Version: 4.9.62-21.56.amzn1.x86_64
Operating System: Amazon Linux AMI 2017.09
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.456GiB
Name: ip-10-180-25-9
ID: MNU6:STOP:DMMH:WYPV:7CUY:BAJ2:7WQM:KWSA:IHV2:PZAE:QZEU:PMRI
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
- curl http://localhost:51678/v1/metadata
{"Cluster":"ac-prod","ContainerInstanceArn":"arn:aws:ecs:us-east-1:277312685707:container-instance/ac-prod/4b4514afbe7e41548145ff1f2ad9127b","Version":"Amazon ECS Agent - v1.51.0 (5c821610)"}
- df -h
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p1 7.8G 1.1G 6.6G 15% /
devtmpfs 3.8G 104K 3.8G 1% /dev
tmpfs 3.8G 0 3.8G 0% /dev/shm
Supporting Log Snippets
I also tried running this command to fetch docker status:
curl 127.0.0.1:51678/v1/tasks | jq '[.Tasks[] | {task: "\(.Family):\(.Version)",status: "\(.KnownStatus) -> \(.DesiredStatus)", dockerId: .Containers[0] | .DockerId}] | sort_by(.task,.dockerId)'
[
{
"task": "almaconnect-app-server:4155",
"status": "PENDING -> RUNNING",
"dockerId": null
},
{
"task": "almaconnect-app-server:4155",
"status": "RUNNING -> STOPPED",
"dockerId": "c122fdb012177deb59f6d9d7bc99bfe0f562d0bd7d8bc3f3af0f0782a7ae5ff5"
},
{
"task": "almaconnect-delayedjob:3955",
"status": "PENDING -> RUNNING",
"dockerId": null
},
{
"task": "almaconnect-seo-server:4067",
"status": "PENDING -> RUNNING",
"dockerId": null
},
{
"task": "karmabox-app-server:756",
"status": "STOPPED -> STOPPED",
"dockerId": "851d4a89b534f0e8b411a73474774d8fabb0fb404c68c61e4605ab2172df3a7a"
},
{
"task": "karmabox-app-server:757",
"status": "RUNNING -> RUNNING",
"dockerId": "f58260ebf25dec72f48d62ce5c12de7d2cb2e3ddde4add25b62921164a479358"
},
{
"task": "karmabox-sidekiq:772",
"status": "STOPPED -> STOPPED",
"dockerId": "0daf4a9f93ed1997d747025c277c0e220305382ea68a6d41b8f90e80c652e418"
},
{
"task": "karmabox-sidekiq:772",
"status": "STOPPED -> STOPPED",
"dockerId": "1f929c72c847b413e224e08dcc5bcfa5d512b38b7fcb1ce646387cbdeab0634d"
},
{
"task": "karmabox-sidekiq:772",
"status": "STOPPED -> STOPPED",
"dockerId": "589c6ca8e73d15987b76fa1ac1fd90ab3866c2d5dad826a20343b238b8da2d18"
},
{
"task": "karmabox-sidekiq:772",
"status": "STOPPED -> STOPPED",
"dockerId": "a207efbeb467372be9c03f9dcada851b3ace1d37fe67eac04766427b89d35d66"
},
{
"task": "karmabox-sidekiq:772",
"status": "STOPPED -> STOPPED",
"dockerId": "a5bbaa346e594112f0015e3a9967c3b47ce1f01e486b3a9245c7579a23998434"
},
{
"task": "karmabox-sidekiq:772",
"status": "STOPPED -> STOPPED",
"dockerId": "daf986c13ba0fc8302c176540c0a5870a213642367f754e9775a9d367aa01b83"
},
{
"task": "karmabox-sidekiq:773",
"status": "PENDING -> RUNNING",
"dockerId": null
},
{
"task": "karmabox-sidekiq:773",
"status": "STOPPED -> STOPPED",
"dockerId": "76d70abf2d27f8243f4f7946e1be4287b9b13e40792564eda0959560125336fc"
},
{
"task": "karmabox-sidekiq:773",
"status": "STOPPED -> STOPPED",
"dockerId": "b5b4b7618c11409287c448b441aa296a2bcaea266a57276cc1e977044167f8ac"
},
{
"task": "karmabox-sidekiq:773",
"status": "STOPPED -> STOPPED",
"dockerId": "eab5c542ab5dd048f36cdf03642dcb00d8a0ec1267d9f32c39613871407f2ce1"
},
{
"task": "karmabox-sidekiq:773",
"status": "STOPPED -> STOPPED",
"dockerId": "f35c14266cc915108bbd766be6ff3483888508d020a014491b788be4f9b6d62b"
}
]
Apparently, all the tasks that have a null dockerId are the ones shown as pending in the UI, and hence no new tasks are being created.
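For anyone hitting the same symptom, the jq filter above can also be expressed as a small script against the agent's introspection endpoint (http://127.0.0.1:51678/v1/tasks). This is just a sketch under my reading of that endpoint's JSON shape as shown in the output above; the function name and the abbreviated sample payload are mine:

```python
def stuck_tasks(tasks_payload):
    """Return tasks that are trying to go PENDING -> RUNNING but never
    got a Docker container created (DockerId is null)."""
    stuck = []
    for task in tasks_payload.get("Tasks", []):
        containers = task.get("Containers") or [{}]
        if (task.get("KnownStatus") == "PENDING"
                and task.get("DesiredStatus") == "RUNNING"
                and containers[0].get("DockerId") is None):
            stuck.append(f'{task["Family"]}:{task["Version"]}')
    return stuck

if __name__ == "__main__":
    # On the container instance itself you could fetch live data with e.g.:
    #   import json, urllib.request
    #   payload = json.load(urllib.request.urlopen("http://127.0.0.1:51678/v1/tasks"))
    # Abbreviated sample mirroring the output pasted above:
    payload = {"Tasks": [
        {"Family": "almaconnect-app-server", "Version": "4155",
         "KnownStatus": "PENDING", "DesiredStatus": "RUNNING",
         "Containers": [{"DockerId": None}]},
        {"Family": "karmabox-app-server", "Version": "757",
         "KnownStatus": "RUNNING", "DesiredStatus": "RUNNING",
         "Containers": [{"DockerId": "f58260eb"}]},
    ]}
    print(stuck_tasks(payload))  # ['almaconnect-app-server:4155']
```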
Hi @siddhant-mohan, thanks for reporting this!
I see you are running your server on Amazon Linux 1. Is there a specific reason not to use AL2? Migrating to AL2 is highly recommended, since AL1 only receives patches for critical CVEs and the latest ECS Agent versions are usually not supported on it.
Going back to the issue you are facing: if there is no dockerId associated with a container, that means the container could not be created in the Docker engine. A task with state "PENDING -> RUNNING" is attempting to transition from PENDING to RUNNING, and in the PENDING state we do not expect it to have a dockerId, because the container has not yet been successfully created in the Docker engine.
Can you please try to reproduce this on AL2 (potentially on the ECS-optimized AMI if it suits your purposes), if there are no blockers to using AL2? If you have to stay on AL1, please run the ECS logs collector following the instructions at https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-logs-collector.html to gather more info, and send the log files to [email protected] so we can troubleshoot.
Just FYI: aws ssm get-parameters --names /aws/service/ecs/optimized-ami/amazon-linux-2/recommended can help you get the latest ECS-optimized AMI if you plan to use it.
@siddhant-mohan maybe it's connected to https://github.com/aws/containers-roadmap/issues/325
ECS has a known bug that makes new tasks wait for all other tasks on the instance to be stopped, while it still schedules new tasks to the instance as PENDING in the meantime. The issue is actually that you have a task that is pending to be stopped (the one above with status "RUNNING -> STOPPED"). In this situation, any other new task scheduled on the same instance is kept PENDING until that task is stopped.
Closing due to inactivity -- please re-open if you are able to follow up on @Realmonia's recommendations and supply more information.