
AWS ECS task stuck in pending state

Open siddhant-mohan opened this issue 3 years ago • 2 comments

Summary

AWS ECS task stuck in pending state

Description

I am using Rails and have deployed my server on AWS ECS with two tasks: an app server and a Sidekiq server. Sometimes, once or twice a week, my running app server tasks drop to 0 and all the tasks get stuck in a pending state. I either kill a task manually or run a new deployment to get the tasks running again. Please advise how to fix this. I have added some logs below:

Expected Behavior

No tasks should be in a pending state.

Observed Behavior

A lot of tasks are stuck in a pending state for a very long time.

Environment Details

Examples:

  • docker info
[ec2-user@ip-10-180-25-9 ~]$ docker info
Client:
 Context:    default
 Debug Mode: false

Server:
 Containers: 14
  Running: 4
  Paused: 0
  Stopped: 10
 Images: 6
 Server Version: 20.10.13
 Storage Driver: devicemapper
  Pool Name: docker-docker--pool
  Pool Blocksize: 524.3kB
  Base Device Size: 10.74GB
  Backing Filesystem: ext4
  Udev Sync Supported: true
  Data Space Used: 19.75GB
  Data Space Total: 23.33GB
  Data Space Available: 3.576GB
  Metadata Space Used: 3.351MB
  Metadata Space Total: 25.17MB
  Metadata Space Available: 21.82MB
  Thin Pool Minimum Free Space: 2.333GB
  Deferred Removal Enabled: true
  Deferred Deletion Enabled: true
  Deferred Deleted Device Count: 0
  Library Version: 1.02.135-RHEL7 (2016-11-16)
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 9cc61520f4cd876b86e77edfeb88fbcd536d1f9d
 runc version: f46b6ba2c9314cfc8caae24a32ec5fe9ef1059fe
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.9.62-21.56.amzn1.x86_64
 Operating System: Amazon Linux AMI 2017.09
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 7.456GiB
 Name: ip-10-180-25-9
 ID: MNU6:STOP:DMMH:WYPV:7CUY:BAJ2:7WQM:KWSA:IHV2:PZAE:QZEU:PMRI
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
  • curl http://localhost:51678/v1/metadata
{"Cluster":"ac-prod","ContainerInstanceArn":"arn:aws:ecs:us-east-1:277312685707:container-instance/ac-prod/4b4514afbe7e41548145ff1f2ad9127b","Version":"Amazon ECS Agent - v1.51.0 (5c821610)"}
  • df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p1  7.8G  1.1G  6.6G  15% /
devtmpfs        3.8G  104K  3.8G   1% /dev
tmpfs           3.8G     0  3.8G   0% /dev/shm

Supporting Log Snippets

I also tried running this command to fetch the task status from the ECS agent introspection endpoint:

curl 127.0.0.1:51678/v1/tasks | jq '[.Tasks[] | {task: "\(.Family):\(.Version)",status: "\(.KnownStatus) -> \(.DesiredStatus)", dockerId: .Containers[0] | .DockerId}] | sort_by(.task,.dockerId)'


[
  {
    "task": "almaconnect-app-server:4155",
    "status": "PENDING -> RUNNING",
    "dockerId": null
  },
  {
    "task": "almaconnect-app-server:4155",
    "status": "RUNNING -> STOPPED",
    "dockerId": "c122fdb012177deb59f6d9d7bc99bfe0f562d0bd7d8bc3f3af0f0782a7ae5ff5"
  },
  {
    "task": "almaconnect-delayedjob:3955",
    "status": "PENDING -> RUNNING",
    "dockerId": null
  },
  {
    "task": "almaconnect-seo-server:4067",
    "status": "PENDING -> RUNNING",
    "dockerId": null
  },
  {
    "task": "karmabox-app-server:756",
    "status": "STOPPED -> STOPPED",
    "dockerId": "851d4a89b534f0e8b411a73474774d8fabb0fb404c68c61e4605ab2172df3a7a"
  },
  {
    "task": "karmabox-app-server:757",
    "status": "RUNNING -> RUNNING",
    "dockerId": "f58260ebf25dec72f48d62ce5c12de7d2cb2e3ddde4add25b62921164a479358"
  },
  {
    "task": "karmabox-sidekiq:772",
    "status": "STOPPED -> STOPPED",
    "dockerId": "0daf4a9f93ed1997d747025c277c0e220305382ea68a6d41b8f90e80c652e418"
  },
  {
    "task": "karmabox-sidekiq:772",
    "status": "STOPPED -> STOPPED",
    "dockerId": "1f929c72c847b413e224e08dcc5bcfa5d512b38b7fcb1ce646387cbdeab0634d"
  },
  {
    "task": "karmabox-sidekiq:772",
    "status": "STOPPED -> STOPPED",
    "dockerId": "589c6ca8e73d15987b76fa1ac1fd90ab3866c2d5dad826a20343b238b8da2d18"
  },
  {
    "task": "karmabox-sidekiq:772",
    "status": "STOPPED -> STOPPED",
    "dockerId": "a207efbeb467372be9c03f9dcada851b3ace1d37fe67eac04766427b89d35d66"
  },
  {
    "task": "karmabox-sidekiq:772",
    "status": "STOPPED -> STOPPED",
    "dockerId": "a5bbaa346e594112f0015e3a9967c3b47ce1f01e486b3a9245c7579a23998434"
  },
  {
    "task": "karmabox-sidekiq:772",
    "status": "STOPPED -> STOPPED",
    "dockerId": "daf986c13ba0fc8302c176540c0a5870a213642367f754e9775a9d367aa01b83"
  },
  {
    "task": "karmabox-sidekiq:773",
    "status": "PENDING -> RUNNING",
    "dockerId": null
  },
  {
    "task": "karmabox-sidekiq:773",
    "status": "STOPPED -> STOPPED",
    "dockerId": "76d70abf2d27f8243f4f7946e1be4287b9b13e40792564eda0959560125336fc"
  },
  {
    "task": "karmabox-sidekiq:773",
    "status": "STOPPED -> STOPPED",
    "dockerId": "b5b4b7618c11409287c448b441aa296a2bcaea266a57276cc1e977044167f8ac"
  },
  {
    "task": "karmabox-sidekiq:773",
    "status": "STOPPED -> STOPPED",
    "dockerId": "eab5c542ab5dd048f36cdf03642dcb00d8a0ec1267d9f32c39613871407f2ce1"
  },
  {
    "task": "karmabox-sidekiq:773",
    "status": "STOPPED -> STOPPED",
    "dockerId": "f35c14266cc915108bbd766be6ff3483888508d020a014491b788be4f9b6d62b"
  }
]

Apparently, all the tasks that have a null dockerId are the ones shown as pending in the UI, and hence no new tasks are being created.

siddhant-mohan avatar May 26 '22 16:05 siddhant-mohan

Hi @siddhant-mohan, thanks for reporting this!

I see you are running your server on Amazon Linux 1. Is there a specific reason not to use AL2? Migrating to AL2 is highly recommended, since AL1 only receives patches for critical CVEs and the latest ECS Agent versions are usually not supported on it.

Going back to the issue you are facing: if there is no dockerId associated with a container, that means the container could not be created in the Docker engine. A task with state "PENDING -> RUNNING" is attempting to transition from PENDING to RUNNING, and in the PENDING state we do not expect it to have a dockerId because it has not yet been successfully created in the Docker engine.
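If it's useful, the same introspection endpoint from your log snippet can list the tasks in that situation; this is just a rough sketch that reuses the v1/tasks fields already shown above (Family, Version, KnownStatus, DesiredStatus, Containers[].DockerId):

# List tasks the agent still knows as PENDING that have no Docker container yet
curl -s http://localhost:51678/v1/tasks \
  | jq '[.Tasks[] | select(.KnownStatus == "PENDING" and .Containers[0].DockerId == null)
         | {task: "\(.Family):\(.Version)", desired: .DesiredStatus}]'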

Can you please try to see if you can reproduce it on AL2 (potentially on the ECS Optimized AMI if it suits your purpose), if there are no blockers to using AL2? If you have to stay on AL1, please run the ECS logs collector following the instructions at https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-logs-collector.html to collect more info, and send the log files to [email protected] so we can troubleshoot.
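In case it helps, the collector is usually just a single script run as root on the affected instance; a minimal sketch follows, assuming the download URL from the awslabs/ecs-logs-collector repo (check the linked docs page for the current instructions):

# Download and run the ECS logs collector on the affected container instance
# (URL assumed from the awslabs/ecs-logs-collector repo; see the docs page above)
curl -O https://raw.githubusercontent.com/awslabs/ecs-logs-collector/master/ecs-logs-collector.sh
sudo bash ./ecs-logs-collector.sh
# Attach the resulting archive (e.g. collect.tgz) to your reply or email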

Just FYI: aws ssm get-parameters --names /aws/service/ecs/optimized-ami/amazon-linux-2/recommended can help you get the latest ECS Optimized AMI if you plan to use it.
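For example, a minimal sketch of extracting the recommended AMI ID from that parameter (assuming the parameter value is the JSON document with an image_id field described in the ECS docs):

# Look up the latest ECS-optimized AL2 AMI recommendation and pull out its AMI ID
# (the image_id field inside the parameter value is an assumption based on the docs)
aws ssm get-parameters \
  --names /aws/service/ecs/optimized-ami/amazon-linux-2/recommended \
  --query 'Parameters[0].Value' \
  --output text | jq -r '.image_id'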

Realmonia avatar Jul 28 '22 21:07 Realmonia

@siddhant-mohan maybe it's connected to https://github.com/aws/containers-roadmap/issues/325

ECS has a known bug that makes tasks wait for all other tasks on the instance to be killed, while it still schedules new tasks onto that instance as PENDING in the meantime. The issue is actually that you have a task (shown in the attached screenshot) that is pending to be stopped. In this situation, any other new tasks scheduled on the same instance are kept PENDING until that task is stopped.
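The introspection endpoint used earlier can surface such a blocking task; a rough check (reusing the fields from the earlier jq query) is to look for tasks the agent still knows as RUNNING but whose desired status is STOPPED:

# Tasks still RUNNING that have been asked to stop; these can hold up new PENDING tasks
curl -s http://localhost:51678/v1/tasks \
  | jq '[.Tasks[] | select(.KnownStatus == "RUNNING" and .DesiredStatus == "STOPPED")
         | {task: "\(.Family):\(.Version)", dockerId: .Containers[0].DockerId}]'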

Alonreznik avatar Aug 15 '22 08:08 Alonreznik

Closing due to inactivity -- please re-open if you are able to follow up on @Realmonia's recommendations and supply more information.

fierlion avatar Aug 24 '22 22:08 fierlion