ECS agent disconnects instances and autoscaling does not work properly after that
Summary
We have a cluster with some GPU instances that normally work as expected, but every now and then instances start disconnecting from the cluster: they are still up in EC2, they just stop reporting anything to the cluster. For example, when the only running instance gets disconnected this way, we see a gap in the resource usage report.
If we connect to the instance, the only container running is the ECS agent; the task assigned to that instance is not running.
The problem is that these instances seem to affect autoscaling: while they are in this state, autoscaling does not start more instances, and we can go an hour without any working instance.
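For context, an instance that is disconnected but still registered can be spotted with something along these lines; a minimal boto3 sketch (the cluster name and region match the ones in the agent log further down, everything else is illustrative):

# Minimal sketch: find container instances whose agent is disconnected
# but which are still registered in the cluster.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")
cluster = "production-ecs-cluster"

# The ECS cluster query language supports filtering on agentConnected.
arns = ecs.list_container_instances(
    cluster=cluster,
    filter="agentConnected == false",
)["containerInstanceArns"]

if arns:
    instances = ecs.describe_container_instances(
        cluster=cluster,
        containerInstances=arns,
    )["containerInstances"]
    for ci in instances:
        # ec2InstanceId lets us confirm the instance is still up in EC2.
        print(ci["ec2InstanceId"], ci["status"], ci["runningTasksCount"])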
We are using the latest ECS agent version and the latest AMI for it.
Looking at the ECS agent logs, we don't have enough information to debug what might be happening:
level=info time=2024-01-30T11:03:13Z msg="End of eligible images for deletion" managedImagesRemaining=1
level=info time=2024-01-30T11:11:11Z msg="TCS Websocket connection closed for a valid reason"
level=info time=2024-01-30T11:11:11Z msg="Using cached DiscoverPollEndpoint" containerInstanceARN="arn:aws:ecs:us-east-1:4875549089326412330:container-instance/production-ecs-cluster/0079405150f5345f4bf1324234gb5h6446bf3bb8273" endpoint="https://ecs-a.us-east-1.amazonaws.com/acs/31/" telemetryEndpoint="https://ecs-t.us-east-1.amazonaws.com/tcs/31/" serviceConnectEndpoint="https://ecs-a.us-east-1.amazonaws.com"
docker info
Client:
 Context: default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc., v0.0.0+unknown)

Server:
 Containers: 3
  Running: 1
  Paused: 0
  Stopped: 2
 Images: 5
 Server Version: 20.10.25
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: amazon-ecs-volume-plugin local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 1e1ea6e986c6c86565bc33d52e34b81b3e2bc71f
 runc version: f19387a6bec4944c770f7668ab51c4348d9c2f38
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.14.334-252.552.amzn2.x86_64
 Operating System: Amazon Linux 2
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 62.23GiB
 Name: ip-10-30-1-250.ec2.internal
 ID: MU72:QWOM:NEUO:JDMS:Y47J:VEY6:IU3F:4CMD:KBBR:ESHN:PFMI:75E3
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
Not sure if the issue is with the ECS agent getting disconnected or if it is something else; the logs don't give us enough information to debug it.
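If it helps anyone hitting the same thing, one possible workaround is to mark the stuck instance as unhealthy so it gets replaced. A rough sketch, assuming the instance is managed by an Auto Scaling group with health check replacement enabled (the instance ID is a placeholder):

# Workaround sketch: force the Auto Scaling group to replace a container
# instance whose ECS agent has disconnected.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.set_instance_health(
    InstanceId="i-0123456789abcdef0",  # placeholder for the stuck instance
    HealthStatus="Unhealthy",
    ShouldRespectGracePeriod=False,
)

This does not explain the disconnects, but it should at least let the group bring up a replacement instead of sitting at capacity with an instance that reports nothing.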