
ECS agent disconnects instances, and autoscaling does not work properly after that

Open xploshioOn opened this issue 1 year ago • 3 comments

Summary

We have a cluster with some GPU instances. They normally work as expected, but every now and then instances disconnect from the cluster while still being up in EC2; they just stop reporting anything to the cluster. For example, when the only running instance gets disconnected this way, we see a gap in the resource usage reporting:

Screenshot 2024-01-30 at 12 26 01

If we connect to the instance, the only Docker container running is the ECS agent; the task assigned to that instance is not running.

These instances also seem to affect autoscaling: while they are in this state, autoscaling doesn't start more instances, and we can go 1 hour without any instance running.
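One way to confirm which instances are in this state is to ask the ECS API which container instances report a disconnected agent. The following is only a sketch, assuming boto3 and working AWS credentials; the helper names `find_disconnected` and `fetch_container_instances` are hypothetical. It relies on the `agentConnected` field that `DescribeContainerInstances` returns for each container instance.

```python
from typing import Any, Dict, List


def find_disconnected(container_instances: List[Dict[str, Any]]) -> List[str]:
    """Return EC2 instance IDs whose ECS agent is reported as disconnected.

    Expects dicts shaped like the `containerInstances` entries returned by
    the ECS DescribeContainerInstances API.
    """
    return [
        ci.get("ec2InstanceId", "unknown")
        for ci in container_instances
        if not ci.get("agentConnected", False)
    ]


def fetch_container_instances(cluster: str) -> List[Dict[str, Any]]:
    """Fetch all container instances in a cluster (needs boto3 and credentials)."""
    import boto3  # imported lazily so find_disconnected works without boto3

    ecs = boto3.client("ecs")
    arns: List[str] = []
    paginator = ecs.get_paginator("list_container_instances")
    for page in paginator.paginate(cluster=cluster):
        arns.extend(page["containerInstanceArns"])

    instances: List[Dict[str, Any]] = []
    # DescribeContainerInstances accepts at most 100 ARNs per call.
    for i in range(0, len(arns), 100):
        resp = ecs.describe_container_instances(
            cluster=cluster, containerInstances=arns[i:i + 100]
        )
        instances.extend(resp["containerInstances"])
    return instances
```

Calling `find_disconnected(fetch_container_instances("production-ecs-cluster"))` on a schedule would at least show when an instance is up in EC2 but no longer connected to the cluster.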

We are using the latest ECS agent version and the latest AMI for it.

Looking at the ECS logs, we don't have enough information to debug what might be happening:

level=info time=2024-01-30T11:03:13Z msg="End of eligible images for deletion" managedImagesRemaining=1
level=info time=2024-01-30T11:11:11Z msg="TCS Websocket connection closed for a valid reason"
level=info time=2024-01-30T11:11:11Z msg="Using cached DiscoverPollEndpoint" containerInstanceARN="arn:aws:ecs:us-east-1:4875549089326412330:container-instance/production-ecs-cluster/0079405150f5345f4bf1324234gb5h6446bf3bb8273" endpoint="https://ecs-a.us-east-1.amazonaws.com/acs/31/" telemetryEndpoint="https://ecs-t.us-east-1.amazonaws.com/tcs/31/" serviceConnectEndpoint="https://ecs-a.us-east-1.amazonaws.com"
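To scan a longer stretch of the agent log for connection events like the one above, the lines can be parsed as key=value pairs. This is a minimal sketch, not an official tool: the function names are hypothetical, and it assumes every line follows the `key=value` or `key="quoted value"` layout seen in these entries.

```python
import shlex
from typing import Dict, Iterable, List


def parse_agent_log_line(line: str) -> Dict[str, str]:
    """Split one key=value agent log line into a dict.

    shlex keeps double-quoted values (msg="...") together as one token.
    A line with unbalanced quotes would raise ValueError.
    """
    fields: Dict[str, str] = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that contain no '='
            fields[key] = value
    return fields


def connection_events(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Keep only entries whose message mentions a websocket connection."""
    parsed = (parse_agent_log_line(line) for line in lines if line.strip())
    return [p for p in parsed if "websocket" in p.get("msg", "").lower()]
```

Feeding the agent log file line by line into `connection_events` gives the timestamps of websocket-related messages, which can then be compared against the gap in the cluster metrics.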

docker info

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc., v0.0.0+unknown)

Server:
 Containers: 3
  Running: 1
  Paused: 0
  Stopped: 2
 Images: 5
 Server Version: 20.10.25
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: amazon-ecs-volume-plugin local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 1e1ea6e986c6c86565bc33d52e34b81b3e2bc71f
 runc version: f19387a6bec4944c770f7668ab51c4348d9c2f38
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.14.334-252.552.amzn2.x86_64
 Operating System: Amazon Linux 2
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 62.23GiB
 Name: ip-10-30-1-250.ec2.internal
 ID: MU72:QWOM:NEUO:JDMS:Y47J:VEY6:IU3F:4CMD:KBBR:ESHN:PFMI:75E3
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Not sure whether the issue is the ECS agent getting disconnected or something else. The logs don't give us enough information to debug.

xploshioOn, Jan 30 '24 12:01