amazon-ecs-agent ECS Tasks report target not connect

Note: This was first posted to aws-cli and directed here. https://github.com/aws/aws-cli/issues/9406#issuecomment-2787059730

Describe the bug I am trying to run an exec command on a task in an ECS cluster and I keep getting TargetNotConnectedException. I have run the exec checker and it looks like everything is set up correctly. I updated SSM (I hope?) through host management in Systems Manager. I'm not sure if this is a bug or if there is some bit of configuration I am missing. I am using the latest ECS-optimized images. I haven't SSH'd to the EC2 instances directly (they have no internet access) but assume the ECS images should have everything. I also refreshed the images and added dnf -y install https://s3.amazonaws.com/session-manager-downloads/plugin/latest/linux_64bit/session-manager-plugin.rpm to the user data just in case it's not in the ECS image, but I still get the same error.

I have searched and found others with the issue, but it's usually something like having AWS keys in environment variables, which I do not have. I'm pasting my output from the exec checker below in case I'm not seeing something.

AWS_REGION=us-east-2 bash <( curl -Ls https://raw.githubusercontent.com/aws-containers/amazon-ecs-exec-checker/main/check-ecs-exec.sh ) cdai-ecs-staging-cluster arn:aws:ecs:us-east-2::task/cdai-ecs-staging-cluster/ --region=us-east-2

Prerequisites for check-ecs-exec.sh v0.7

jq | OK (/usr/bin/jq)
AWS CLI | OK (/usr/bin/aws)

Prerequisites for the AWS CLI to use ECS Exec

AWS CLI Version | OK (aws-cli/2.17.18 Python/3.9.20 Linux/6.1.129-138.220.amzn2023.x86_64 source/x86_64.amzn.2023)
Session Manager Plugin | OK (1.2.707.0)

Checks on ECS task and other resources

Region : us-east-2
Cluster: cdai-ecs-staging-cluster
Task : arn:aws:ecs:us-east-2::task/cdai-ecs-staging-cluster/

Cluster Configuration | Audit Logging Not Configured
Can I ExecuteCommand? | arn:aws:iam:::role/bastion
  ecs:ExecuteCommand: allowed
  ssm:StartSession denied?: allowed
Task Status | RUNNING
Launch Type | EC2
ECS Agent Version | 1.91.1
Exec Enabled for Task | OK
Container-Level Checks |
  ---------- Managed Agent Status ----------
  1. RUNNING for "portal_nextjs"
  ---------- Init Process Enabled (cdai-staging-task:102) ----------
  1. Enabled - "portal_nextjs"
  ---------- Read-Only Root Filesystem (cdai-staging-task:102) ----------
  1. Disabled - "portal_nextjs"
Task Role Permissions | arn:aws:iam:::role/cdai-staging-task-role
  ssmmessages:CreateControlChannel: allowed
  ssmmessages:CreateDataChannel: allowed
  ssmmessages:OpenControlChannel: allowed
  ssmmessages:OpenDataChannel: allowed
VPC Endpoints | Found existing endpoints for vpc-:
  - com.amazonaws.us-east-2.s3
  - com.amazonaws.vpce.us-east-2.vpce-svc-
  - com.amazonaws.us-east-2.secretsmanager
  - com.amazonaws.us-east-2.ssmmessages
Environment Variables | (cdai-staging-task:102)
  1. container "portal_nextjs"
  - AWS_ACCESS_KEY: not defined
  - AWS_ACCESS_KEY_ID: not defined
  - AWS_SECRET_ACCESS_KEY: not defined
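Since the instances have no internet access, the VPC Endpoints row is presumably the path the exec session relies on. Below is a minimal sketch of re-checking for the ssmmessages endpoint outside the checker. The function name `has_ssmmessages_endpoint` and the `VPC_ID` variable are hypothetical, introduced only for illustration; the check only confirms the endpoint exists, not that its security group or DNS settings are correct.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the exec checker): reads VPC endpoint
# service names from stdin, one per line, and succeeds only if the
# ssmmessages service name for the given region is among them.
has_ssmmessages_endpoint() {
  local region="$1"
  # -F: fixed string, -x: match the whole line, -q: no output
  grep -Fxq "com.amazonaws.${region}.ssmmessages"
}

# Example invocation (assumes VPC_ID is set by the caller; kept as a comment
# so that sourcing this file makes no AWS calls):
#   aws ec2 describe-vpc-endpoints \
#     --filters "Name=vpc-id,Values=${VPC_ID}" \
#     --query 'VpcEndpoints[].ServiceName' --output text \
#     | tr '\t' '\n' | has_ssmmessages_endpoint us-east-2 \
#     && echo "ssmmessages endpoint found"
```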

[ccovey@ip-172-31-6-61 ~]$ aws ecs execute-command --cluster cdai-ecs-staging-cluster --task arn:aws:ecs:us-east-2::task/cdai-ecs-staging-cluster/ --container portal_nextjs --command "/bin/sh" --interactive

The Session Manager plugin was installed successfully. Use the AWS CLI to start a session.

An error occurred (TargetNotConnectedException) when calling the ExecuteCommand operation: The execute command failed due to an internal error. Try again later.

I'm not sure if this is a bug or a config issue, but I feel I have followed the steps to configure it properly and the checker reports the same. If you need more info, let me know.
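One way to look at the exec agent independently of the checker is to read the managedAgents field that aws ecs describe-tasks returns for each container. The sketch below wraps that call; the function name `exec_agent_status` is hypothetical, and the cluster and task arguments are whatever the caller passes in (the full task ARN or task ID, not the redacted form shown above).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each container's managed agents and their last
# reported status for a single task. For a task with ECS Exec enabled, an
# agent named ExecuteCommandAgent is expected to appear in the list.
exec_agent_status() {
  local cluster="$1" task="$2"
  aws ecs describe-tasks \
    --cluster "$cluster" \
    --tasks "$task" \
    --query 'tasks[].containers[].{container: name, agents: managedAgents[].{name: name, status: lastStatus}}' \
    --output json
}

# Example invocation (TASK_ARN is an assumed variable holding the real ARN):
#   exec_agent_status cdai-ecs-staging-cluster "$TASK_ARN"
```

A RUNNING status here matches what the checker reports, so this would only narrow things down if the status changes between the check and the execute-command call.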

Regression Issue

Select this option if this issue appears to be a regression.

Expected Behavior Able to execute commands on containers running on ECS.

Current Behavior TargetNotConnectedException

Reproduction Steps Run something like the following command.

aws ecs execute-command --cluster cdai-ecs-staging-cluster --task arn:aws:ecs:us-east-2::task/cdai-ecs-staging-cluster/* --container portal_nextjs --command "/bin/sh" --interactive

This should execute properly but instead I receive the above error.

Possible Solution No response

Additional Information/Context No response

CLI version used aws-cli/2.17.18 Python/3.9.20 Linux/6.1.129-138.220.amzn2023.x86_64 source/x86_64.amzn.2023

Environment details (OS name and version, etc.) Amazon Linux release 2023.6.20250303 (Amazon Linux)

Apr 08 '25 17:04 ccovey

Hi, thanks for opening this issue. Have you tried different AMIs to see if you still get the same issue? Please also double-check both your task role and IAM role permissions.

https://repost.aws/knowledge-center/ecs-error-execute-command

Apr 16 '25 18:04 mye956

Yes, I have tried a couple of different AMIs for the ECS cluster and updated the bastion's template to the latest AMI of that version. The ECS AMI we are using is ami-03b6ec4ad7a4d7ab0, and ami-0d0f28110d16ee7d6 for our bastion server. The task role seems to have the correct permissions, both when I inspect it manually and according to the checker. Are there other permissions outside of the checker that should be set on the task instances?

Apr 16 '25 19:04 ccovey

I see. Have you tried using the ECS Exec checker?

https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-exec-troubleshooting.html

Apr 17 '25 16:04 mye956

Yes, the results from it are included in the first message.

Apr 17 '25 16:04 ccovey

(I know nothing about this feature and just happened to see this as I was browsing)

Your ARNs above are missing the account ID and task ID, but I'm not sure whether it's on purpose to censor it or if it's possibly your issue. I think you've put *s there but as long as you're actually putting in the correct IDs when you run the command, that doesn't matter.
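To rule this out before calling execute-command, the task ARN can be checked for both the account ID and the task ID. This is a minimal sketch, assuming the standard aws partition and the ARN format that includes the cluster name (arn:aws:ecs:region:account-id:task/cluster-name/task-id). The function name `is_full_task_arn` and the example account and task IDs are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if the argument looks like a complete
# ECS task ARN with a 12-digit account ID and a non-empty task ID.
# Assumes the "aws" partition and the ARN format that includes the cluster name.
is_full_task_arn() {
  local arn="$1"
  local re='^arn:aws:ecs:[a-z0-9-]+:[0-9]{12}:task/[A-Za-z0-9_-]+/[A-Za-z0-9]+$'
  printf '%s\n' "$arn" | grep -Eq "$re"
}

# Example (made-up account and task IDs):
#   is_full_task_arn "arn:aws:ecs:us-east-2:123456789012:task/cdai-ecs-staging-cluster/0123456789abcdef0123456789abcdef" \
#     && echo "ARN looks complete"
```

The redacted form posted in this thread, with an empty account ID and no task ID, fails this check, which is the expected result for a censored ARN.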

Apr 17 '25 20:04 ziggythehamster

+1 to double-checking that the command being executed is correct. While you're at it, are you able to check the logs of the SSM agent container and collect agent logs via our log collector script? (Note: agent logs are rotated every 24 hours, so please make sure the issue occurred within the 24 hours before you collect them.) Once collected, feel free to send them over to [email protected].

There also seems to be an issue when a fluentbit sidecar is in use. If you're using fluentbit as the firelens container type, could you try without it and see if you hit the same issue? https://github.com/aws/aws-cli/issues/9070

Apr 21 '25 17:04 mye956

I provided the output in the original message and it has valid output. Yes, the IDs were just removed to post here. I'll work on getting logs sent over today or tomorrow. We are not using any sidecars at the moment.

Apr 21 '25 20:04 ccovey