SSM agent under Fargate using the new ECS Exec feature is crashing
Hey there,
We are trying to set up the new AWS ECS Exec feature with our Fargate services so we can run commands on tasks. We followed the setup article, but the SSM agent goes into "STOPPED" status when the task starts. I was able to check the agent logs inside the container, and this is the error I can see:
user@ip-xx-xxx-xxx-xx:/opt/app$ sudo cat /var/log/amazon/ssm/errors.log
2021-03-24 20:14:23 ERROR [run @ agent.go.104] error occurred when starting amazon-ssm-agent: failed to start message bus, failed to start health channel: failed to listen on the channel: ipc:///var/lib/amazon/ssm/ipc/health, address in use
Also, checking the running processes inside the container, I can see the two SSM processes (amazon-ssm-agent and ssm-agent-worker), and there is no duplicate SSM agent process that might explain the "address in use" error:
root 21 0.0 0.3 1398768 14552 ? Ssl 20:24 0:00 /managed-agents/execute-command/amazon-ssm-agent
root 40 0.0 0.8 1336320 32196 ? Sl 20:24 0:00 /managed-agents/execute-command/ssm-agent-worker
amazon-ssm-agent version: v3.1.36.0
Linux: Debian GNU/Linux 10 (buster)
Any idea why this is happening?
I did some further investigation to see which process is using the file /var/lib/amazon/ssm/ipc/health, and I can see that only one process (PID 21 from above), the amazon-ssm-agent, is accessing that file:
root@ip-xx-xxx-xxx-xxx:/opt/app# lsof /var/lib/amazon/ssm/ipc/health
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
amazon-ss 21 root 10u unix 0xffff8880bd03b000 0t0 23964 /var/lib/amazon/ssm/ipc/health type=STREAM
amazon-ss 21 root 11u unix 0xffff8880c0dd7c00 0t0 21918 /var/lib/amazon/ssm/ipc/health type=STREAM
amazon-ss 21 root 15u unix 0xffff8880bbff2000 0t0 22317 /var/lib/amazon/ssm/ipc/health type=STREAM
amazon-ss 21 root 16u unix 0xffff8880bd03dc00 0t0 24945 /var/lib/amazon/ssm/ipc/health type=STREAM
amazon-ss 21 root 17u unix 0xffff8880bbe52c00 0t0 25002 /var/lib/amazon/ssm/ipc/health type=STREAM
amazon-ss 21 root 18u unix 0xffff8880e7969000 0t0 28820 /var/lib/amazon/ssm/ipc/health type=STREAM
So the error still seems odd and inexplicable at this time, given that only one ssm-agent process is accessing the /var/lib/amazon/ssm/ipc/health file.
Hey there! Isn't this expected behaviour as per the article:
This is made possible by bind-mounting the necessary SSM agent binaries into the container.
So the SSM agent will be mounted onto the container and run. IMO, you don't need both the ECS Exec feature and the AWS SSM agent to run commands at the same time; both will give you access inside the container, so you can choose either option depending on your use case.
If you simply want to debug an app inside the container, you can use the ECS Exec option and it will auto-mount the SSM agent for you. You don't need to install the amazon-ssm-agent again inside the container.
If you just want to run an SSM managed session inside the container or run commands, you can manually install the amazon-ssm-agent into the image and not have to enable the ECS Exec feature for the task.
Hope this helps! If not, maybe you can explain a bit more about your use case.
@dsouzajude as you noted, the ecs exec command uses the SSM agent. I hit this bug multiple times a week - probably half my containers have this issue. I do not install the SSM agent manually. All I want to do is use ecs exec.
We hit the same issue here, using Fargate and not installing the SSM agent. The workaround, which does not always work, is killing the task so the ECS service launches a new task. Sometimes it works, sometimes it does not.
Same issue here. https://github.com/aws-containers/amazon-ecs-exec-checker reports that everything is set up correctly, but attempts to connect crash the agent.
@dsouzajude We are not installing any additional amazon-ssm-agent inside the container; we followed the official AWS article step by step to set up ECS Exec in our Fargate environment. The amazon-ssm-agent shown in the output I attached above comes from enabling ECS Exec on the Fargate task.
Sorry, I closed the issue by mistake! This is still happening and preventing us from widely adopting the ECS Exec feature with Fargate.
I am facing this issue too... any resolution on this?
Works for me now! It turned out to be my image's behaviour.
On my project we are seeing the same issue, where sometimes containers cannot be 'ECS Exec'ed into, and aws ecs describe-tasks shows ExecuteCommandAgent is STOPPED rather than RUNNING. We can stop those containers, and they are replaced with new ones, and ECS Exec then generally works, but it's not clear why the agent is stopping. Is there some way to at least have a container where the agent is stopped fail health checks and get replaced?
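In case it helps anyone watching for this, the managed agent status can be polled from outside the task with the AWS CLI; the cluster name and task ID below are placeholders for your own values:
# List the managed agents for each container (ECS Exec shows up as ExecuteCommandAgent)
aws ecs describe-tasks --cluster my-cluster --tasks <task-id> \
  --query 'tasks[].containers[].managedAgents[]' --output table
A lastStatus of STOPPED there matches what the console shows. As far as I know the managed agent status isn't tied into container health checks, so scripting this check and recycling the task when it reports STOPPED is the closest thing to an automatic replacement right now.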
Any update on this? The issue is still biting us; most of our Fargate containers fail to accept remote execute commands from ecs exec, and the SSM agent shows as STOPPED in all the affected Fargate tasks. As mentioned by many people, the issue happens randomly.
I even opened a support ticket through our AWS support tier in the past; they confirmed the issue and said the ECS+SSM technical team rolled out a fix, but it seems that didn't resolve it. Their explanation of the issue:
The issue is that the Fargate agent lost the SSM agent status because the process somehow lost the UUID set by containerd. The SSM agent is actually running inside the container. When the Fargate agent retries to start the SSM agent process, since the SSM agent is already running, it can't be started again, and it throws the "address in use" error.
The confirmation of the fix they rolled out:
I see that the ECS+SSM team had rolled out an update for the fix and the date of completion for this deployment was between 2nd - 5th August 2021.
This issue should be prioritized, as I believe many Fargate customers are impacted if they use the ecs exec feature!
A similar ticket was also reported on the AWS Forum: https://forums.aws.amazon.com/message.jspa?messageID=980336
I had this issue a lot when ECS Exec first launched, and it did seem to get fixed, but it now seems to have completely regressed. About a week ago I couldn't log in to any container as the agent had stopped on all of them. I kept launching extra containers and it didn't work until the 4th.
According to AWS support, a workaround is to use "ssm start-session" instead of ECS Exec. It essentially seems to do the same thing but it should work even when ECS Exec is failing.
The trick is in using a target parameter that is in the format ecs:clustername_taskid_containerruntimeid
Then you can run something like this to run a command on the container
aws ssm start-session --target ecs:clustername_taskid_containerruntimeid --document-name AWS-StartInteractiveCommand --parameters '{"command":["whatever command"]}'
Or to get an interactive session to the container, just
aws ssm start-session --target ecs:clustername_taskid_containerruntimeid
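For what it's worth, the container runtime ID that the target needs isn't shown in the console task list, but it can be pulled from describe-tasks; the cluster and task values here are placeholders:
# runtimeId of the first container in the task
aws ecs describe-tasks --cluster clustername --tasks taskid \
  --query 'tasks[0].containers[0].runtimeId' --output text
The target then becomes ecs:clustername_taskid_<that runtimeId>.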
This has gotten significantly worse for me in the past few weeks. Not sure what changed, but feels like a regression.
Having the same problem as above (ECS Fargate). The odd thing is that it works on some tasks and not on others. We didn't have this issue before; I believe it started around early November for us.
@mtommila The workaround is working, but it has limitations compared to ECS Exec. One of them is that you lose the default integration with an S3 bucket for logs and audit, and it also seems that aws ssm start-session doesn't support granular IAM permissions to limit access, whereas ECS Exec supports more policy condition keys.
Having this issue myself, the agent is just STOPPED, no way to get in. =/
Would love to see a fix.
Having the same issue. We use CloudFormation to deploy multiple ECS Fargate microservices and used https://github.com/aws-containers/amazon-ecs-exec-checker to verify our setup. Sometimes a task will show "running" for "Managed Agent Status" across all containers in the task; sometimes a task will have 1 or 2 containers as "stopped" and the remaining container as "running". "Managed Agent Status" will sometimes be "stopped" on init (fresh deployment) and other times after some period of time. Since we're using CFN for deployments, the inconsistencies in the "Managed Agent Status" are confusing. We'll see the "Managed Agent Status" crash, and sometimes run, on a Corretto container, and the same for DataDog and Fluent-Bit containers, i.e. sometimes the agent works for days, sometimes it borks nearly right away.
Follow-up: aws ssm start-session works in every case, even when for whatever reason the "Managed Agent Status" crashes on a container.
Having the same issue. We used https://github.com/aws-containers/amazon-ecs-exec-checker to verify our setup. Most of the time a task will show "stopped" for "Managed Agent Status" on a few containers in the task. Can this be prioritized?
I am having the same issue. ECS Exec was previously working reliably, as I had set up my task execution role with these permissions: { "Effect": "Allow", "Action": [ "ssmmessages:CreateControlChannel", "ssmmessages:CreateDataChannel", "ssmmessages:OpenControlChannel", "ssmmessages:OpenDataChannel" ], "Resource": "*" }
Then it stopped working after the SSM upgrade (which we do not control). After some experimentation, I was able to get exec to work again by granting my application's AWS user account these same permissions. Why? Well, my container has AWS credentials injected into it via an environment file (as supported by task definitions). SSM seems to have changed to authenticate using whatever AWS credentials exist in the container's environment. This is wrong; it should be using the exec role, I believe. Not sure why this changed.
I believe the SSM security behavior should be reverted to use whatever security context the task execution role provides, so that we do not have to grant our application user these permissions and can continue to use environment variables for our application credentials.
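For anyone comparing setups: the ECS Exec documentation describes granting these ssmmessages permissions on the task IAM role rather than the execution role. A rough sketch of attaching them as an inline policy follows; the role name, policy name, and file name are placeholders, not anything from this thread.
ssmmessages-policy.json:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ssmmessages:CreateControlChannel",
        "ssmmessages:CreateDataChannel",
        "ssmmessages:OpenControlChannel",
        "ssmmessages:OpenDataChannel"
      ],
      "Resource": "*"
    }
  ]
}
# Attach the statement to the task role referenced by the task definition
aws iam put-role-policy \
  --role-name my-task-role \
  --policy-name ecs-exec-ssmmessages \
  --policy-document file://ssmmessages-policy.json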
This frustrated me so much, and happened so often, that I wrote a script to fall back to ssm start-session, which works every single time.
I prefix my resources with the environment type (e.g. prod, beta). Feel free to remove the $env variable if you don't use a prefix for your resources; it may take a little massaging to work.
This also allows you to pass in an offset, so if a task is having a problem you can just increment the offset variable and get another task.
https://gist.github.com/nscott/169bbf6a10f4c4fbd6194b3cdc5707b7
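A minimal sketch of that kind of fallback (not the gist itself; the cluster/service arguments, the lack of an env prefix, and the single-container assumption are mine) might look like this:
#!/usr/bin/env bash
# Fall back to "aws ssm start-session" when ECS Exec won't connect.
# Usage: ./ssm-fallback.sh <cluster> <service> [offset]
set -euo pipefail

CLUSTER="$1"
SERVICE="$2"
OFFSET="${3:-0}"   # bump this to pick a different task of the service

# Pick the task at position OFFSET in the service's running-task list
TASK_ARN=$(aws ecs list-tasks --cluster "$CLUSTER" --service-name "$SERVICE" \
  --query "taskArns[$OFFSET]" --output text)
TASK_ID="${TASK_ARN##*/}"

# The SSM target needs the container runtime ID from describe-tasks
RUNTIME_ID=$(aws ecs describe-tasks --cluster "$CLUSTER" --tasks "$TASK_ARN" \
  --query 'tasks[0].containers[0].runtimeId' --output text)

# Target format: ecs:<cluster>_<task-id>_<container-runtime-id>
aws ssm start-session --target "ecs:${CLUSTER}_${TASK_ID}_${RUNTIME_ID}"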
For anyone having issues, have you tried using https://github.com/tedsmitt/ecsgo? I'm curious whether it's something in the AWS CLI or the actual API.
If it doesn't work in ecsgo, I'm sure it wouldn't be hard to add that fallback feature.
I haven't tried it, but it's not something in the API either; it's the service in the container crashing. In my container start script I have a fix_ssm function.
For a while I tried to explicitly kill it and restart it. I have it set just to log at this point, and I'll probably turn off the logging since it's so intermittent.
function fix_ssm() {
  echo "Trying to fix SSM"
  # Show who is holding the IPC socket and what is running
  lsof /var/lib/amazon/ssm/ipc/health
  ps aux
  # Kill the stuck agent, remove the stale socket, and relaunch the agent
  PID_TO_KILL=$(pidof /managed-agents/execute-command/amazon-ssm-agent)
  echo "Killing SSM agent PID $PID_TO_KILL"
  kill -9 $PID_TO_KILL
  rm -rf /var/lib/amazon/ssm/ipc/health
  echo "Relaunching SSM agent"
  /managed-agents/execute-command/amazon-ssm-agent &
}
# https://stackoverflow.com/questions/65218749/unable-to-start-the-amazon-ssm-agent-failed-to-start-message-bus
# https://forums.aws.amazon.com/message.jspa?messageID=981199#981199
# https://github.com/aws/amazon-ssm-agent/issues/361
# Use || true to always allow the command to succeed, even on development containers
# The SSM agent will be installed automatically on AWS ECS Fargate
(sleep 30 && (fix_ssm || true)) &
echo "Tailing SSM agent logs"
tail -f /var/log/amazon/ssm/amazon-ssm-agent.log &
lsof /var/lib/amazon/ssm/ipc/health || true
It also makes me furious that my forum post was archived with no way to view it even as an archive.
The agent often dies/is dead. There's a bunch of logging output I've captured in the past, but at this point I've just given up and accept that it's not going to work 100% of the time. The ssm start-session fallback isn't as good since it doesn't drop you into the expected path, I don't know if the audit trail is the same, etc.
Another piece of feature creep in AWS that's very helpful but won't be supported correctly.
Interestingly, looking at the SSM agent releases, the version pushed by Fargate at the moment is 3.1.1260.0 (2022-04-12); the next version has what seem to be a few bug fixes for initialization. https://github.com/aws/amazon-ssm-agent/releases
Thanks @jtsinnott for reaching out to us. The issue that you mentioned is resolved, and further information about it can be found at this link: https://github.com/aws/amazon-ssm-agent/issues/435
@VishnuKarthikRavindran How often does the ECS team upgrade the SSM agent in Fargate?
Add a 👍 to https://github.com/aws/containers-roadmap/issues/1756 if you want the SSM version bumped.
aws ssm start-session --target ecs:clustername_taskid_containerruntimeid
This throws an error when calling the StartSession operation:
My command is aws ssm start-session --target ecs:UltimaF_838d773b17954bcfbbacf343fb4fea70_838d773b17954bcfbbacf343fb4fea70-2587323273
which follows the ecs:clustername_taskid_containerruntimeid format.
Any help/hints would be appreciated!
aws ssm start-session --target ecs:clustername_taskid_containerruntimeid --document-name AWS-StartInteractiveCommand --parameters '{"command":["whatever command"]}'
Hi! I'm facing the same issue too; however, this command throws an error saying the target is not connected. Did you do something to register the instance? My cluster is running in a private subnet and the exec checker reports all green checks.
The logic of building up the instance ID as you describe makes sense to me; on every execute-command run I got an entry in Session History with this structure for the instance ID, but it always got terminated within 3 seconds.