SSM agent under Fargate using the new ECS Exec feature is crashing
Hey there,
We are trying to set up the new AWS ECS Exec feature with our Fargate services so we can run commands on tasks. We followed the setup article, but the SSM agent goes into "STOPPED" status when the task starts. I was able to check the agent logs inside the container, and this is the error I can see:
user@ip-xx-xxx-xxx-xx:/opt/app$ sudo cat /var/log/amazon/ssm/errors.log
2021-03-24 20:14:23 ERROR [run @ agent.go.104] error occurred when starting amazon-ssm-agent: failed to start message bus, failed to start health channel: failed to listen on the channel: ipc:///var/lib/amazon/ssm/ipc/health, address in use
Also, checking the running processes inside the container, I can see the two SSM processes (amazon-ssm-agent and ssm-agent-worker), and there is no duplicate SSM agent process that might explain the "address in use" error:
root 21 0.0 0.3 1398768 14552 ? Ssl 20:24 0:00 /managed-agents/execute-command/amazon-ssm-agent
root 40 0.0 0.8 1336320 32196 ? Sl 20:24 0:00 /managed-agents/execute-command/ssm-agent-worker
amazon-ssm-agent version: v3.1.36.0
Linux: Debian GNU/Linux 10 (buster)
Any idea why this is happening?
I did some further investigation to see which process is using the file /var/lib/amazon/ssm/ipc/health, and I can see that only one process (PID 21 from above), the amazon-ssm-agent, is accessing that file:
root@ip-xx-xxx-xxx-xxx:/opt/app# lsof /var/lib/amazon/ssm/ipc/health
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
amazon-ss 21 root 10u unix 0xffff8880bd03b000 0t0 23964 /var/lib/amazon/ssm/ipc/health type=STREAM
amazon-ss 21 root 11u unix 0xffff8880c0dd7c00 0t0 21918 /var/lib/amazon/ssm/ipc/health type=STREAM
amazon-ss 21 root 15u unix 0xffff8880bbff2000 0t0 22317 /var/lib/amazon/ssm/ipc/health type=STREAM
amazon-ss 21 root 16u unix 0xffff8880bd03dc00 0t0 24945 /var/lib/amazon/ssm/ipc/health type=STREAM
amazon-ss 21 root 17u unix 0xffff8880bbe52c00 0t0 25002 /var/lib/amazon/ssm/ipc/health type=STREAM
amazon-ss 21 root 18u unix 0xffff8880e7969000 0t0 28820 /var/lib/amazon/ssm/ipc/health type=STREAM
So the error still seems odd and inexplicable at this time, given that only one ssm-agent process is accessing the /var/lib/amazon/ssm/ipc/health file.
Hey there! Isn't this expected behaviour as per the article:
This is made possible by bind-mounting the necessary SSM agent binaries into the container.
So the SSM agent will be mounted onto the container and run. IMO, you don't need both the ECS Exec feature and the AWS SSM agent to run commands at the same time; both will give you access inside the container, so you can choose either option depending on your use case.
If you simply want to debug an app inside the container, you can use the ECS Exec option and it will auto-mount the SSM agent for you. You don't need to install the amazon-ssm-agent again inside the container.
If you just want to run an SSM managed session inside the container or run commands, you can manually install the amazon-ssm-agent into the image and not have to enable the ECS Exec feature for the task.
Hope this helps! If not, maybe you can explain a bit more about your use case.
@dsouzajude as you noted, the ecs exec command uses the SSM agent. I hit this bug multiple times a week - probably half my containers have this issue. I do not install the SSM agent manually. All I want to do is use ecs exec.
We hit the same issue here, using Fargate and not installing the SSM agent. The workaround, which does not always work, is killing the task so the ECS service launches a new task. Sometimes it works, sometimes it does not.
Same issue here. https://github.com/aws-containers/amazon-ecs-exec-checker reports that everything is set up correctly, but attempts to connect crash the agent.
@dsouzajude We are not installing any additional amazon-ssm-agent inside the container; we followed the official AWS article step by step to set up ECS Exec in our Fargate environment. The amazon-ssm-agent shown in the output I attached above comes from enabling ECS Exec on the Fargate task.
Sorry, I closed the issue by mistake! This is still happening and preventing us from widely adopting the ECS Exec feature with Fargate.
I am facing this issue too... any resolution on this?
Works for me now! It turned out to be my image's behaviour.
On my project we are seeing the same issue, where sometimes containers cannot be 'ECS Exec'ed into, and aws ecs describe-tasks shows ExecuteCommandAgent is STOPPED rather than RUNNING. We can stop those containers, and they are replaced with new ones, and ECS Exec then generally works, but it's not clear why the agent is stopping. Is there some way to at least have a container where the agent is stopped fail health checks and get replaced?
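In case it helps anyone watching for this, the managed agent status can be polled from outside the task with the AWS CLI; the cluster name and task ID below are placeholders for your own values:
# List the managed agents for each container (ECS Exec shows up as ExecuteCommandAgent)
aws ecs describe-tasks --cluster my-cluster --tasks <task-id> \
  --query 'tasks[].containers[].managedAgents[]' --output table
A lastStatus of STOPPED there matches what the console shows. As far as I know the managed agent status isn't tied into container health checks, so scripting this check and recycling the task when it reports STOPPED is the closest thing to an automatic replacement right now.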
Any update on this? The issue is still biting us; most of our Fargate containers fail to accept remote execute commands from ecs exec, and the SSM agent shows as STOPPED in all the affected Fargate tasks. As mentioned by many people, the issue happens randomly.
I even opened a support ticket through our AWS support tier in the past; they confirmed the issue and said the ECS+SSM technical team rolled out a fix, but it seems that didn't resolve it. Their explanation of the issue:
The issue is that the Fargate agent lost the SSM agent status because the process somehow lost the UUID set by containerd. The SSM agent is actually running inside the container. When the Fargate agent retries to start the SSM agent process, since the SSM agent is already running, it can't be started again, and it throws the "address in use" error.
The confirmation of the fix they rolled out:
I see that the ECS+SSM team had rolled out an update for the fix and the date of completion for this deployment was between 2nd - 5th August 2021.
This issue should be prioritized, as I believe many Fargate customers are impacted if they use the ecs exec feature!
A similar ticket was also reported on the AWS Forum: https://forums.aws.amazon.com/message.jspa?messageID=980336
I had this issue a lot when ECS Exec first launched, and it did seem to get fixed, but it now seems to have completely regressed. About a week ago I couldn't log in to any container as the agent had stopped on all of them. I kept launching extra containers and it didn't work until the 4th.
According to AWS support, a workaround is to use "ssm start-session" instead of ECS Exec. It essentially seems to do the same thing but it should work even when ECS Exec is failing.
The trick is in using a target parameter that is in the format ecs:clustername_taskid_containerruntimeid
Then you can run something like this to run a command on the container
aws ssm start-session --target ecs:clustername_taskid_containerruntimeid --document-name AWS-StartInteractiveCommand --parameters '{"command":["whatever command"]}'
Or to get an interactive session to the container, just
aws ssm start-session --target ecs:clustername_taskid_containerruntimeid
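For what it's worth, the container runtime ID that the target needs isn't shown in the console task list, but it can be pulled from describe-tasks; the cluster and task values here are placeholders:
# runtimeId of the first container in the task
aws ecs describe-tasks --cluster clustername --tasks taskid \
  --query 'tasks[0].containers[0].runtimeId' --output text
The target then becomes ecs:clustername_taskid_<that runtimeId>.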
This has gotten significantly worse for me in the past few weeks. Not sure what changed, but feels like a regression.
Having the same problem as above (ECS Fargate). The odd thing is that it works on some tasks and not on others. We didn't have this issue before; I believe it started around early November for us.
@mtommila The workaround is working, but it has limitations compared to ECS Exec. One of them is that you lose the default integration with an S3 bucket for logs and audit, and it also seems that aws ssm start-session doesn't support granular IAM permissions to limit access, whereas ECS Exec supports more policy condition keys.
Having this issue myself, the agent is just STOPPED, no way to get in. =/
Would love to see a fix.
Having the same issue. We use CloudFormation to deploy multiple ECS Fargate microservices and used https://github.com/aws-containers/amazon-ecs-exec-checker to verify our setup. Sometimes a task will show "running" for "Managed Agent Status" across all containers in the task; sometimes a task will have 1 or 2 containers as "stopped" and the remaining container as "running". "Managed Agent Status" will sometimes be "stopped" on init (fresh deployment) and other times after some period of time. Since we're using CFN for deployments, the inconsistencies in the "Managed Agent Status" are confusing. We'll see the "Managed Agent Status" crash, and sometimes run, on a Corretto container, and the same for DataDog and Fluent-Bit containers, i.e. sometimes the agent works for days, sometimes it borks nearly right away.
Follow-up: aws ssm start-session works in every case, even when for whatever reason the "Managed Agent Status" crashes on a container.
Having the same issue. We used https://github.com/aws-containers/amazon-ecs-exec-checker to verify our setup. Most of the time a task will show "stopped" for "Managed Agent Status" on a few containers in the task. Can this be prioritized?
I am having the same issue. ECS Exec was previously working reliably, as I had set up my task execution role with these permissions: { "Effect": "Allow", "Action": [ "ssmmessages:CreateControlChannel", "ssmmessages:CreateDataChannel", "ssmmessages:OpenControlChannel", "ssmmessages:OpenDataChannel" ], "Resource": "*" }
Then it stopped working after the SSM upgrade (which we do not control). After some experimentation, I was able to get exec to work again by granting my application's AWS user account these same permissions. Why? Well, my container has AWS credentials injected into it via an environment file (as supported by task definitions). SSM seems to have changed to authenticate using whatever AWS credentials exist in the container's environment. This is wrong; it should be using the exec role, I believe. Not sure why this changed.
I believe the SSM security behavior should be reverted to use whatever security context the task execution role provides, so that we do not have to grant our application user these permissions and can continue to use environment variables for our application credentials.
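For anyone comparing setups: the ECS Exec documentation describes granting these ssmmessages permissions on the task IAM role rather than the execution role. A rough sketch of attaching them as an inline policy follows; the role name, policy name, and file name are placeholders, not anything from this thread.
ssmmessages-policy.json:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ssmmessages:CreateControlChannel",
        "ssmmessages:CreateDataChannel",
        "ssmmessages:OpenControlChannel",
        "ssmmessages:OpenDataChannel"
      ],
      "Resource": "*"
    }
  ]
}
# Attach the statement to the task role referenced by the task definition
aws iam put-role-policy \
  --role-name my-task-role \
  --policy-name ecs-exec-ssmmessages \
  --policy-document file://ssmmessages-policy.json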
This frustrated me so much, and happened so often, that I wrote a script to fall back to ssm start-session, which works every single time.
I prefix my resources with the environment type (e.g. prod, beta). Feel free to remove the $env variable if you don't use a prefix for your resources; it may take a little massaging to work.
This also allows you to pass in an offset, so if a task is having a problem you can just increment the offset variable and get another task.
https://gist.github.com/nscott/169bbf6a10f4c4fbd6194b3cdc5707b7
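A minimal sketch of that kind of fallback (not the gist itself; the cluster/service arguments, the lack of an env prefix, and the single-container assumption are mine) might look like this:
#!/usr/bin/env bash
# Fall back to "aws ssm start-session" when ECS Exec won't connect.
# Usage: ./ssm-fallback.sh <cluster> <service> [offset]
set -euo pipefail

CLUSTER="$1"
SERVICE="$2"
OFFSET="${3:-0}"   # bump this to pick a different task of the service

# Pick the task at position OFFSET in the service's running-task list
TASK_ARN=$(aws ecs list-tasks --cluster "$CLUSTER" --service-name "$SERVICE" \
  --query "taskArns[$OFFSET]" --output text)
TASK_ID="${TASK_ARN##*/}"

# The SSM target needs the container runtime ID from describe-tasks
RUNTIME_ID=$(aws ecs describe-tasks --cluster "$CLUSTER" --tasks "$TASK_ARN" \
  --query 'tasks[0].containers[0].runtimeId' --output text)

# Target format: ecs:<cluster>_<task-id>_<container-runtime-id>
aws ssm start-session --target "ecs:${CLUSTER}_${TASK_ID}_${RUNTIME_ID}"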
For anyone having issues, have you tried using https://github.com/tedsmitt/ecsgo? I'm curious whether it's something in the AWS CLI or the actual API.
If it doesn't work in ecsgo, I'm sure it wouldn't be hard to add that fallback feature.
I haven't tried it, but it's not something in the API either; it's the service in the container crashing. In my container start script I have a fix_ssm function.
For a while I tried to explicitly kill it and restart it. I have it set just to log at this point, and I'll probably turn off the logging since it's so intermittent.
function fix_ssm() {
  echo "Trying to fix SSM"
  # Show who is holding the IPC socket and what is running
  lsof /var/lib/amazon/ssm/ipc/health
  ps aux
  # Kill the stuck agent, remove the stale socket, and relaunch the agent
  PID_TO_KILL=$(pidof /managed-agents/execute-command/amazon-ssm-agent)
  echo "Killing SSM agent PID $PID_TO_KILL"
  kill -9 $PID_TO_KILL
  rm -rf /var/lib/amazon/ssm/ipc/health
  echo "Relaunching SSM agent"
  /managed-agents/execute-command/amazon-ssm-agent &
}
# https://stackoverflow.com/questions/65218749/unable-to-start-the-amazon-ssm-agent-failed-to-start-message-bus
# https://forums.aws.amazon.com/message.jspa?messageID=981199#981199
# https://github.com/aws/amazon-ssm-agent/issues/361
# Use || true to always allow the command to succeed, even on development containers
# The SSM agent will be installed automatically on AWS ECS Fargate
(sleep 30 && (fix_ssm || true)) &
echo "Tailing SSM agent logs"
tail -f /var/log/amazon/ssm/amazon-ssm-agent.log &
lsof /var/lib/amazon/ssm/ipc/health || true
It also makes me furious that my forum post was archived with no way to view it even as an archive.
The agent often dies/is dead. There's a bunch of logging output I've captured in the past, but at this point I've just given up and accept that it's not going to work 100% of the time. The ssm start-session fallback isn't as good since it doesn't drop you into the expected path, I don't know if the audit trail is the same, etc.
Another piece of feature creep in AWS that's very helpful but won't be supported correctly.
Interestingly, looking at the SSM agent releases, the version pushed by Fargate at the moment is 3.1.1260.0 (2022-04-12); the next version has what seem to be a few bug fixes for initialization. https://github.com/aws/amazon-ssm-agent/releases
Thanks @jtsinnott for reaching out to us. The issue that you mentioned is resolved, and further information about it can be found at this link: https://github.com/aws/amazon-ssm-agent/issues/435
@VishnuKarthikRavindran How often does the ECS team upgrade the SSM agent in Fargate?
Add a 👍 to https://github.com/aws/containers-roadmap/issues/1756 if you want the SSM version bumped.
aws ssm start-session --target ecs:clustername_taskid_containerruntimeid
This throws an error when calling the StartSession operation:
My command is aws ssm start-session --target ecs:UltimaF_838d773b17954bcfbbacf343fb4fea70_838d773b17954bcfbbacf343fb4fea70-2587323273
which follows the ecs:clustername_taskid_containerruntimeid format.
Any help/hints would be appreciated!
aws ssm start-session --target ecs:clustername_taskid_containerruntimeid --document-name AWS-StartInteractiveCommand --parameters '{"command":["whatever command"]}'
Hi! I'm facing the same issue too; however, this command throws an error saying the target is not connected. Did you do something to register the instance? My cluster is running in a private subnet and the exec checker reports all green checks.
The logic of building up the instance ID as you describe makes sense to me; on every execute-command run I got an entry in Session History with this structure for the instance ID, but it always got terminated within 3 seconds.