
Session Manager start-session hangs when root partition full

Open iancward opened this issue 6 years ago • 23 comments

I don't know if this is the appropriate place for this, but when I attempt to start a session (either via the AWS Console or via the CLI's session-manager-plugin) on an EC2 instance with a full root partition, it just hangs. In the console I get a blank screen for a long time and then eventually a blinking cursor that doesn't work.

In the CLI (via session-manager-plugin), I get a message that it's starting a session but then it just hangs. The CLI/plugin doesn't respond to Ctrl+C or Ctrl+D; in fact, I have to start a new terminal on my workstation and kill the CLI command and the plugin command.
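
For anyone else stuck in this state, a rough way to clean up the hung processes from a second terminal (the process names below are what they typically look like on a workstation; adjust the patterns to your setup):

    # From a second terminal: find the hung CLI and plugin processes, then kill them.
    pgrep -af "session-manager-plugin"
    pgrep -af "aws ssm start-session"
    # Kill the plugin first, then the wrapping CLI command.
    pkill -f "session-manager-plugin"
    pkill -f "aws ssm start-session"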

iancward avatar Mar 12 '19 19:03 iancward

Thanks for reaching out to us. We will investigate this.

nitikagoyal87 avatar Apr 08 '19 16:04 nitikagoyal87

Was this ever resolved? I'm having a similar problem where I just see a black screen in the console; if I click, I get a cursor that doesn't do anything.

Millerborn avatar Sep 13 '19 14:09 Millerborn

I am also having this issue

TajMahPaul avatar Oct 15 '19 19:10 TajMahPaul

Hi, I'm experiencing this issue as well. Are there any updates on whether it is going to be fixed? It would be great to at least get a descriptive error message if the connection to instances with no space is impossible. Right now the AWS Console / CLI just hangs without any visible reason.

IrinaTerlizhenko avatar Dec 13 '19 13:12 IrinaTerlizhenko

+1. Using the latest Ubuntu 18 AMI with the SSM agent installed and running, and the necessary SSM/CloudWatch policies attached to the role. The weirdest thing is that it happens on some instances and not on others. Seems like a bug.

xacaxulu avatar Jan 10 '20 18:01 xacaxulu

Unfortunately this is still happening. AWS, you should really do something here.

iniinikoski avatar Mar 06 '20 08:03 iniinikoski

Hello,

We experienced the same issue on Red Hat 7.7: we couldn't reach the instance through Session Manager once the /var partition was full.

On the other hand, we observed different behavior when the /var/log partition was full: the machine was still reachable. In any case, when we rely on Session Manager for remote access to a server, we would expect to keep access to the EC2 instance even when a filesystem is full.

cholletjo avatar Jun 30 '20 14:06 cholletjo

In all cases where Session Manager has not been able to open a connection due to a full disk, I've been able to use SSH to get access. We would like to remove SSH and switch solely to Session Manager, but that doesn't seem possible with longstanding issues like this, so we are keeping SSH as a backup, where we can open the ports and distribute the pem key as needed during emergencies.

rossmckelvie avatar Aug 04 '20 15:08 rossmckelvie

This is a bit of a circular problem given that the way you're supposed to check the available disk space on a volume is by opening a terminal session in the instance. 🤦‍♀️ https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-describing-volumes.html

crawforde avatar Sep 03 '20 17:09 crawforde

Not sure if it's totally related, but I'm running into an issue with the start-session command hanging when the target instance is offline. As @iancward mentioned, Ctrl+C etc. does not exit; closing and re-opening the terminal is necessary.

I haven't had the opportunity to really dig into it, but I took a quick look at the code for the CLI, and found this:

https://github.com/aws/aws-cli/blob/master/awscli/customizations/sessionmanager.py

        try:
            # ignore_user_entered_signals ignores these signals
            # because if signals which kills the process are not
            # captured would kill the foreground process but not the
            # background one. Capturing these would prevents process
            # from getting killed and these signals are input to plugin
            # and handling in there
            with ignore_user_entered_signals():
                # call executable with necessary input
                check_call(["session-manager-plugin",
                            json.dumps(response),
                            region_name,
                            "StartSession",
                            profile_name,
                            json.dumps(parameters),
                            endpoint_url])
            return 0

Looks like the terminate signals are being swallowed intentionally? I'm not totally sure, but I reckon this ties into things 😄
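
For what it's worth, you can reproduce the "Ctrl+C does nothing" symptom in isolation with a toy example (this is just an illustration of ignored keyboard signals, not the CLI's actual code):

    # A shell that ignores the keyboard-generated signals, then runs a child.
    # The child inherits the ignored dispositions, so Ctrl+C / Ctrl+\ / Ctrl+Z
    # do nothing; you have to kill it from another terminal, e.g. pkill -f 'sleep 300'.
    bash -c "trap '' INT QUIT TSTP; sleep 300"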

jackdcasey avatar Oct 29 '20 02:10 jackdcasey

Any updates on resolving this issue? I've run into the same problem.

twhetzel avatar Oct 06 '21 07:10 twhetzel

Any updates please?

gaalandr avatar Oct 08 '21 10:10 gaalandr

I've heard I should try to increase the volume size, but it's not clear if this will delete all data on the disk.

twhetzel avatar Oct 09 '21 02:10 twhetzel

I've heard I should try to increase the volume size, but it's not clear if this will delete all data on the disk.

Hi @twhetzel, increasing the volume size should not delete the data. However, if you increase the EBS volume size, you will need to access the instance and run a few commands to extend the file system so that it uses the extra capacity. See the links below for Linux and Windows instances:

Windows: https://aws.amazon.com/premiumsupport/knowledge-center/expand-ebs-root-volume-windows/ or https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/recognize-expanded-volume-windows.html

Linux: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/recognize-expanded-volume-linux.html
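
For reference, on Linux the commands usually boil down to something like the sketch below (assuming the root filesystem lives on partition 1 of /dev/xvda; device names vary, so check lsblk first, and growpart comes from the cloud-utils / cloud-guest-utils package):

    lsblk                        # confirm the volume now shows the larger size
    sudo growpart /dev/xvda 1    # grow partition 1 to fill the resized volume
    sudo resize2fs /dev/xvda1    # for an ext4 root filesystem
    # sudo xfs_growfs /          # use this instead if the root filesystem is XFS
    df -h /                      # verify the extra capacity is available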

IdrisAbdul-Hussein avatar Oct 28 '21 10:10 IdrisAbdul-Hussein

This is a very important issue. I had an EC2 instance that I used SSM to connect to. It had no SSH keys and was located in a private subnet. It ran out of space and got essentially "bricked" because SSM stopped working. SSM is a critical connectivity tool, and an instance becoming inaccessible for no good reason is a huge risk.

asyschikov avatar Mar 18 '22 13:03 asyschikov

It is an important operational requirement to be able to log in to an instance whose root partition is full. If SSM Session Manager cannot handle this, SSH, which has no problem whatsoever under these conditions, is still needed as a backup method (with all that it implies).

@nitikagoyal87 Is there at least any type of workaround that we can apply to make SSM session manager work when a root volume is full? Given the time this has been open, is this being worked on?

pplu avatar Sep 08 '22 11:09 pplu

I'm surprised to see this issue unaddressed. The EKS best practices docs suggest disabling SSH and using SSM instead.

Losing all access to a host in a case like this can be extremely painful.

wstewartlyra avatar Apr 17 '23 18:04 wstewartlyra

This is an issue that needs to be addressed.

mhare-bokf avatar May 24 '23 20:05 mhare-bokf

Was pretty stunned to find out about this bug and how long it's been open for. SSM is a great tool and can replace SSH for us almost completely... except for this one critical issue blocking it.

If some pet has failed and run out of space in a weird way, the last thing I want to spend time doing is mounting the disk on another machine and expanding it just to get enough working space that I can boot and SSM into the host to figure out what is actually going wrong.

jethrocarr avatar Jun 07 '23 05:06 jethrocarr

@nitikagoyal87 Was there ever any output from your initial investigation of this?

Thanks

dpwrussell avatar Oct 13 '23 13:10 dpwrussell

The best solution I found for Linux boxes was to:

  1. Stop the "full root" instance.
  2. Copy the disk information (volume ID) for its root disk from the AWS console.
  3. Let's get really careful now.
  4. Detach the root volume from the "full root" instance.
  5. Create a temporary t2.micro (free tier eligible) instance of the same OS.
  6. Once it's up, attach to it the root volume from the "full root" instance you detached in step 4 above.
  7. Log in and become root on the new t2.micro instance.
  8. mkdir /tmp/mnt (ignore the "already exists" complaint if presented).
  9. Run fdisk -l to determine the device name of the attached "full root" disk.
  10. Here's the trick: mount the "full root" disk onto the t2.micro with nosuid: mount -o nosuid,rw (the device from step 9) /tmp/mnt
  11. Clean up the possible space hogs in /tmp/mnt/var/log/ and /tmp/mnt/tmp (core dumps and other unexpected things); see the sketch after this list.
  12. When you have (hopefully) found all the wads: umount /tmp/mnt
  13. In the AWS console, detach the "full root" disk from the t2.micro and re-attach it to the original instance.
  14. Start the original "full root" instance (it should come up and allow you to log in again).
  15. Terminate the t2.micro.

HTH.
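
Condensed, steps 8-12 on the rescue instance look roughly like this (the device name /dev/xvdf1 is only an example; confirm yours against the fdisk -l / lsblk output first):

    sudo mkdir -p /tmp/mnt
    sudo fdisk -l                                   # or: lsblk -f
    sudo mount -o nosuid,rw /dev/xvdf1 /tmp/mnt
    sudo du -xh /tmp/mnt/var/log /tmp/mnt/tmp | sort -h | tail -20   # locate the space hogs
    # remove or truncate the offenders, e.g. (file name is just an example):
    # sudo rm /tmp/mnt/var/log/huge-old-file.log
    sudo umount /tmp/mnt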

gopher55 avatar Feb 27 '24 01:02 gopher55

This is an absolutely critical bug. If SSM absolutely must use a disk, we should be able to set up a separate partition to keep it working even if the rest of the system isn't. If SSM can't be relied on as a critical investigation tool, server admins will have to rely on SSH, which increases complexity and security risks and goes against AWS best practices for EKS, to say the least. @VishnuKarthikRavindran, as you have been contributing the most recently, is there a chance you could raise this issue with the product team so it gets some priority?

dmitry-livchak-qco avatar Mar 27 '24 12:03 dmitry-livchak-qco

As the previous comments already stated, it's critical that SSM keeps working even if the disk is full. sshd has had this capability forever, and it's exactly in those situations that you need to be able to rely on access.

Alternatives like SSH keys are outdated and less secure nowadays, but seemingly SSM does not yet have the maturity to replace them properly.

lobeck avatar Jun 14 '24 13:06 lobeck