
Session Manager start-session hangs when root partition full

Open iancward opened this issue 6 years ago • 23 comments

I don't know if this is the appropriate place for this, but when I attempt to start a session (either via the AWS Console or via the CLI's session-manager-plugin) on an EC2 instance with a full root partition, it just hangs. In the console I get a blank screen for a long time and then eventually a blinking cursor that doesn't work.

In the CLI (via session-manager-plugin), I get a message that it's starting a session but then it just hangs. The CLI/plugin doesn't respond to Ctrl+C or Ctrl+D; in fact, I have to start a new terminal on my workstation and kill the CLI command and the plugin command.
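
For anyone else stuck in this state, a rough way to clean up the hung processes from a second terminal (the process names below are what they typically look like on a workstation; adjust the patterns to your setup):

    # From a second terminal: find the hung CLI and plugin processes, then kill them.
    pgrep -af "session-manager-plugin"
    pgrep -af "aws ssm start-session"
    # Kill the plugin first, then the wrapping CLI command.
    pkill -f "session-manager-plugin"
    pkill -f "aws ssm start-session"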

iancward avatar Mar 12 '19 19:03 iancward

Thanks for reaching out to us. We will investigate this.

nitikagoyal87 avatar Apr 08 '19 16:04 nitikagoyal87

Was this ever resolved? I'm having a similar problem where I just see a black screen in the console; if I click, I get a cursor that doesn't do anything.

Millerborn avatar Sep 13 '19 14:09 Millerborn

I am also having this issue

TajMahPaul avatar Oct 15 '19 19:10 TajMahPaul

Hi, I'm experiencing this issue as well. Are there any updates on whether it is going to be fixed? It would be great to at least get a descriptive error message if the connection to instances with no space is impossible. Right now the AWS Console / CLI just hangs without any visible reason.

IrinaTerlizhenko avatar Dec 13 '19 13:12 IrinaTerlizhenko

+1. Using the latest Ubuntu 18 AMI with the SSM agent installed and running, and the necessary SSM/CloudWatch policies attached to the role. The weirdest thing is that it happens on some instances and not on others. Seems like a bug.

xacaxulu avatar Jan 10 '20 18:01 xacaxulu

Unfortunately this is still happening. AWS, you should really do something here.

iniinikoski avatar Mar 06 '20 08:03 iniinikoski

Hello,

We experienced the same issue on Red Hat 7.7: we couldn't reach the instance through Session Manager once the /var partition was full.

On the other hand, we observed different behavior when the /var/log partition was full: the machine was still reachable. In any case, when we rely on Session Manager for remote access to a server, we would expect to keep access to the EC2 instance even when a filesystem is full.

cholletjo avatar Jun 30 '20 14:06 cholletjo

In all cases where Session Manager has not been able to open a connection due to a full disk, I've been able to use SSH to get access. We would like to remove SSH and switch solely to Session Manager, but that doesn't seem possible with longstanding issues like this, so we are keeping SSH as a backup, where we can open the ports and distribute the pem key as needed during emergencies.

rossmckelvie avatar Aug 04 '20 15:08 rossmckelvie

This is a bit of a circular problem given that the way you're supposed to check the available disk space on a volume is by opening a terminal session in the instance. 🤦‍♀️ https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-describing-volumes.html

crawforde avatar Sep 03 '20 17:09 crawforde

Not sure if it's totally related, but I'm running into an issue with the start-session command hanging when the target instance is offline. As @iancward mentioned, Ctrl+C etc. does not exit; closing and re-opening the terminal is necessary.

I haven't had the opportunity to really dig into it, but I took a quick look at the code for the CLI, and found this:

https://github.com/aws/aws-cli/blob/master/awscli/customizations/sessionmanager.py

        try:
            # ignore_user_entered_signals ignores these signals
            # because if signals which kills the process are not
            # captured would kill the foreground process but not the
            # background one. Capturing these would prevents process
            # from getting killed and these signals are input to plugin
            # and handling in there
            with ignore_user_entered_signals():
                # call executable with necessary input
                check_call(["session-manager-plugin",
                            json.dumps(response),
                            region_name,
                            "StartSession",
                            profile_name,
                            json.dumps(parameters),
                            endpoint_url])
            return 0

Looks like the terminate signals are being swallowed intentionally? I'm not totally sure, but I reckon this ties into things 😄
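
For what it's worth, you can reproduce the "Ctrl+C does nothing" symptom in isolation with a toy example (this is just an illustration of ignored keyboard signals, not the CLI's actual code):

    # A shell that ignores the keyboard-generated signals, then runs a child.
    # The child inherits the ignored dispositions, so Ctrl+C / Ctrl+\ / Ctrl+Z
    # do nothing; you have to kill it from another terminal, e.g. pkill -f 'sleep 300'.
    bash -c "trap '' INT QUIT TSTP; sleep 300"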

jackdcasey avatar Oct 29 '20 02:10 jackdcasey

Any updates on resolving this issue? I've run into the same problem.

twhetzel avatar Oct 06 '21 07:10 twhetzel

Any updates please?

gaalandr avatar Oct 08 '21 10:10 gaalandr

I've heard I should try to increase the volume size, but it's not clear if this will delete all data on the disk.

twhetzel avatar Oct 09 '21 02:10 twhetzel

I've heard I should try to increase the volume size, but it's not clear if this will delete all data on the disk.

Hi @twhetzel, increasing the volume size should not delete the data. However, if you increase the EBS volume size, you will need to access the instance and run a few commands to extend the file system so that it uses the extra capacity. See the links below for Linux and Windows instances:

Windows: https://aws.amazon.com/premiumsupport/knowledge-center/expand-ebs-root-volume-windows/ or https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/recognize-expanded-volume-windows.html

Linux: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/recognize-expanded-volume-linux.html
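
For reference, on Linux the commands usually boil down to something like the sketch below (assuming the root filesystem lives on partition 1 of /dev/xvda; device names vary, so check lsblk first, and growpart comes from the cloud-utils / cloud-guest-utils package):

    lsblk                        # confirm the volume now shows the larger size
    sudo growpart /dev/xvda 1    # grow partition 1 to fill the resized volume
    sudo resize2fs /dev/xvda1    # for an ext4 root filesystem
    # sudo xfs_growfs /          # use this instead if the root filesystem is XFS
    df -h /                      # verify the extra capacity is available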

IdrisAbdul-Hussein avatar Oct 28 '21 10:10 IdrisAbdul-Hussein

This is a very important issue. I had an EC2 instance that I used SSM to connect to. It had no SSH keys and was located in a private subnet. It ran out of space and got essentially "bricked" because SSM stopped working. SSM is a critical connectivity tool, and an instance becoming inaccessible for no good reason is a huge risk.

asyschikov avatar Mar 18 '22 13:03 asyschikov

It is an important operational requirement to be able to log in to an instance whose root partition is full. If SSM Session Manager cannot handle this, SSH, which has no problem whatsoever under these conditions, is still needed as a backup method (with all that it implies).

@nitikagoyal87 Is there at least any type of workaround that we can apply to make SSM session manager work when a root volume is full? Given the time this has been open, is this being worked on?

pplu avatar Sep 08 '22 11:09 pplu

I'm surprised to see this issue unaddressed. The EKS best practices docs suggest disabling SSH and using SSM instead.

Losing all access to a host in a case like this can be extremely painful.

wstewartlyra avatar Apr 17 '23 18:04 wstewartlyra

This is an issue that needs to be addressed.

mhare-bokf avatar May 24 '23 20:05 mhare-bokf

Was pretty stunned to find out about this bug and how long it's been open for. SSM is a great tool and can replace SSH for us almost completely... except for this one critical issue blocking it.

If some pet has failed and run out of space in a weird way, the last thing I want to spend time doing is mounting the disk on another machine and expanding it just to get enough working space that I can boot and SSM into the host to figure out what is actually going wrong.

jethrocarr avatar Jun 07 '23 05:06 jethrocarr

@nitikagoyal87 Was there ever any output from your initial investigation of this?

Thanks

dpwrussell avatar Oct 13 '23 13:10 dpwrussell

The best solution I found for Linux boxes was to:

  1. Stop the "full root" instance.
  2. Copy the disk information (volume ID) for its root disk from the AWS console.
  3. Let's get really careful now.
  4. Detach the root volume from the "full root" instance.
  5. Create a temporary t2.micro (free tier eligible) instance of the same OS.
  6. Once it's up, attach to it the root volume from the "full root" instance you detached in step 4 above.
  7. Log in and become root on the new t2.micro instance.
  8. mkdir /tmp/mnt (ignore the "already exists" complaint if presented).
  9. Run fdisk -l to determine the device name of the attached "full root" disk.
  10. Here's the trick: mount the "full root" disk onto the t2.micro with nosuid: mount -o nosuid,rw (the device from step 9) /tmp/mnt
  11. Clean up the possible space hogs in /tmp/mnt/var/log/ and /tmp/mnt/tmp (core dumps and other unexpected things); see the sketch after this list.
  12. When you have (hopefully) found all the wads: umount /tmp/mnt
  13. In the AWS console, detach the "full root" disk from the t2.micro and re-attach it to the original instance.
  14. Start the original "full root" instance (it should come up and allow you to log in again).
  15. Terminate the t2.micro.

HTH.
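
Condensed, steps 8-12 on the rescue instance look roughly like this (the device name /dev/xvdf1 is only an example; confirm yours against the fdisk -l / lsblk output first):

    sudo mkdir -p /tmp/mnt
    sudo fdisk -l                                   # or: lsblk -f
    sudo mount -o nosuid,rw /dev/xvdf1 /tmp/mnt
    sudo du -xh /tmp/mnt/var/log /tmp/mnt/tmp | sort -h | tail -20   # locate the space hogs
    # remove or truncate the offenders, e.g. (file name is just an example):
    # sudo rm /tmp/mnt/var/log/huge-old-file.log
    sudo umount /tmp/mnt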

gopher55 avatar Feb 27 '24 01:02 gopher55

This is an absolutely critical bug. If SSM absolutely must use a disk, we should be able to set up a separate partition to keep it working even if the rest of the system isn't. If SSM can't be relied on as a critical investigation tool, server admins will have to rely on SSH, which increases complexity and security risks and goes against AWS best practices for EKS, to say the least. @VishnuKarthikRavindran, as you have been contributing the most recently, is there a chance you could raise this issue with the product team so it gets some priority?

dmitry-livchak-qco avatar Mar 27 '24 12:03 dmitry-livchak-qco

As the previous comments already stated, it's critical that SSM keeps working even if the disk is full. sshd has had this capability forever, and it's exactly in those situations that you need to be able to rely on access.

Alternatives like SSH keys are outdated and less secure nowadays, but seemingly SSM does not yet have the maturity to replace them properly.

lobeck avatar Jun 14 '24 13:06 lobeck