
ssm-agent-worker max CPU usage at boot (infrequent)

Open sapphirecat opened this issue 4 years ago • 43 comments

Sometimes, the ssm-agent-worker gets stuck consuming all CPU resources (e.g. 179%+ on a 2-vCPU instance) after reboot.

I'm not sure what's helpful, but I'm attaching the logs I have, except that I cut journalctl output down to the lines containing amazon-ssm only. The instance is run in America/New_York (GMT -05:00 currently) after initial configuration, and timestamps appear to be local.

I used top to send signal 15, then signal 9, to the worker (the first did not work) and the service did not appear to notice, so I restarted the whole snap.amazon-ssm-agent.amazon-ssm-agent.service service after a few more seconds (plus time it took to even find that name.)
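The manual recovery described above (signal 15, then signal 9, then restarting the service) could be scripted roughly as follows. This is only a sketch under assumptions: the function names, the 150% threshold, and the 5-second wait are made up for illustration, and only the service name comes from the report.

```shell
#!/bin/sh
# Sketch only: function names, threshold, and wait time are illustrative assumptions.

# Read "pid pcpu args" lines on stdin (as printed by `ps -eo pid=,pcpu=,args=`)
# and print the PID of every ssm-agent-worker at or above the CPU limit in $1.
# args is matched rather than comm because comm is truncated to 15 characters
# on Linux and "ssm-agent-worker" has 16.
hot_worker_pids() {
    awk -v limit="$1" '
        $2 + 0 >= limit + 0 && $3 ~ /(^|\/)ssm-agent-worker$/ { print $1 }
    '
}

# Send TERM, then KILL, then restart the snap service named in the report.
recover_agent() {
    limit="${1:-150}"
    pids=$(ps -eo pid=,pcpu=,args= | hot_worker_pids "$limit")
    [ -n "$pids" ] || return 0               # nothing is spinning; leave the agent alone
    for pid in $pids; do
        sudo kill -TERM "$pid" 2>/dev/null
    done
    sleep 5
    for pid in $pids; do
        sudo kill -KILL "$pid" 2>/dev/null   # fails quietly if TERM already worked
    done
    sudo systemctl restart snap.amazon-ssm-agent.amazon-ssm-agent.service
}
```

Note that ps reports CPU averaged over the life of the process, so this check suits a worker that has been spinning since boot better than a short spike. Calling recover_agent from a root cron job or systemd timer would be one way to automate it.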

This AMI is customized, of course, but ultimately derives from the current Ubuntu EC2 releases listing, us-east-1 20.04 amd64.

Attachment: logs.zip

sapphirecat avatar Feb 03 '21 15:02 sapphirecat

I'm experiencing the same symptom, although much more frequently of late. This is impacting ~20-25% of newly launched instances.

svoeller99 avatar Feb 16 '21 18:02 svoeller99

@svoeller99 You are seeing the issue on Ubuntu as well?

kchitalia-amzn avatar Mar 09 '21 16:03 kchitalia-amzn

Yes, we're running Ubuntu 18.04.

svoeller99 avatar Apr 07 '21 23:04 svoeller99

I am seeing the same issue on Ubuntu 20.04

saraiyakush avatar Apr 08 '21 11:04 saraiyakush

I have this same problem: really high CPU usage by amazon-ssm-agent. The machine gets totally unstable and there is no explanation.

What are we supposed to do?

voltuer avatar Jun 03 '21 14:06 voltuer

Hi @tr4g, we had the same issue and have been keeping an eye on this thread, where I saw your comment. Could you please let us know which OS and kernel version you are seeing the problem on? Thank you!

rjayanthi-prod avatar Jun 03 '21 21:06 rjayanthi-prod

+1 same issue ...

vb8448 avatar Jun 05 '21 18:06 vb8448

Windows Server 2019 - 4vCPU - SAME ISSUE.

iam-sysop avatar Jul 14 '21 14:07 iam-sysop

@tr4g @sapphirecat @svoeller99 @thecarnie @saraiyakush Thanks for reaching out, and sorry for the delayed response. Are you still seeing this issue with the latest agent?

VishnuKarthikRavindran avatar Aug 25 '21 23:08 VishnuKarthikRavindran

@VishnuKarthikRavindran It was somewhat rare, happening maybe once every month or two, launching on average 1.2 instances per day. (Just infrequently enough that I never built a script to automatically handle the situation.) It hasn't happened again for me since I filed the issue, but I can't say with confidence that it's fixed.

We have continued to track the latest Ubuntu 20.04 AMI, so we should be getting both agent and kernel updates accordingly.

sapphirecat avatar Aug 26 '21 13:08 sapphirecat

@VishnuKarthikRavindran for me it is still happening with version 3.0.1124.0 on Ubuntu 20.04:

snap info amazon-ssm-agent
name:      amazon-ssm-agent
summary:   Agent to enable remote management of your Amazon EC2 instance configuration
publisher: Amazon Web Services (aws✓)
store-url: https://snapcraft.io/amazon-ssm-agent
contact:   https://aws.amazon.com/contact-us/
license:   unset
description: |
  The SSM Agent runs on EC2 instances and enables you to quickly and easily
  execute remote commands or scripts against one or more instances. The agent
  uses SSM documents. When you execute a command, the agent on the instance
  processes the document and configures the instance as specified. Currently,
  the SSM Agent and Run Command enable you to quickly run Shell scripts on an
  instance using the AWS-RunShellScript SSM document.
commands:
  - amazon-ssm-agent.ssm-cli
services:
  amazon-ssm-agent: simple, enabled, inactive
snap-id:      T09mpujiTnzSdSCuqNkE7YXXTWDq13tC
tracking:     latest/stable/ubuntu-20.04
refresh-date: yesterday at 18:01 UTC
channels:
  latest/stable:    3.0.1124.0 2021-07-29 (4046) 26MB classic
  latest/candidate: 3.1.192.0  2021-08-19 (4662) 27MB classic
  latest/beta:      ↑
  latest/edge:      ↑
installed:          3.0.1124.0            (4046) 26MB classic

radykal-com avatar Aug 27 '21 09:08 radykal-com

Hi @radykal-com, Is this issue reproducible on your end? If possible, could you please check whether you are seeing this with the latest version? Thanks

VishnuKarthikRavindran avatar Aug 27 '21 16:08 VishnuKarthikRavindran

Well, it's not easy to reproduce, as it happens randomly and with very low frequency. It happened to 6 or 7 instances out of a total of 100+. When it happens, it happens from the moment the instance starts. I decided to just uninstall the agent from our AMIs.

radykal-com avatar Aug 28 '21 12:08 radykal-com

Thanks @radykal-com for reaching out. We have made many improvements in the latest SSM agent versions. If you decide to use the agent again, please let us know whether the issue persists with the latest one.

VishnuKarthikRavindran avatar Aug 30 '21 23:08 VishnuKarthikRavindran

+1 here. Ubuntu 20.04, running only a simple website in an nginx Docker container on a t2.micro. Every 10 minutes or so the entire system locks up at 100% CPU for about 5 minutes. I tried rebooting via the console and on the CLI.

This is pretty unacceptable, and I am interested in possibly receiving a refund on my 3 reserved instances. How would I start that process so I can move to a more stable cloud server?

ghost avatar Sep 03 '21 20:09 ghost

Hi @WinterTFG, sorry to hear about that. Could you please share the repro steps with us if it is reproducible on your end?

As mentioned above, we have made many improvements in the latest SSM agent versions. If possible, could you run with the latest one? Thanks.

VishnuKarthikRavindran avatar Sep 08 '21 23:09 VishnuKarthikRavindran

I'm seeing similar behaviour on the latest:

summary:   Agent to enable remote management of your Amazon EC2 instance configuration
publisher: Amazon Web Services (aws✓)
store-url: https://snapcraft.io/amazon-ssm-agent
license:   unset
description: |
  The SSM Agent runs on EC2 instances and enables you to quickly and easily
  execute remote commands or scripts against one or more instances. The agent
  uses SSM documents. When you execute a command, the agent on the instance
  processes the document and configures the instance as specified. Currently,
  the SSM Agent and Run Command enable you to quickly run Shell scripts on an
  instance using the AWS-RunShellScript SSM document.
commands:
  - amazon-ssm-agent.ssm-cli
services:
  amazon-ssm-agent: simple, enabled, active
snap-id:      T09mpujiTnzSdSCuqNkE7YXXTWDq13tC
tracking:     latest/stable/ubuntu-20.04
refresh-date: 18 days ago, at 01:03 CEST
channels:
  latest/stable:    3.0.1124.0 2021-07-29 (4046) 26MB classic
  latest/candidate: 3.1.282.0  2021-09-09 (4750) 27MB classic
  latest/beta:      ↑
  latest/edge:      ↑
installed:          3.0.1124.0            (4046) 26MB classic

It was stale for 156 hours, and was eating 300% CPU.

mkdotam avatar Sep 13 '21 10:09 mkdotam

Hi @mkdotam, It looks like the installed agent version is 3.0.1124.0. Could you please check whether you are seeing this with latest version - 3.1.282.0? Thanks

VishnuKarthikRavindran avatar Sep 14 '21 15:09 VishnuKarthikRavindran

Still happening using snap version 3.1.338.0. I'm running ubuntu-focal-20.04-arm64. Happened twice just today

Whale-Observer-App avatar Sep 24 '21 19:09 Whale-Observer-App

Hi @Whale-Observer-App, may I know how you reproduced this? Also, could you please attach the logs if possible? Thanks.

VishnuKarthikRavindran avatar Sep 26 '21 16:09 VishnuKarthikRavindran

Just rebooted for the 2nd time today. amazon-ssm-agent, revision 4046.

tomaskovacik avatar Oct 11 '21 16:10 tomaskovacik

This started happening to me today. 50%+ consistent. The servers are new Windows instances created by Elastic Beanstalk. I created a dump file of the process, if that can help. Platform: IIS 10.0 running on 64bit Windows Server 2016/2.8.0.

lilmidnit avatar Nov 24 '21 22:11 lilmidnit

I have the same here. For months now, ssm-agent on one of my instances has been randomly spiking the CPU to the point that the instance is not even accessible anymore.

AWS Ubuntu 20.04 amazon-ssm-agent 3.0.1124.0

gcstr avatar Dec 21 '21 09:12 gcstr

Add swap as the first step after creating the instance:

https://aws.amazon.com/premiumsupport/knowledge-center/ec2-memory-swap-file/

Since doing this, I have never had an issue with the agent.
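For reference, the swap-file setup suggested above might look something like the sketch below. The path and the size are assumptions, and the DRY_RUN wrapper is a convenience added here so the commands can be previewed before anything is changed. Whether swap actually prevents the CPU spike is this commenter's experience only.

```shell
#!/bin/sh
# Sketch only: path, size, and helper names are illustrative assumptions.

# With DRY_RUN=1, print each command instead of running it under sudo.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        sudo "$@"
    fi
}

# Create and enable a swap file of $2 MiB at path $1.
make_swap() {
    swapfile="${1:-/swapfile}"   # assumed location
    size_mb="${2:-4096}"         # assumed size; adjust to the instance's memory
    run dd if=/dev/zero of="$swapfile" bs=1M count="$size_mb"
    run chmod 600 "$swapfile"    # swap files should not be world-readable
    run mkswap "$swapfile"
    run swapon "$swapfile"
    # To keep the swap file across reboots, /etc/fstab also needs a line like:
    #   /swapfile swap swap defaults 0 0
}
```

Running it as DRY_RUN=1 make_swap /swapfile 1024 prints the four commands without touching the system.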

tomaskovacik avatar Dec 21 '21 10:12 tomaskovacik

Thanks for reaching out again. We were able to reproduce this issue on our end. The fix shipped in the following agent release: https://github.com/aws/amazon-ssm-agent/releases/tag/3.1.426.0. Could you all please try updating to the latest one?
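A quick way to check whether a snap-installed agent already includes that release is to compare the installed: line of snap info against 3.1.426.0. The helper names below are hypothetical, and this is a sketch, not an official check.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative assumptions.

# Extract the installed version from `snap info amazon-ssm-agent` output on stdin.
installed_version() {
    awk '$1 == "installed:" { print $2; exit }'
}

# Succeed (exit 0) when dotted version $1 is >= dotted version $2,
# comparing each dot-separated field numerically.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# Example, on a host with snapd:
#   have=$(snap info amazon-ssm-agent | installed_version)
#   version_ge "$have" 3.1.426.0 || sudo snap refresh amazon-ssm-agent
```

Earlier in this thread the stable channel was still serving 3.0.1124.0 while candidate had newer builds, so depending on which channel carries the fixed release, the refresh may need --channel=candidate.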

VishnuKarthikRavindran avatar Dec 23 '21 17:12 VishnuKarthikRavindran

Thanks! I just updated it. Given that the issue is pretty random, I can't immediately test. But I'll keep monitoring in the upcoming days.

gcstr avatar Dec 24 '21 05:12 gcstr

I was dealing with this problem imagining it was some reaction to my code. However, after 3 days without much success I decided to try to send my code to another machine with the same operating system. It worked. My code stopped dying with this CPU spike coming from the ssm agent.

Ubuntu 20.04 Amazon ssm agent 4046

Pxeba avatar Jan 05 '22 13:01 Pxeba

I don't know why this issue was closed if no solution was given. I'm experiencing it as well

play-station avatar Jul 19 '22 22:07 play-station

Same situation today. I can't even log in to the console normally due to excessive load. The load average is over 15.

baturinivan avatar Sep 10 '22 01:09 baturinivan

You can simply solve this problem by running: sudo snap remove amazon-ssm-agent

You can find the full answer here: https://docs.aws.amazon.com/systems-manager/latest/userguide/sysman-uninstall-agent.html

shashikachamod1992 avatar Sep 13 '22 12:09 shashikachamod1992