Windows/EC2: ssm-document-worker process remains after service stop, resulting in "cannot access IPC file" errors + DeliveryTimedOut
Describe the bug
- At the moment we see a recurring but intermittent issue with our SSM Agents running on Windows (regular EC2 instances).
- We are using SSM Agent to execute Systems Manager Documents via scheduled Associations (= Run Command).
- When we hit the issue, the affected target shows the Detailed Status "DeliveryTimedOut" for the Association Execution in State Manager in the AWS Console.
Current Behavior
- As stated before, we apply a Document to a number of Windows targets. A small share of them results in Detailed Status: DeliveryTimedOut
- When we checked the local SSM Agent on Windows, we found the following pattern across the affected EC2 instances:
a) First, we checked amazon-ssm-agent.log and found the following entries:
2024-05-20 04:01:15 INFO [CredentialRefresher] Credentials ready
2024-05-20 04:01:15 INFO [CredentialRefresher] Next credential rotation will be in 29.999736375 minutes
2024-05-20 04:10:20 ERROR [amazon-ssm-agent] Error occurred while removing the IPC file: remove C:\ProgramData\Amazon\SSM\InstanceData\i-123456789awsIdEc2\channels\health\surveyor-20240519200113-476: The process cannot access the file because it is being used by another process.
2024-05-20 04:10:44 ERROR [amazon-ssm-agent] message C:\ProgramData\Amazon\SSM\InstanceData\i-123456789awsIdEc2\channels\health\respondent-20240519200116-473 failed to read: open C:\ProgramData\Amazon\SSM\InstanceData\i-123456789awsIdEc2\channels\health\respondent-20240519200116-473: The process cannot access the file because it is being used by another process.
2024-05-20 04:11:21 ERROR [amazon-ssm-agent] Error occurred while removing the IPC file: remove C:\ProgramData\Amazon\SSM\InstanceData\i-123456789awsIdEc2\channels\health\surveyor-20240519200113-477: The process cannot access the file because it is being used by another process.
2024-05-20 04:12:23 ERROR [amazon-ssm-agent] Error occurred while removing the IPC file: remove C:\ProgramData\Amazon\SSM\InstanceData\i-123456789awsIdEc2\channels\health\surveyor-20240519200113-478: The process cannot access the file because it is being used by another process.
2024-05-20 04:31:15 INFO EC2RoleProvider Successfully connected with instance profile role credentials
2024-05-20 04:31:16 INFO [CredentialRefresher] Credentials ready
b) If we then stop the Windows service "Amazon SSM Agent", the "ssm-document-worker" process remains in the Task Manager process list!
c) If we now start the Windows service "Amazon SSM Agent" again, the [headless] "ssm-document-worker" process stays around permanently. A second "ssm-document-worker" only shows up once a Document is executed, so there are sometimes two "ssm-document-worker" processes. The DeliveryTimedOut issue remains:
It seems this leftover (zombie) "ssm-document-worker" process is locking some internal files and blocking further execution of Documents/Run Commands on this target!
Again, all Associations/Run Commands sent to an instance in this state stay "Pending" for a long time and then end in Failed with "DeliveryTimedOut".
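To check many hosts for the log pattern from a), a small script can scan amazon-ssm-agent.log for the "being used by another process" errors. This is only a sketch: the regular expression is derived from the two ERROR lines quoted in this issue, the function names are made up for illustration, and other agent versions may word the messages differently.

```python
import re
from collections import Counter

# Matches the two ERROR shapes quoted in this issue (assumption: other agent
# versions may log these messages with different wording).
LOCK_ERROR = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) ERROR \[amazon-ssm-agent\] "
    r".*\\channels\\(?P<channel>[^\\]+)\\(?P<file>[\w-]+).*"
    r"being used by another process"
)


def find_locked_ipc_files(log_lines):
    """Return (timestamp, channel, file name) for every IPC lock error."""
    hits = []
    for line in log_lines:
        match = LOCK_ERROR.search(line)
        if match:
            hits.append((match["ts"], match["channel"], match["file"]))
    return hits


def summarize(log_lines):
    """Count lock errors per IPC file prefix (surveyor, respondent, ...)."""
    return Counter(
        name.split("-", 1)[0] for _, _, name in find_locked_ipc_files(log_lines)
    )
```

Feeding it the lines of the agent log (for example via `open(path, encoding="utf-8", errors="replace")`) gives a quick per-host indicator; a host with a non-empty result is a candidate for the leftover worker process described in b) and c).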
Workaround:
d) We need to stop the Windows service "Amazon SSM Agent" and kill the "ssm-document-worker" process via Task Manager using "End task". Afterwards we start the Windows service "Amazon SSM Agent" again and re-apply the association. It works right away (since the permanent, headless "ssm-document-worker" process is gone). In short: stop the service and kill any remaining ssm-document-worker processes.
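The manual workaround can be scripted. The sketch below is not an official procedure; it assumes the service is registered as `AmazonSSMAgent` and that the leftover process image is `ssm-document-worker.exe`, and it uses the standard Windows tools `net` and `taskkill`. The function names and the `dry_run` switch are invented for this example.

```python
import subprocess

SERVICE_NAME = "AmazonSSMAgent"           # assumed name of the "Amazon SSM Agent" service
WORKER_IMAGE = "ssm-document-worker.exe"  # assumed image name of the leftover process


def workaround_commands(service=SERVICE_NAME, image=WORKER_IMAGE):
    """Build the three workaround steps as argument lists."""
    return [
        ["net", "stop", service],          # stop the service and wait for it
        ["taskkill", "/F", "/IM", image],  # force-kill any remaining worker
        ["net", "start", service],         # start the service again
    ]


def run_workaround(dry_run=True):
    """Print the steps; execute them only when dry_run is False."""
    commands = workaround_commands()
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=False: taskkill exits non-zero when no worker is left,
            # which is not a failure for this workaround.
            subprocess.run(command, check=False)
    return commands
```

Run it from an elevated prompt with `run_workaround(dry_run=False)`; the default dry run only prints what would be executed. The association still has to be re-applied afterwards.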
Expected Behavior:
The instances with these Documents/Associations ran for many, many months without this bug pattern. We assume it may have started with the update from 3.3.380.0 to 3.3.418.0 (in our case on 15th May 2024), but we are not sure about this. At least we see a growing number of issues. That said, we do not expect these DeliveryTimedOut results at all while the Windows service is in status Running.
OS Version / Host
OS: Microsoft Windows Server 2019 Datacenter (Platform-Version: 10.0.17763) Host: EC2 Instance with IMDSv1 (Managed-Instance)
SSM Agent Version
Amazon SSM Agent Version: 3.3.418.0
Other information
I've opened AWS case 171620678600565 with the SSM team. We are sharing full logs and more details (region, instance ID, etc.) in that case. Feel free to request more details here as well - I'll do my best to upload them in an anonymized way.
Is there any update on this issue? We are having the same issue with the currently latest version, 3.3.551.0.
36 out of 349 hosts are affected. ssm-document-worker is killed by the agent a couple of minutes after restarting. New associations stay pending until the stuck processes get killed, but are then able to execute and finish.
update: AWS Support has confirmed the bug and the ssm team is reportedly working on it.
Hi @gmergulhao - It's good to know that we are not alone with the problem. We are still debugging this issue with AWS Support; it still happens occasionally on our Windows EC2s. The SSM team requested that we collect process data using Microsoft tools: procexp64.exe + handle.exe. I'm going to share the requested steps/data here:
On one of the affected instances, please download the Process Explorer tool (procexp64.exe) and extract it:
- After the issue has occurred, please do not stop the SSM Agent.
- From the extracted path, run the procexp64.exe as administrator
- Execute any SSM Run Command that returns the same error on the target instance
- Once the Run Command has completed with the error, in Process Explorer select View > Show Lower Pane, then View > Lower Pane View > Handles
- Search and select “amazon-ssm-agent.exe” > file > save > save the file
- Select “ssm-agent-worker.exe” > file > save > save the file
- Select “ssm-document-worker” > file > save > save the file
Additionally, please install the handle tool (handle.exe):
- Extract the tool and open a command prompt as admin. Navigate to the extracted path
- Run the command “Handle.exe ssm > c:\ssm_handle.txt”
- This will generate the file ssm_handle.txt in C drive.
Since any kind of data might help, maybe you could try to collect this process information as well? Did you get more details regarding a resolution of this bug? - I hope we can capture the process dump/information soon and share it with the SSM team.
According to AWS, release 3.3.987.0 should include a workaround for this issue.
Yes, the SSM team released version 3.3.987.0 👍 - The changelog mentions:
Use exponential retry for document worker, increase retry interval and attempt count when reading IPC files
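To illustrate what "exponential retry" means for a file that is briefly locked by another process, here is a generic sketch. It is not the agent's actual implementation (the agent is written in Go and its real intervals and attempt counts are not stated in this thread); the function name, defaults and the injectable `sleep` are assumptions made for the example.

```python
import time


def retry_with_backoff(operation, attempts=5, base_delay=0.1, factor=2.0,
                       sleep=time.sleep):
    """Call operation(); on PermissionError wait and retry, growing the wait.

    On Windows, opening a file that another process holds without sharing
    typically surfaces in Python as PermissionError (WinError 32).
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except PermissionError:
            if attempt == attempts:
                raise  # give up after the last attempt
            sleep(delay)
            delay *= factor  # wait longer before the next attempt


def read_ipc_file(path):
    """Read a file, retrying while it is locked (hypothetical helper)."""
    def _read():
        with open(path, "rb") as handle:
            return handle.read()
    return retry_with_backoff(_read)
```

With the defaults above the waits are 0.1 s, 0.2 s, 0.4 s and 0.8 s before the fifth and final attempt. A retry like this only helps with short-lived locks; a worker process that never exits would still block the file.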
We are using the AWS-managed document "AWS-UpdateSSMAgent", which uses the AWS reference file ssm-agent-manifest.json internally. This file defines the latest available version - At the moment the new version 3.3.987.0 isn't available there yet. We hope to update our SSM Agents on Windows soon and report back on whether the issue is fixed. I'll try to leave a short note here once version 3.3.987.0 is available (at least in eu-central-1).
We have released two bug fix commits that address this issue in both 3.3.987 and 3.3.1142.0. Please reopen if you still see this issue recurring.
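When checking a fleet against the fixed releases, compare versions numerically rather than as strings: as plain text, "3.3.1142.0" sorts before "3.3.987.0". A minimal sketch, assuming agent versions are always dot-separated integers as in this thread; the function names are illustrative only.

```python
FIXED_VERSION = "3.3.987.0"  # first release mentioned in this thread with a fix


def parse_version(text, width=4):
    """Turn '3.3.987' or '3.3.987.0' into a zero-padded tuple of ints."""
    parts = [int(part) for part in text.strip().split(".")]
    parts += [0] * (width - len(parts))  # pad so 3.3.987 == 3.3.987.0
    return tuple(parts)


def has_fix(installed, fixed=FIXED_VERSION):
    """True if the installed agent version is at or above the fixed one."""
    return parse_version(installed) >= parse_version(fixed)


for version in ("3.3.418.0", "3.3.551.0", "3.3.987.0", "3.3.1142.0"):
    print(version, has_fix(version))
```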