community.aws icon indicating copy to clipboard operation
community.aws copied to clipboard

ssm connection plugin fails at gathering facts

Open MVJosh opened this issue 4 years ago • 42 comments

SUMMARY

Using the ssm plugin, playbooks fail at the gathering facts stage, with:

An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ValueError: invalid literal for int() with base 10: "echo $'\\n'$?"
fatal: [i-xxxxx]: FAILED! => {"msg": "Unexpected failure during module execution.", "stdout": ""}

If using gather_facts: false, the same error occurs at the first task

As far as I can tell, SSM is configured correctly: I can run aws ssm start-session --target i-xxxxx from my ansible host and successfully get a shell on the target host.

ISSUE TYPE
  • Bug Report
COMPONENT NAME

ssm connection plugin

ANSIBLE VERSION
ansible 2.9.9
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/home/ubuntu/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/ubuntu/.local/lib/python3.8/site-packages/ansible
  executable location = /home/ubuntu/.local/bin/ansible
  python version = 3.8.2 (default, Apr 27 2020, 15:53:34) [GCC 9.3.0]
CONFIGURATION
INVENTORY_ENABLED(/etc/ansible/ansible.cfg) = ['aws_ec2']
OS / ENVIRONMENT

Both the instance running Ansible and the target (SSM-managed) instance are regular ubuntu 20.04 AMIs

STEPS TO REPRODUCE
  • Launch 2 EC2 instances running Ubuntu 20.04 (in same AWS region)
  • Install ansible, boto3, aws cli & sms plugin on first instance
  • Run AWS Systems Manager quick start (using default roles & choosing all instances as targets)
  • confirm that SSM can start a shell from Ansible to second instance: aws ssm start-session --target i-xxxxx
  • Enable ansible aws_ec2 inventory plugin
  • Run the following playbook with the attached inventory config:
# Playbook

- hosts: all
  vars:
    ansible_connection: community.aws.aws_ssm
    ansible_aws_ssm_region: eu-west-2 # substitute for your region
  tasks:
    - name: test
      command:
        cmd: ls -l
# aws_ec2.yaml inventory

plugin: aws_ec2
regions:
  - eu-west-2
keyed_groups:
  - prefix: tag
    key: tags
  - prefix: aws_region
    key: placement.region
hostnames:
  - instance-id
compose:
  ansible_host: instance-id

Run:

ansible-playbook -i aws_ec2.yaml -c community.aws.aws_ssm test-playbook.yaml
EXPECTED RESULTS

Playbook runs successfully with no errors

ACTUAL RESULTS

Playbook errors on gathering facts stage, with:

An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ValueError: invalid literal for int() with base 10: "echo $'\\n'$?"
fatal: [i-xxxxx]: FAILED! => {"msg": "Unexpected failure during module execution.", "stdout": ""}

with -vvvv verbosity:

ansible-playbook 2.9.9
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/home/ubuntu/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/ubuntu/.local/lib/python3.8/site-packages/ansible
  executable location = /home/ubuntu/.local/bin/ansible-playbook
  python version = 3.8.2 (default, Apr 27 2020, 15:53:34) [GCC 9.3.0]
Using /etc/ansible/ansible.cfg as config file
setting up inventory plugins
Parsed /home/ubuntu/aws_ec2.yaml inventory source with aws_ec2 plugin
Loading callback plugin default of type stdout, v2.0 from /home/ubuntu/.local/lib/python3.8/site-packages/ansible/plugins/callback/default.py

PLAYBOOK: test_ssm.yaml ****************************************************************************************************************************************************************************************
Positional arguments: test_ssm.yaml
verbosity: 4
connection: community.aws.aws_ssm
timeout: 10
become_method: sudo
tags: ('all',)
inventory: ('/home/ubuntu/aws_ec2.yaml',)
forks: 5
1 plays in test_ssm.yaml

PLAY [all] *****************************************************************************************************************************************************************************************************

TASK [Gathering Facts] *****************************************************************************************************************************************************************************************
task path: /home/ubuntu/test_ssm.yaml:1
<i-xxxxx> ESTABLISH SSM CONNECTION TO: i-xxxxx
<i-xxxxx> SSM COMMAND: ['/usr/local/bin/session-manager-plugin', '{"SessionId": "i-xxxxx-yyyyy", "TokenValue": "XXXXX", "StreamUrl": "wss://ssmmessages.eu-west-2.amazonaws.com/v1/data-channel/i-yyyyy-zzzzz?role=publish_subscribe", "ResponseMetadata": {"RequestId": "xxxx-xxxx-xxxx-xxxx", "HTTPStatusCode": 200, "HTTPHeaders": {"x-amzn-requestid": "xxxx-xxxx-xxxx-xxxx-xxxx", "content-type": "application/x-amz-json-1.1", "content-length": "626", "date": "Fri, 19 Jun 2020 14:41:52 GMT"}, "RetryAttempts": 0}}', 'eu-west-2', 'StartSession', '', '{"Target": "i-xxxxx"}', 'https://ssm.eu-west-2.amazonaws.com']
<i-xxxxx> SSM CONNECTION ID: i-xxxx-xxxxx
<i-xxxxx> EXEC echo ~
<i-xxxxx> _wrap_command: 'echo XXXXX
echo ~
echo $'\n'$?
echo YYYYY
'
<i-xxxxx> EXEC stdout line: 
<i-xxxxx> EXEC stdout line: Starting session with SessionId: i-xxxxx-yyyyy
<i-xxxxx> EXEC remaining: 60
<i-xxxxx> EXEC stdout line: $ stty -echo
<i-xxxxx> EXEC stdout line: PS1=''
<i-xxxxx> EXEC stdout line: echo XXXXX
<i-xxxxx> EXEC stdout line: echo ~
<i-xxxxx> EXEC stdout line: echo $'\n'$?
<i-xxxxx> EXEC stdout line: echo YYYYY
<i-xxxxx> POST_PROCESS: echo ~
echo $'\n'$?
<i-xxxxx> ssm_retry: attempt: 0, caught exception(invalid literal for int() with base 10: "echo $'\\n'$?") from cmd (echo ~...), pausing for 0 seconds
<i-xxxxx> CLOSING SSM CONNECTION TO: i-xxxxx
<i-xxxxx> TERMINATE SSM SESSION: i-xxxxx-yyyyy
<i-xxxxx> ESTABLISH SSM CONNECTION TO: i-xxxxx
<i-xxxxx> SSM COMMAND: ['/usr/local/bin/session-manager-plugin', '{"SessionId": "i-xxxxx-yyyyy", "TokenValue": "XXXXX", "StreamUrl": "wss://ssmmessages.eu-west-2.amazonaws.com/v1/data-channel/i-yyyyy-zzzzz?role=publish_subscribe", "ResponseMetadata": {"RequestId": "xxxx-xxxx-xxxx-xxxx-xxxx", "HTTPStatusCode": 200, "HTTPHeaders": {"x-amzn-requestid": "xxxx-xxxx-xxxx-xxxx-xxxx", "content-type": "application/x-amz-json-1.1", "content-length": "626", "date": "Fri, 19 Jun 2020 14:41:53 GMT"}, "RetryAttempts": 0}}', 'eu-west-2', 'StartSession', '', '{"Target": "i-xxxxx"}', 'https://ssm.eu-west-2.amazonaws.com']
<i-xxxxx> SSM CONNECTION ID: i-xxxx-xxxx
<i-xxxxx> EXEC echo ~
<i-xxxxx> _wrap_command: 'echo YYYYY
echo ~
echo $'\n'$?
echo ZZZZZ
'
<i-xxxxx> EXEC stdout line: 
<i-xxxxx> EXEC stdout line: Starting session with SessionId: i-xxxx-yyyy
<i-xxxxx> EXEC remaining: 60
<i-xxxxx> EXEC stdout line: $ stty -echo
<i-xxxxx> EXEC stdout line: PS1=''
<i-xxxxx> EXEC stdout line: echo YYYYY
<i-xxxxx> EXEC stdout line: echo ~
<i-xxxxx> EXEC stdout line: echo $'\n'$?
<i-xxxxx> EXEC stdout line: echo ZZZZZ
<i-xxxxx> POST_PROCESS: echo ~
echo $'\n'$?
<i-xxxxx> CLOSING SSM CONNECTION TO: i-xxxxx
<i-xxxxx> TERMINATE SSM SESSION: i-xxxxx-yyyyy
The full traceback is:
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/ansible/executor/task_executor.py", line 146, in run
    res = self._execute()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/ansible/executor/task_executor.py", line 645, in _execute
    result = self._handler.run(task_vars=variables)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/ansible/plugins/action/gather_facts.py", line 79, in run
    res = self._execute_module(module_name=fact_module, module_args=mod_args, task_vars=task_vars, wrap_async=False)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/ansible/plugins/action/__init__.py", line 780, in _execute_module
    self._make_tmp_path()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/ansible/plugins/action/__init__.py", line 343, in _make_tmp_path
    tmpdir = self._remote_expand_user(self.get_shell_option('remote_tmp', default='~/.ansible/tmp'), sudoable=False)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/ansible/plugins/action/__init__.py", line 664, in _remote_expand_user
    data = self._low_level_execute_command(cmd, sudoable=False)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/ansible/plugins/action/__init__.py", line 1075, in _low_level_execute_command
    rc, stdout, stderr = self._connection.exec_command(cmd, in_data=in_data, sudoable=sudoable)
  File "/home/ubuntu/.ansible/collections/ansible_collections/community/aws/plugins/connection/aws_ssm.py", line 197, in wrapped
    return_tuple = func(self, *args, **kwargs)
  File "/home/ubuntu/.ansible/collections/ansible_collections/community/aws/plugins/connection/aws_ssm.py", line 389, in exec_command
    returncode, stdout = self._post_process(stdout, mark_begin)
  File "/home/ubuntu/.ansible/collections/ansible_collections/community/aws/plugins/connection/aws_ssm.py", line 442, in _post_process
    returncode = int(stdout.splitlines()[-2])
ValueError: invalid literal for int() with base 10: "echo $'\\n'$?"
fatal: [i-xxxxx]: FAILED! => {
    "msg": "Unexpected failure during module execution.",
    "stdout": ""
}

MVJosh avatar Jun 19 '20 14:06 MVJosh

Same issue, unusual plugin

ilyavaiser avatar Aug 13 '20 23:08 ilyavaiser

I looked into this, and there are several things wrong I think.

See this line

            cmd = "echo " + mark_start + "\n" + cmd + "\necho $'\\n'$?\n" + "echo " + mark_end + "\n"

The shell used by the session-manager is sh, which does not support bashisms like $'\n' to print a new line. See the difference:

sh

$ echo $'\n'$?
$
0

bash

$ echo $'\n'$?

0

However, that might be a red herring. Notice the output:

<i-xxxxx> EXEC stdout line: $ stty -echo
<i-xxxxx> EXEC stdout line: PS1=''
<i-xxxxx> EXEC stdout line: echo YYYYY
<i-xxxxx> EXEC stdout line: echo ~
<i-xxxxx> EXEC stdout line: echo $'\n'$?
<i-xxxxx> EXEC stdout line: echo ZZZZZ

That is the input commands, not the stdout. Where is the output?

abeluck avatar Sep 03 '20 15:09 abeluck

Here is a workaround that avoids sh entirely by starting a bash shell.

--- a/plugins/connection/aws_ssm.py	2020-09-03 18:43:43.818000000 +0200
+++ b/plugins/connection/aws_ssm.py	2020-09-03 18:43:19.805000000 +0200
@@ -288,11 +288,17 @@
 
         profile_name = ''
         region_name = self.get_option('region')
-        ssm_parameters = dict()
 
         client = boto3.client('ssm', region_name=region_name)
         self._client = client
-        response = client.start_session(Target=self.instance_id, Parameters=ssm_parameters)
+
+        if self.is_windows:
+            ssm_parameters = dict()
+            response = client.start_session(Target=self.instance_id, Parameters=ssm_parameters)
+        else:
+            ssm_parameters = {"command": ["bash -l"]}
+            response = client.start_session(Target=self.instance_id, DocumentName="AWS-StartInteractiveCommand", Parameters=ssm_parameters)
+
         self._session_id = response['SessionId']
 
         cmd = [

abeluck avatar Sep 03 '20 16:09 abeluck

I looked at this for a bit today and found that you were right to be skeptical @abeluck .

The issue appears to actually be in _prepare_terminal. The current code does not wait for tty -echo to return before sending additional bytes. This causes all characters that are sent prior to the return of tty -echo to be printed and any characters written afterwards to not be printed.

Since the commands are sent so quickly, the terminal prints all lines. Then we break out of the loop because <start_mark> is definitely present in echo <start_mark>. You can quickly fix the behavior by adding a delay (and also probably the fix for the bashism) Here are some logs to demonstrate:

After removing the loop breakout:

Loading callback plugin minimal of type stdout, v2.0 from /usr/lib/python3.8/site-packages/ansible/plugins/callback/minimal.py
META: ran handlers
<ip-___-__-_-___.ec2.internal> ESTABLISH SSM CONNECTION TO: i-0cc44a1e2f7995c53
<ip-___-__-_-___.ec2.internal> SSM COMMAND: ['/usr/local/bin/session-manager-plugin',...]
<ip-___-__-_-___.ec2.internal> SSM CONNECTION ID: DanFallon-095d5d09e63ce624c
<ip-___-__-_-___.ec2.internal> EXEC echo ~ubuntu
<ip-___-__-_-___.ec2.internal> _wrap_command: 'echo PKjewgTIHJcLjaXeCjlasDjfZw
echo ~ubuntu
echo $'\n'$?
echo uZmXXhFRmkCzCEZEPjWMQlNiIF
'
<ip-___-__-_-___.ec2.internal> EXEC stdout line: 
<ip-___-__-_-___.ec2.internal> EXEC stdout line: Starting session with SessionId: DanFallon-095d5d09e63ce624c
<ip-___-__-_-___.ec2.internal> EXEC remaining: 60
<ip-___-__-_-___.ec2.internal> EXEC stdout line: $ stty -echo ; 
<ip-___-__-_-___.ec2.internal> EXEC stdout line: PS1='' ; 
<ip-___-__-_-___.ec2.internal> EXEC stdout line: echo PKjewgTIHJcLjaXeCjlasDjfZw
<ip-___-__-_-___.ec2.internal> EXEC stdout line: echo ~ubuntu
<ip-___-__-_-___.ec2.internal> EXEC stdout line: echo $'\n'$?
<ip-___-__-_-___.ec2.internal> EXEC stdout line: echo uZmXXhFRmkCzCEZEPjWMQlNiIF
<ip-___-__-_-___.ec2.internal> EXEC stdout line: $ PKjewgTIHJcLjaXeCjlasDjfZw
<ip-___-__-_-___.ec2.internal> EXEC stdout line: /home/ubuntu
<ip-___-__-_-___.ec2.internal> EXEC stdout line: $
<ip-___-__-_-___.ec2.internal> EXEC stdout line: 0
<ip-___-__-_-___.ec2.internal> EXEC stdout line: uZmXXhFRmkCzCEZEPjWMQlNiIF
<ip-___-__-_-___.ec2.internal> EXEC remaining: 59
<ip-___-__-_-___.ec2.internal> EXEC remaining: 58
<ip-___-__-_-___.ec2.internal> EXEC remaining: 57
<ip-___-__-_-___.ec2.internal> EXEC remaining: 56
<ip-___-__-_-___.ec2.internal> EXEC remaining: 55
<ip-___-__-_-___.ec2.internal> EXEC remaining: 54
<ip-___-__-_-___.ec2.internal> EXEC remaining: 53
...

After adding a 5 second delay

Loading callback plugin minimal of type stdout, v2.0 from /usr/lib/python3.8/site-packages/ansible/plugins/callback/minimal.py
META: ran handlers
<ip-___-__-_-___.ec2.internal> ESTABLISH SSM CONNECTION TO: i-0cc44a1e2f7995c53
<ip-___-__-_-___.ec2.internal> SSM COMMAND: ['/usr/local/bin/session-manager-plugin', ...]
Sleeping for 5 seconds.
<ip-___-__-_-___.ec2.internal> SSM CONNECTION ID: DanFallon-05193c016e933c229
<ip-___-__-_-___.ec2.internal> EXEC echo ~ubuntu
<ip-___-__-_-___.ec2.internal> _wrap_command: 'echo KylNOeAyCbfmTzlCZGrbHiYnSm
echo ~ubuntu
echo $'\n'$?
echo PujmuNjsUIsmGZzoUwNAUvowBk
'
<ip-___-__-_-___.ec2.internal> EXEC stdout line: 
<ip-___-__-_-___.ec2.internal> EXEC stdout line: Starting session with SessionId: DanFallon-05193c016e933c229
<ip-___-__-_-___.ec2.internal> EXEC stdout line: $ stty -echo ; 
<ip-___-__-_-___.ec2.internal> EXEC stdout line: PS1='' ; 
<ip-___-__-_-___.ec2.internal> EXEC stdout line: $ KylNOeAyCbfmTzlCZGrbHiYnSm
<ip-___-__-_-___.ec2.internal> EXEC stdout line: /home/ubuntu
<ip-___-__-_-___.ec2.internal> EXEC stdout line: $
<ip-___-__-_-___.ec2.internal> EXEC stdout line: 0
<ip-___-__-_-___.ec2.internal> EXEC stdout line: PujmuNjsUIsmGZzoUwNAUvowBk
<ip-___-__-_-___.ec2.internal> EXEC remaining: 60
<ip-___-__-_-___.ec2.internal> EXEC remaining: 59
<ip-___-__-_-___.ec2.internal> EXEC remaining: 58
<ip-___-__-_-___.ec2.internal> EXEC remaining: 57
<ip-___-__-_-___.ec2.internal> EXEC remaining: 56
<ip-___-__-_-___.ec2.internal> EXEC remaining: 55

DanielFallon avatar Sep 30 '20 23:09 DanielFallon

Would love to see these fixes land, and maybe get some tests around this plugin (not sure how that would work).

We bailed on the ssm plugin for now and are running ansible over ssh over ssm, which means we still have to manage ssh keys on instances (though we don't need a bastion any longer).

abeluck avatar Oct 01 '20 15:10 abeluck

Another strange issue that i see related to this during gathering facts is the AnsiballZ_setup.py that is run does something to mess with the shell and always hits EXEC remaining and always times out after without echoing to stdout the new line and the end mark. My remote host is running an older version of python so I'm initially thinking thats related. Trying to stand up a venv on remote to use as env

mikeneiderhauser avatar Oct 01 '20 16:10 mikeneiderhauser

Here's a patch that I'm willing to contribute (haven't gotten a chance to set up a fork yet)

  • Replaces echo with printf because it's more portable and robust
  • waits on receiving command replies before starting additional executions.

Problems I still see:

  • Does aws_ssm.py need to support python2? I don't think my code is compatible. (and I thought boto3 was python3 only, maybe I'm wrong)
  • cmd (line 517) when executed could eat the end_mark command. maybe we could disconnect cmd's stdin? Any solution I thought of for this makes some assumptions about the command or the terminal that I didn't like.

Anywho, I freely release the below under a GPL V3.0+ license like the files from which the were derived. I can look at figuring out the contributor agreement/etc later this week.

diff --git a/plugins/connection/aws_ssm.py b/plugins/connection/aws_ssm.py
index 7f7d692..4dc3965 100644
--- a/plugins/connection/aws_ssm.py
+++ b/plugins/connection/aws_ssm.py
@@ -412,11 +412,97 @@ class Connection(ConnectionBase):
     def _prepare_terminal(self):
         ''' perform any one-time terminal settings '''
 
-        if not self.is_windows:
-            cmd = "stty -echo\n" + "PS1=''\n"
-            cmd = to_bytes(cmd, errors='surrogate_or_strict')
-            self._session.stdin.write(cmd)
+        # No windows setup for now
+        if self.is_windows:
+            return
+
+        # *_complete variables are 3 valued:
+        #   - None: not started
+        #   - False: started
+        #   - True: complete
+
 
+        startup_complete = False
+        startup_reply = re.compile(
+            r"Starting session with SessionId:\W+" +
+            re.escape(self._session_id) +
+            r"\r\n\$ ", re.MULTILINE)
+
+        disable_echo_complete = None
+        disable_echo_cmd = to_bytes("stty -echo\n", errors='surrogate_or_strict')
+        disable_echo_reply = re.compile(
+            r"stty \-echo" +
+            r"\r\r\n\$", re.MULTILINE
+        )
+
+        disable_prompt_complete = None
+        end_mark = "".join([random.choice(string.ascii_letters) for i in xrange(self.MARK_LENGTH)])
+        disable_prompt_cmd = to_bytes(
+            "PS1='' ; printf '\\n%s\\n' '" + end_mark + "'\n",
+            errors='surrogate_or_strict')
+        disable_prompt_reply = re.compile(
+            r"\r\r\n" +
+            re.escape(end_mark) +
+            r"\r\r\n", re.MULTILINE
+        )
+
+        stdout = ""
+        cursor = 0
+        # Custom command execution for when we're waiting for startup
+        stop_time = int(round(time.time())) + self.get_option('ssm_timeout')
+        while (not disable_prompt_complete) and (self._session.poll() is None):
+            remaining = stop_time - int(round(time.time()))
+            if remaining < 1:
+                self._timeout = True
+                display.vvvv(u"PRE timeout stdout: {0}".format(to_bytes(stdout)), host=self.host)
+                raise AnsibleConnectionFailure("SSM start_session timeout on host: %s"
+                                               % self.instance_id)
+            if self._poll_stdout.poll(1000):
+                stdout += to_text(self._stdout.read(1024))
+                display.vvvv(u"PRE stdout line: {0}".format(to_bytes(stdout)), host=self.host)
+            else:
+                display.vvvv(u"PRE remaining: {0}".format(remaining), host=self.host)
+
+            # wait til prompt is ready
+            if startup_complete is False:
+                match = startup_reply.search(stdout,cursor)
+                if match:
+                    display.vvvv(u"PRE startup output received", host=self.host)
+                    cursor = match.end()
+                    startup_complete = True
+                    
+
+            # disable echo
+            if startup_complete and (disable_echo_complete is None):
+                display.vvvv(u"PRE Disabling Echo: {0}".format(disable_echo_cmd), host=self.host)
+                self._session.stdin.write(disable_echo_cmd)
+                disable_echo_complete = False
+
+            if disable_echo_complete is False:
+                match = disable_echo_reply.search(stdout)
+                if match:
+                    stdout = stdout[match.end():]
+                    disable_echo_complete = True
+            
+
+            # disable prompt
+            if disable_echo_complete and disable_prompt_complete is None:
+                display.vvvv(u"PRE Disabling Prompt: {0}".format(disable_prompt_cmd), host=self.host)
+                self._session.stdin.write(disable_prompt_cmd)
+                disable_prompt_complete = False
+
+            if disable_prompt_complete is False:
+                match = disable_prompt_reply.search(stdout)
+                if match:
+                    stdout = stdout[match.end():]
+                    disable_prompt_complete = True
+        
+        if not disable_prompt_complete:
+            raise AnsibleConnectionFailure("SSM process closed during _prepare_terminal on host: %s"
+                                               % self.instance_id)
+        else:
+            display.vvv(u"PRE Terminal configured", host=self.host)
+            
     def _wrap_command(self, cmd, sudoable, mark_start, mark_end):
         ''' wrap command so stdout and status can be extracted '''
 
@@ -427,7 +513,9 @@ class Connection(ConnectionBase):
         else:
             if sudoable:
                 cmd = "sudo " + cmd
-            cmd = "echo " + mark_start + "\n" + cmd + "\necho $'\\n'$?\n" + "echo " + mark_end + "\n"
+            cmd = ("printf '%s\\n' '{0}' ;\n".format(mark_start) +
+                cmd + 
+                " ;\nprintf '\\n%s\\n%s\\n' \"$?\" '{0}' ;\n".format(mark_end))
 
         display.vvvv(u"_wrap_command: '{0}'".format(to_text(cmd)), host=self.host)
         return cmd

DanielFallon avatar Oct 01 '20 17:10 DanielFallon

Im willing to try it because im stuck. My executing node is running py3. My endpoint is only running 2.7 (legacy os). So the plugin being py3 only should be fine. The py2 issues may just be with what ansible packages up and sends over to the endpoint

mikeneiderhauser avatar Oct 01 '20 18:10 mikeneiderhauser

@DanielFallon thanks for the patch. Had to change your ssm_timeout to timeout for the self.get_option call.

upon executing im getting stuck in a PRE remaining loop eg prompt not ready during gathering facts

mikeneiderhauser avatar Oct 01 '20 19:10 mikeneiderhauser

The change in get_option call is because my patch was for the development version instead of the currently released version (they just changed it from timeout to ssm_timeout)

can you post a redacted log with debug level at least 4? (-vvvv) I'd like to see which part of the setup loop it's getting stuck in.

There are sort of 3 phases: first, waits for it to receive the string:

Starting session with SessionId:

second, waits to receive the string:

stty -echo
$

and the third wait's to receive a custom generated end string:

<end_mark>

I expected this code to be a bit fragile but fail safe. Maybe I should add some more detail to the timeout exception

DanielFallon avatar Oct 01 '20 19:10 DanielFallon

I can grab you a log, but may take some time, my inventory script now isn't populating as expected which sets up the s3 bucket name, and instance id.

mikeneiderhauser avatar Oct 01 '20 20:10 mikeneiderhauser

it's also worth noting that I only tested this with a sort of hello world example. I wouldn't be surprised if there are still things that don't work. I do think that some sort of scripted regression testing would be very valuable for this as it is fragile and could be broken by:

  • changes to ansible
  • changes to the ssm session plugin
  • changes to the aws api
  • probably differences in a hosts' /etc/profile

all things to enumerate and test

DanielFallon avatar Oct 01 '20 21:10 DanielFallon

Does your patch apply to the 1.2 tagged release or on the main branch?

mikeneiderhauser avatar Oct 02 '20 13:10 mikeneiderhauser

Please see sanitized output. I took stock 1.2 and added just your patch and got the following

$ ansible-playbook $ANS_VAULT_FLAGS --limit TEST-GRP --user testuser --tags debug-vars common_aws_node.yml -vvvvvv
ansible-playbook 2.9.9
  config file = /home/user/ansibletest/ansible.cfg
  configured module search path = ['/home/user/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/user/venvs/ansible/lib64/python3.6/site-packages/ansible
  executable location = /home/user/venvs/ansible/bin/ansible-playbook
  python version = 3.6.8 (default, Apr  2 2020, 13:34:55) [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]
Using /home/user/ansibletest/ansible.cfg as config file
Reading vault password file: XXXX
Reading vault password file: XXXX
Reading vault password file: XXXX
setting up inventory plugins
host_list declined parsing /home/user/ansibletest/inventories/dynamic_inv.yml as it did not pass its verify_file() method
Start parsing DYN inventory
Blacklisted hosts
Done parsing DYN inventory
Parsed /home/user/ansibletest/inventories/dynamic_inv.yml inventory source with auto plugin
statically imported: /home/user/ansibletest/roles/basic-provisioning/tasks/10_install_start_aws_ssm.yml
Loading callback plugin default of type stdout, v2.0 from /home/user/venvs/ansible/lib64/python3.6/site-packages/ansible/plugins/callback/default.py

PLAYBOOK: common_aws_node.yml ************************************************************************************************************************************************************************************************************************
Positional arguments: common_aws_node.yml
verbosity: 6
remote_user: testuser
connection: smart
timeout: 10
become_method: sudo
tags: ('debug-vars',)
inventory: ('/home/user/ansibletest/inventories/dynamic_inv.yml',)
subset: TEST-GRP
vault_ids: XXXX
forks: 5
7 plays in common_aws_node.yml

PLAY [provision] *************************************************************************************************************************************************************************************************************************************
META: ran handlers
META: ran handlers
META: ran handlers

--- CUT --- (Skipped roles)

PLAY [debug-vars] ************************************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] *******************************************************************************************************************************************************************************************************************************
task path: /home/user/ansibletest/common_aws_node.yml:59
<10.x.x.x> ESTABLISH SSM CONNECTION TO: i-0AWS_INSTANCEID
<10.x.x.x> SSM COMMAND: ['/usr/local/bin/session-manager-plugin', '{"SessionId": "[email protected]", "TokenValue": "XXXCUTXXX", "StreamUrl": "wss://ssmmessages.us-east-1.amazonaws.com/v1/data-channel/[email protected]?role=publish_subscribe", "ResponseMetadata": {"RequestId": "XXX REQ ID XXX", "HTTPStatusCode": 200, "HTTPHeaders": {"server": "Server", "date": "Fri, 02 Oct 2020 14:49:59 GMT", "content-type": "application/x-amz-json-1.1", "content-length": "668", "connection": "keep-alive", "x-amzn-requestid": "XXX amzn id XXX"}, "RetryAttempts": 0}}', 'us-east-1', 'StartSession', '', '{"Target": "i-0AWS_INSTANCEID"}', 'https://ssm.us-east-1.amazonaws.com']
<10.x.x.x> PRE stdout line: b'\r\nStarting session with SessionId: [email protected]\r\n'
<10.x.x.x> PRE remaining: 60
<10.x.x.x> PRE stdout line: b'\r\nStarting session with SessionId: [email protected]\r\n\x1b[?1034hsh-4.2$ '
<10.x.x.x> PRE remaining: 59
<10.x.x.x> PRE remaining: 58
--- CUT ---
<10.x.x.x> PRE remaining: 3
<10.x.x.x> PRE remaining: 2
<10.x.x.x> PRE remaining: 1
<10.x.x.x> PRE timeout stdout: b'\r\nStarting session with SessionId: [email protected]\r\n\x1b[?1034hsh-4.2$ '
<10.x.x.x> ssm_retry: attempt: 0, cmd (echo ~testuser...), pausing for 0 seconds
<10.x.x.x> CLOSING SSM CONNECTION TO: i-0AWS_INSTANCEID
<10.x.x.x> TERMINATE SSM SESSION: [email protected]
<10.x.x.x> ESTABLISH SSM CONNECTION TO: i-0AWS_INSTANCEID
<10.x.x.x> SSM COMMAND: ['/usr/local/bin/session-manager-plugin', '{"SessionId": "[email protected]", "TokenValue": "XXXCUTXXX", "StreamUrl": "wss://ssmmessages.us-east-1.amazonaws.com/v1/data-channel/[email protected]?role=publish_subscribe", "ResponseMetadata": {"RequestId": "XXX REQ ID XXX", "HTTPStatusCode": 200, "HTTPHeaders": {"server": "Server", "date": "Fri, 02 Oct 2020 14:51:00 GMT", "content-type": "application/x-amz-json-1.1", "content-length": "668", "connection": "keep-alive", "x-amzn-requestid": "XXX amzn id XXX"}, "RetryAttempts": 0}}', 'us-east-1', 'StartSession', '', '{"Target": "i-0AWS_INSTANCEID"}', 'https://ssm.us-east-1.amazonaws.com']
<10.x.x.x> PRE stdout line: b'\r\nStarting session with SessionId: [email protected]\r\n'
<10.x.x.x> PRE remaining: 60
<10.x.x.x> PRE stdout line: b'\r\nStarting session with SessionId: [email protected]\r\n\x1b[?1034hsh-4.2$ '
<10.x.x.x> PRE remaining: 59
<10.x.x.x> PRE remaining: 58

I also added the bash shell patch as well (but changed bash -l to bash --noprofile -l) with a similar result

mikeneiderhauser avatar Oct 02 '20 17:10 mikeneiderhauser

I modified some of the re's that are in the patch (mainly removing the \r\n's). Now I'm hitting the same issue. The AnsiballZ_setup.py gets transferred and executed. It prints out the json for the facts, but i never get the end mark. As in I never see the second printf output

sudo sudo -H -S -n  -u root /bin/sh -c 'echo BECOME-SUCCESS-lzrbkhycxnugzysliwrhdukemyhhtyde ; /usr/bin/python /home/maintuser/.ansible/tmp/ansible-tmp-1601664494.6311371-10330-41749016017816/AnsiballZ_setup.py' ;
printf '\n%s\n%s\n' "$?" 'icjRpqBJJHvCxBxLeqZMYeJKhQ' ;

Then back to the retry loop

mikeneiderhauser avatar Oct 02 '20 18:10 mikeneiderhauser

the bash patch is definitely incompatible with mine. (because the regex would definitely have to change.)

does python ansible-tmp-1601664494.6311371-10330-41749016017816/AnsiballZ_setup.py open stdin? if yes, then it might be eating the printf command (meaning that the shell would never receive it.)

This is one of the things I was worried about in my comment above. I kind of want to look at some other connection plugins and see how they handle this, because I think a lot of pieces of this are pretty fragile and I want to see others ideas.

maybe instead of clearing the prompt, we could set it to an end_mark so that the shell will print it.

DanielFallon avatar Oct 02 '20 19:10 DanielFallon

I tried both sh and bash (with and w/o modding the re's) and both have similar behavior. I do think that AnsiballZ_setup.py is doing something to stdin.

mikeneiderhauser avatar Oct 02 '20 20:10 mikeneiderhauser

This is not an SSM specific solution, but I'm tempted to try out using SendSSHPublicKey from the EC2 instance connect api and then wrapping the ssh connection library instead. Feels like it might be generally more stable (and would support sftp as well)

SSM might still be necessary for the tunnel, but I'm not sure piping commands to stdin is ever really going to be partiuclarly stable.

Any reason why ssmStartSession is superior that I should know about?

DanielFallon avatar Oct 02 '20 22:10 DanielFallon

Looking at this a bit closer, it looks like the right way to address this if use of SSM is wanted is to add a parameter like ssm_exec_document_name. I had not gotten a chance to use SSM much prior to now and wasn't super familiar with its api. The document name in the current version of the api can define most of the behavior of how an interactive session starts, although it unfortunately seems to still start with a raw shell (see here: https://github.com/aws/amazon-ssm-agent/blob/b9654b268afcb7e70a9cc6c6d9b7d2a676f5b468/agent/session/plugins/shell/shell_unix.go#L53 )

Because customization of the shell is so fragile, it's probably worthwhile to provide a sane default and then allow users to override the --document-name provided to the session to configure things further. To facilitate this, the important part is to establish a protocol for communicating stdout and the exit code back to the client (preferably one that is slightly more robust than what is present now so that a user can always produce working results for their system)

I'll give it another half an hour today and post results

DanielFallon avatar Oct 04 '20 20:10 DanielFallon

@DanielFallon would be interested to see your results. Seems to be a step in the right direction.

mikeneiderhauser avatar Oct 05 '20 12:10 mikeneiderhauser

Here is a workaround that avoids sh entirely by starting a bash shell.

--- a/plugins/connection/aws_ssm.py	2020-09-03 18:43:43.818000000 +0200
+++ b/plugins/connection/aws_ssm.py	2020-09-03 18:43:19.805000000 +0200
@@ -288,11 +288,17 @@
 
         profile_name = ''
         region_name = self.get_option('region')
-        ssm_parameters = dict()
 
         client = boto3.client('ssm', region_name=region_name)
         self._client = client
-        response = client.start_session(Target=self.instance_id, Parameters=ssm_parameters)
+
+        if self.is_windows:
+            ssm_parameters = dict()
+            response = client.start_session(Target=self.instance_id, Parameters=ssm_parameters)
+        else:
+            ssm_parameters = {"command": ["bash -l"]}
+            response = client.start_session(Target=self.instance_id, DocumentName="AWS-StartInteractiveCommand", Parameters=ssm_parameters)
+
         self._session_id = response['SessionId']
 
         cmd = [

I think this solution work partially, because is not closing the sessions in session manager. And the system open one session for every one task in Ansible. This causes if I execute a playbook with 10 tasks, 10 session is opening and no closing xd ...

jiba21 avatar Oct 07 '20 08:10 jiba21

The above wont work if there is anything custom about the systems (or users) bash profile that changes what is presented on stdout. I've tried this even with --noprofile and it still hangs for me when gathering facts.

mikeneiderhauser avatar Oct 08 '20 12:10 mikeneiderhauser

@mikeneiderhauser hey btw I did rewrite this using paramiko ssh over the weekend. it's still a bit grody but should work. The biggest problem is that the connections aren't cached. so there are A LOT of aws API calls. I didn't realize that paramiko didn't support persistent connections. I'll have to switch it to borrow from the SSH connection plugin and leverage ssh pipelining instead. Hopefully I can detect whether or not a pipeline is open and not restart the connection from the logic of that plugin. If not, then I'll have to build a disk based cache of some kind for the key timeout.

(Also sorry for the extra cruft trying to cache things, I forgot that ansible forks to a new process for each connection) Here's what I tried and I think this finally should work just as well as the ssh plugin, but it will exhaust your aws api rate limit rather quickly. https://gist.github.com/DanielFallon/dffad373c688da32919e709d6738d715

DanielFallon avatar Oct 09 '20 22:10 DanielFallon

As another workaround, here's a proxy command that should just work to allow connections via ec2-instance-connect and then proxy via ssm:

https://gist.github.com/DanielFallon/45310cc76f46c1f1b2f7272d19b76312

^Technically this is significantly more efficient than the plugin in the gist because SSH Multiplexing will be used if it available. (and is by default)

DanielFallon avatar Oct 09 '20 23:10 DanielFallon

@DanielFallon thanks for this. ill take a look

mikeneiderhauser avatar Oct 12 '20 11:10 mikeneiderhauser

I'll turn this into a pull request later this week, but here's a plugin that uses ec2-instance-connect to push a temporary ssh key to the host and uses this to authenticate.

If paired with a proxycommand for ssm this should be pretty effective: https://gist.github.com/DanielFallon/67572f4439602774d02fbbe3bce8a5b9

DanielFallon avatar Oct 26 '20 14:10 DanielFallon

does python ansible-tmp-1601664494.6311371-10330-41749016017816/AnsiballZ_setup.py open stdin? if yes, then it might be eating the printf command (meaning that the shell would never receive it.)

Yes I'm finding that the python command eats the printf. I can get around it by adding an echo | {cmd}; to the wrapped command, and the mark end printf seems to appear.

Here's the full patch, making use of your previous patch @DanielFallon

I made some minor changes so we don't use the regexes, as they were being fickle.

diff --git a/playbooks/connection_plugins/aws_ssm2.py b/playbooks/connection_plugins/aws_ssm2.py
index 1c01dba..baa15f3 100644
--- a/playbooks/connection_plugins/aws_ssm2.py
+++ b/playbooks/connection_plugins/aws_ssm2.py
@@ -444,10 +444,94 @@ class Connection(ConnectionBase):
     def _prepare_terminal(self):
         """ perform any one-time terminal settings """

-        if not self.is_windows:
-            cmd = "stty -echo\n" + "PS1=''\n"
-            cmd = to_bytes(cmd, errors="surrogate_or_strict")
-            self._session.stdin.write(cmd)
+        # No windows setup for now
+        if self.is_windows:
+            return
+
+        # *_complete variables are 3 valued:
+        #   - None: not started
+        #   - False: started
+        #   - True: complete
+
+        startup_complete = False
+        disable_echo_complete = None
+        disable_echo_cmd = to_bytes("stty -echo\n", errors="surrogate_or_strict")
+
+        disable_prompt_complete = None
+        end_mark = "".join(
+            [random.choice(string.ascii_letters) for i in xrange(self.MARK_LENGTH)]
+        )
+        disable_prompt_cmd = to_bytes(
+            "PS1='' ; printf '\\n%s\\n' '" + end_mark + "'\n",
+            errors="surrogate_or_strict",
+        )
+        disable_prompt_reply = re.compile(
+            r"\r\r\n" + re.escape(end_mark) + r"\r\r\n", re.MULTILINE
+        )
+
+        stdout = ""
+        # Custom command execution for when we're waiting for startup
+        stop_time = int(round(time.time())) + self.get_option("ssm_timeout")
+        while (not disable_prompt_complete) and (self._session.poll() is None):
+            remaining = stop_time - int(round(time.time()))
+            if remaining < 1:
+                self._timeout = True
+                display.vvvv(
+                    "PRE timeout stdout: {0}".format(to_bytes(stdout)), host=self.host
+                )
+                raise AnsibleConnectionFailure(
+                    "SSM start_session timeout on host: %s" % self.instance_id
+                )
+            if self._poll_stdout.poll(1000):
+                stdout += to_text(self._stdout.read(1024))
+                display.vvvv(
+                    "PRE stdout line: {0}".format(to_bytes(stdout)), host=self.host
+                )
+            else:
+                display.vvvv("PRE remaining: {0}".format(remaining), host=self.host)
+
+            # wait til prompt is ready
+            if startup_complete is False:
+                match = str(stdout).find("Starting session with SessionId")
+                if match != -1:
+                    display.vvvv("PRE startup output received", host=self.host)
+                    startup_complete = True
+
+            # disable echo
+            if startup_complete and (disable_echo_complete is None):
+                display.vvvv(
+                    "PRE Disabling Echo: {0}".format(disable_echo_cmd), host=self.host
+                )
+                self._session.stdin.write(disable_echo_cmd)
+                disable_echo_complete = False
+
+            if disable_echo_complete is False:
+                match = str(stdout).find("stty -echo")
+                if match != -1:
+                    disable_echo_complete = True
+
+            # disable prompt
+            if disable_echo_complete and disable_prompt_complete is None:
+                display.vvvv(
+                    "PRE Disabling Prompt: {0}".format(disable_prompt_cmd),
+                    host=self.host,
+                )
+                self._session.stdin.write(disable_prompt_cmd)
+                disable_prompt_complete = False
+
+            if disable_prompt_complete is False:
+                match = disable_prompt_reply.search(stdout)
+                if match:
+                    stdout = stdout[match.end() :]
+                    disable_prompt_complete = True
+
+        if not disable_prompt_complete:
+            raise AnsibleConnectionFailure(
+                "SSM process closed during _prepare_terminal on host: %s"
+                % self.instance_id
+            )
+        else:
+            display.vvv("PRE Terminal configured", host=self.host)

     def _wrap_command(self, cmd, sudoable, mark_start, mark_end):
         """ wrap command so stdout and status can be extracted """
@@ -460,14 +544,9 @@ class Connection(ConnectionBase):
             if sudoable:
                 cmd = "sudo " + cmd
             cmd = (
-                "echo "
-                + mark_start
-                + "\n"
-                + cmd
-                + "\necho $'\\n'$?\n"
-                + "echo "
-                + mark_end
-                + "\n"
+                f"printf '%s\\n' '{mark_start}';\n"
+                f"echo | {cmd};\n"
+                f"printf '\\n%s\\n%s\\n' \"$?\" '{mark_end}';\n"
             )

         display.vvvv(u"_wrap_command: '{0}'".format(to_text(cmd)), host=self.host)

I put the patched version of aws_ssm2.py into my playbooks directory in connection_plugins/ dir, set ansible_connection: aws_ssm2, and was able to successfully use ssm to run playbooks against my Ubuntu EC2 instance.

thomas-anderson-bsl avatar Feb 25 '21 06:02 thomas-anderson-bsl

Hi, is there an official solution? I am facing the same problem.

jonormann avatar Apr 21 '21 04:04 jonormann

Hi,

is there an official solution?

I am facing the same problem.

Same

wilinger avatar Apr 29 '21 17:04 wilinger

does python ansible-tmp-1601664494.6311371-10330-41749016017816/AnsiballZ_setup.py open stdin? if yes, then it might be eating the printf command (meaning that the shell would never receive it.)

Yes I'm finding that the python command eats the printf. I can get around it by adding an echo | {cmd}; to the wrapped command, and the mark end printf seems to appear.

Here's the full patch, making use of your previous patch @DanielFallon

I made some minor changes so we don't use the regexes, as they were being fickle.

[...]

I put the patched version of aws_ssm2.py into my playbooks directory in connection_plugins/ dir, set ansible_connection: aws_ssm2, and was able to successfully use ssm to run playbooks against my Ubuntu EC2 instance.

Even with the patched code from https://github.com/Filirom1/community.aws/blob/cb79826540fc58a2d13c1d95dde91c7544578748/plugins/connection/aws_ssm.py#L524 and your instructions, I still experience an issue:

ansible-playbook -i inventory_aws_ssm.yml -c aws_ssm2 playbook.yml

#playbook.yml
---
- name: Test command
  gather_facts: false
  hosts: all
  vars:
    ansible_connection: aws_ssm2
    ansible_aws_ssm_region: eu-central-1
  tasks:
    - name: test
      command:
        cmd: ls -l
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: TypeError: expected string or bytes-like object
fatal: [instance-id]: FAILED! => {"msg": "Unexpected failure during module execution.", "stdout": ""}

Update fixing the new issue: TLDR; Use the patched codewith @thomas-anderson-bsl instructions and ALSO set the var.ansible_aws_ssm_bucket_name. The bucket name is required and there is no graceful exit if it's not supplied.

Apparently the plugin requires the ansible_aws_ssm_bucket_name to be set, otherwise the run fails. I am still not sure why it tries to run _file_transport_command on a cmd: ls -l command.. Maybe someone else can clarify that. Anyway, with the patched code and setting the bucket variable - make sure your account has access rights to it-, I can connect to the instance 👍

lassebenni avatar May 06 '21 15:05 lassebenni