pywinrm
pywinrm copied to clipboard
Most WinRM operations can fail on low timeout
Pywinrm 0.2.0 lowered the default operation timeouts from 60s to 20s (and the read timeout to 30s) to allow for faster failures on dead hosts, etc. This generally isn't a problem, but for newly-created (and possibly for heavily-loaded) instances on AWS, the lower timeout can expire before the initial shell gets created (and possibly in other places).
Rather than lowering the timeout, we could add polling for it to other places known to run long (starting with open_shell) similar to what we do during receive.
The failure manifests as a 500, eg:
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/ansible/plugins/connection/winrm.py", line 151, in _winrm_connect
self.shell_id = protocol.open_shell(codepage=65001) # UTF-8
File "/usr/local/lib/python2.7/dist-packages/winrm/protocol.py", line 132, in open_shell
res = self.send_message(xmltodict.unparse(req))
File "/usr/local/lib/python2.7/dist-packages/winrm/protocol.py", line 207, in send_message
return self.transport.send_message(message)
File "/usr/local/lib/python2.7/dist-packages/winrm/transport.py", line 191, in send_message
raise WinRMTransportError('http', error_message)
WinRMTransportError: (u'http', u'Bad HTTP response returned from server. Code 500')
See https://github.com/ansible/ansible/issues/16873 for original issue.
We should probably also add checks for any 500 errors to see if there's still readable SOAP, and if so, propagate the error message contained inside.
BTW I believe this is causing the scan jobs for windows machines on tower 3.0.1 to also fail.
@smapjb That shouldn't be the case- this is a very specific failure to the first request against a brand-new machine (so any Ansible winrm op would fail, not just scan jobs)... The actual execution of a module is not subject to this failure (we already have retries there). Unless the scan job happens to be the first thing to run against a new machine...
I'm struggling with this error. I spin up new windows 2K8R2 and 2K12R2 machines with ConfigureRemoting.ps1 script already provisioned in the machine templates. I have the workaround in my inventory / group vars. The first task on the host immediately replies with this error (no timeout seems to be happening.)
If I run a script file (below) I created to check response to the open port, I get immediate feedback that the host is online. I cannot ping the host (since ipmc response is turned off at the hosts firewall).
I have also tried setting "winrm set winrm/config/service @{AllowUnencrypted="true"}"
in the Powershell, to no avail.
Are there any other things I could try?
#!/bin/bash
if [ $# -eq 0 ]; then
echo "please provide hostname or ip address"
exit 1
fi
url=$1
output=$(curl -s -f -k -m 10 --header "Content-Type: application/soap+xml;charset=UTF-8" --header "WSMANIDENTIFY: unauthenticated" http://${url}:5985/wsman --data "<s:Envelope xmlns:s=http://www.w3.org/2003/05/soap-envelope xmlns:wsmid=http://schemas.dmtf.org/wbem/wsman/identity/1/wsmanidentity.xsd><s:Header/><s:Body><wsmid:Identify/></s:Body></s:Envelope>" || true);
if [[ $output == *"IdentifyResponse"* ]]
then
echo "Online";
else
echo "Offline";
fi
We also found such a problem last week when network was slow/not stable. The same 500 error code. I've changed default operation timeout/ read timeout and hope it'll help to resolve issue.
I've done a small script that was pinging machine restored from snapshot on esxi. When it succeeds, I start copying zip file to this node. Here is the output in case you're interested:
[18:35:37][Step 9/13] --- 192.168.0.52 ping statistics ---
[18:35:37][Step 9/13] 3 packets transmitted, 3 received, 0% packet loss, time 2000ms
[18:35:37][Step 9/13] rtt min/avg/max/mdev = 0.314/14.519/42.782/19.985 ms
[18:35:37][Step 9/13] ping success
[18:35:38][Step 9/13] Using /etc/ansible/ansible.cfg as config file
[18:35:38][Step 9/13] Loading callback plugin default of type stdout, v2.0 from /usr/lib/python2.7/dist-packages/ansible/plugins/callback/__init__.pyc
[18:35:38][Step 9/13]
[18:35:38][Step 9/13] PLAYBOOK: copy_repo.yml ********************************************************
[18:35:38][Step 9/13] 1 plays in copy_repo.yml
[18:35:38][Step 9/13]
[18:35:38][Step 9/13] PLAY [192.168.0.52] ************************************************************
[18:35:38][Step 9/13]
[18:35:38][Step 9/13] TASK [Gathering Facts] *********************************************************
[18:35:38][Step 9/13] Using module file /usr/lib/python2.7/dist-packages/ansible/modules/windows/setup.ps1
[18:35:38][Step 9/13] <192.168.0.52> ESTABLISH WINRM CONNECTION FOR USER: Administrator on PORT 5986 TO 192.168.0.52
[18:35:39][Step 9/13] EXEC (via pipeline wrapper)
[18:35:59][Step 9/13] [WARNING]: FATAL ERROR DURING FILE TRANSFER: Traceback (most recent call
[18:35:59][Step 9/13] last): File "/usr/lib/python2.7/dist-
[18:35:59][Step 9/13] packages/ansible/plugins/connection/winrm.py", line 267, in _winrm_exec
[18:35:59][Step 9/13] self._winrm_send_input(self.protocol, self.shell_id, command_id, data,
[18:35:59][Step 9/13] eof=is_last) File "/usr/lib/python2.7/dist-
[18:35:59][Step 9/13] packages/ansible/plugins/connection/winrm.py", line 248, in _winrm_send_input
[18:35:59][Step 9/13] protocol.send_message(xmltodict.unparse(rq)) File "/usr/local/lib/python2.7
[18:35:59][Step 9/13] /dist-packages/winrm/protocol.py", line 207, in send_message return
[18:35:59][Step 9/13] self.transport.send_message(message) File "/usr/local/lib/python2.7/dist-
[18:35:59][Step 9/13] packages/winrm/transport.py", line 202, in send_message raise
[18:35:59][Step 9/13] WinRMTransportError('http', error_message) WinRMTransportError: (u'http', u'Bad
[18:35:59][Step 9/13] HTTP response returned from server. Code 500')
[18:35:59][Step 9/13] fatal: [192.168.0.52]: FAILED! => {
[18:35:59][Step 9/13] "failed": true,
[18:35:59][Step 9/13] "msg": "winrm send_input failed"
[18:35:59][Step 9/13] }
[18:35:59][Step 9/13] to retry, use: --limit @/home/test/buildAgent/work/2165bffc95ace925/my_repo/copy_repo.retry
[18:35:59][Step 9/13]
[18:35:59][Step 9/13] PLAY RECAP *********************************************************************
[18:35:59][Step 9/13] 192.168.0.52 : ok=0 changed=0 unreachable=0 failed=1
@KroArtem pinging is probably not enough when starting a server up as the WinRM services still need to come online for it to start. You would be better off
- Using wait_for and waiting for port 5986 to come online
- Finally using wait_for_connection to try and connect to the host once the port is online.
Hopefully the solves your issues.
@jborean93, usually it works ok as windows server is almost clean with some small tweaks so it loads really fast. I had problems only when we had network issues and increasing timeout helped.
Anyway, your advice look really helpful and reasonable and I'll use it later, thanks!
Also, I was getting the error above intermittently after startup. Turns out a process on the Windows server was pegging the CPU and as a result the call was hanging (and if it didn't get the error on first connection, every command would take a really long time). Just throwing this up here in case the intrepid Googler stumbles upon this.
I was struggling with exception WinRMTransportError before, especially send WinRM command to the machine just bootup. After digging into the real error message that WinRM sent back. I list down every error happened in my environment and if it happened, I will retry again.
Illegal operation attempted on a registry key that has been marked for deletion.
similar issue also report here: https://github.com/test-kitchen/winrm-transport/issues/18
<f:WSManFault xmlns:f="http://schemas.microsoft.com/wbem/wsman/1/wsmanfault" Code="2147943418" Machine="windows-host">
<f:Message>
<f:ProviderFault provider="Shell cmd plugin" path="%systemroot%\\system32\\winrscmd.dll">Illegal operation attempted on a registry key that has been marked for deletion. </f:ProviderFault>
</f:Message>
</f:WSManFault>
Class not registered
dont know the root cause till now
<s:Detail>
<f:WSManFault xmlns:f="http://schemas.microsoft.com/wbem/wsman/1/wsmanfault" Code="2147746132" Machine="windows-host">
<f:Message>
<f:ProviderFault provider="Shell cmd plugin" path="%systemroot%\\system32\\winrscmd.dll">Class not registered </f:ProviderFault>
</f:Message>
</f:WSManFault>
</s:Detail>
socket read timeout
I guess the windows machine is busy and not able to response in a short given timeout
OperationTimeout
This error is handled by pywinrm itself here. however, sometime this error may only contains the error code 2150858793 but without the string http://schemas.microsoft.com/wbem/wsman/1/windows/shell/Receive So I will use error code to capture this error and retry again
<s:Detail>
<f:WSManFault xmlns:f="http://schemas.microsoft.com/wbem/wsman/1/wsmanfault" Code="2150858793" Machine="windows-host">
<f:Message>The WS-Management service cannot complete the operation within the time specified in OperationTimeout. </f:Message>
</f:WSManFault>
</s:Detail>
I faced the same issue when using with Kerberos Auth. I already set ansible_winrm_read_timeout_sec=300 and ansible_winrm_operation_timeout_sec=290. But it cannot resolve. The issue occurs sometimes, but it doesn't occur sometimes. I mean not only for the first time. The errors detail is below:
redirecting (type: modules) ansible.builtin.win_ping to ansible.windows.win_ping Using module file /root/.local/lib/python3.8/site-packages/ansible_collections/ansible/windows/plugins/modules/win_ping.ps1 Pipelining is enabled. <N2N-W-CD-SITQA2.N2NSIT.LOCAL> ESTABLISH WINRM CONNECTION FOR USER: [email protected] on PORT 5986 TO N2N-W-CD-SITQA2.N2NSIT.LOCAL creating Kerberos CC at /tmp/tmpvy1pt4n2 calling kinit with pexpect for principal [email protected] kinit succeeded for principal [email protected] <N2N-W-CD-SITQA2.N2NSIT.LOCAL> WINRM CONNECT: transport=kerberos endpoint=https://N2N-W-CD-SITQA2.N2NSIT.LOCAL:5986/wsman <N2N-W-CD-SITQA2.N2NSIT.LOCAL> WINRM CONNECTION ERROR: Server execution failed (extended fault data: {'transport_message': 'Bad HTTP response returned from server. Code 500', 'http_status_code': 500, 'wsmanfault_code': '2148007941', 'fault_code': 's:Receiver', 'fault_subcode': 'w:InternalError'}) Traceback (most recent call last): File "/usr/lib/python3/dist-packages/winrm/transport.py", line 271, in _send_message_request response.raise_for_status() File "/usr/local/lib/python3.8/dist-packages/requests/models.py", line 1021, in raise_for_status raise HTTPError(http_error_msg, response=self) requests.exceptions.HTTPError: 500 Server Error: for url: https://n2n-w-cd-sitqa2.n2nsit.local:5986/wsman
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/usr/lib/python3/dist-packages/winrm/protocol.py", line 234, in send_message resp = self.transport.send_message(message) File "/usr/lib/python3/dist-packages/winrm/transport.py", line 265, in send_message response = self._send_message_request(prepared_request, message) File "/usr/lib/python3/dist-packages/winrm/transport.py", line 282, in _send_message_request raise WinRMTransportError('http', ex.response.status_code, response_text) winrm.exceptions.WinRMTransportError: Bad HTTP response returned from server. Code 500
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/root/.local/lib/python3.8/site-packages/ansible/plugins/connection/winrm.py", line 449, in _winrm_connect self.shell_id = protocol.open_shell(codepage=65001) # UTF-8 File "/usr/lib/python3/dist-packages/winrm/protocol.py", line 157, in open_shell res = self.send_message(xmltodict.unparse(req)) File "/usr/lib/python3/dist-packages/winrm/protocol.py", line 272, in send_message raise WinRMError('{0} (extended fault data: {1})'.format(error_message, fault_data)) winrm.exceptions.WinRMError: Server execution failed (extended fault data: {'transport_message': 'Bad HTTP response returned from server. Code 500', 'http_status_code': 500, 'wsmanfault_code': '2148007941', 'fault_code': 's:Receiver', 'fault_subcode': 'w:InternalError'}) fatal: [N2N-W-CD-SITQA2]: UNREACHABLE! => { "changed": false, "msg": "kerberos: Server execution failed (extended fault data: {'transport_message': 'Bad HTTP response returned from server. Code 500', 'http_status_code': 500, 'wsmanfault_code': '2148007941', 'fault_code': 's:Receiver', 'fault_subcode': 'w:InternalError'})", "unreachable": true }
Can someone help me with it? ps: I already upgrade to pywinrm 0.4.3