SSH connect failures on Mitogen 0.2.9 on WSL Ubuntu 18.04

Open gchaix opened this issue 5 years ago • 18 comments

I'm seeing consistent failures when trying to connect via SSH when multiple hosts are specified in the inventory:

TASK [Gathering Facts] **********************************************************************************************************************************************
ERROR! [mux  15260] 10:54:20.330539 E mitogen: <Stream ssh.stage-web1 #6e10> crashed
Traceback (most recent call last):
  File "/home/gchaix/repos/xxx/ansible/plugins/mitogen-0.2.9/mitogen/core.py", line 3481, in _call
    func(self)
  File "/home/gchaix/repos/xxx/ansible/plugins/mitogen-0.2.9/mitogen/core.py", line 1719, in on_transmit
    self.protocol.on_transmit(broker)
  File "/home/gchaix/repos/xxx/ansible/plugins/mitogen-0.2.9/mitogen/core.py", line 2167, in on_transmit
    self._writer.on_transmit(broker)
  File "/home/gchaix/repos/xxx/ansible/plugins/mitogen-0.2.9/mitogen/core.py", line 1907, in on_transmit
    written = self._protocol.stream.transmit_side.write(buf)
  File "/home/gchaix/repos/xxx/ansible/plugins/mitogen-0.2.9/mitogen/core.py", line 2033, in write 
    written, disconnected = io_op(os.write, self.fd, s)
  File "/home/gchaix/repos/xxx/ansible/plugins/mitogen-0.2.9/mitogen/core.py", line 553, in io_op
    return func(*args), None
OSError: [Errno 11] Resource temporarily unavailable
fatal: [stage-web1]: UNREACHABLE! => {"changed": false, "msg": "Mitogen was disconnected from the remote environment while a call was in-progress. If you feel this is in error, please file a bug. Original error was: the respondent Context has disconnected", "unreachable": true}
ok: [stage-web2] 

One host connects; all of the other host connections fail. If there are more than two hosts in the inventory, all but one fail with the same errors. Repeated runs suggest that which host succeeds is random.
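For readers unfamiliar with the error in the traceback: on Linux, errno 11 is EAGAIN, which `os.write` raises when a file descriptor is in non-blocking mode and the kernel buffer cannot accept more data right now. The following is a minimal, generic sketch of that behaviour and of the conventional way to handle it (wait for writability, then retry). It is not Mitogen's code and makes no claim about the root cause on WSL; the helper names `make_nonblocking_pipe`, `fill_until_eagain` and `write_all` are hypothetical and exist only for this illustration.

```python
import errno
import os
import select


def make_nonblocking_pipe():
    """Return (read_fd, write_fd) with the write end set to non-blocking."""
    r, w = os.pipe()
    os.set_blocking(w, False)
    return r, w


def fill_until_eagain(w, chunk=b"x" * 65536):
    """Write until the kernel pipe buffer is full; return the errno raised."""
    try:
        while True:
            os.write(w, chunk)
    except OSError as exc:
        # Python 3 raises BlockingIOError, an OSError subclass.
        return exc.errno


def write_all(fd, data, timeout=5.0):
    """Write all of data, waiting for writability whenever EAGAIN occurs."""
    view = memoryview(data)
    while view:
        try:
            written = os.write(fd, view)
            view = view[written:]
        except OSError as exc:
            if exc.errno != errno.EAGAIN:
                raise
            # Buffer is full: wait until the descriptor is writable again.
            _, writable, _ = select.select([], [fd], [], timeout)
            if not writable:
                raise TimeoutError("descriptor did not become writable")


if __name__ == "__main__":
    r, w = make_nonblocking_pipe()
    print("errno raised:", fill_until_eagain(w))
    os.read(r, 65536)        # drain some data so the pipe has room again
    write_all(w, b"hello")   # now succeeds without raising
    os.close(r)
    os.close(w)
```

An event loop normally only writes after the descriptor has been reported writable, so seeing EAGAIN at that point suggests the readiness notification and the actual state of the descriptor disagreed. Whether that is what happens here is an assumption, not something established in this thread.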

PLAY RECAP **********************************************************************************************************************************************************
prod-solr1 : ok=0    changed=0    unreachable=1    failed=0
prod-solr2 : ok=0    changed=0    unreachable=1    failed=0
prod-solr3 : ok=0    changed=0    unreachable=1    failed=0
prod-util1 : ok=8    changed=0    unreachable=0    failed=0
prod-web1  : ok=0    changed=0    unreachable=1    failed=0
prod-web2  : ok=0    changed=0    unreachable=1    failed=0
prod-web3  : ok=0    changed=0    unreachable=1    failed=0

Environment:

Mitogen 0.2.9
Windows 10 Pro, V. 1809, OS build 17763.914
WSL Ubuntu 18.04.3 LTS
ansible 2.7.11
  config file = /home/gchaix/repos/xxx/ansible/ansible.cfg
  configured module search path = [u'/home/gchaix/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /home/gchaix/.local/lib/python2.7/site-packages/ansible
  executable location = /home/gchaix/.local/bin/ansible
  python version = 2.7.15+ (default, Oct 7 2019, 17:39:04) [GCC 7.4.0]

Host target OS is generally CentOS 7.x but this also appears to be happening with other distros (Ubuntu, etc.)

No patches on Ansible or Mitogen. I tried running it with Mitogen current master, same behavior. This feels like it might be related to #319 but I'm not familiar enough with the internals of WSL to really say for certain. Interestingly, running Ansible with -vvv seems to bypass the issue, as all host connections succeed, whereas running with just --verbose produces failure and the output above.

gchaix avatar Jan 13 '20 19:01 gchaix

Hi,

We are experiencing the exact same issue when running a playbook in WSL with Ubuntu over multiple hosts. There are no issues when running a playbook with a single host or when running with -vvv over multiple hosts.

Edit: Running with MITOGEN_ROUTER_DEBUG=1 also "solves" the problem without having to use -vvv but leaves a log file behind on each target host.
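The two workarounds reported so far (`-vvv`, or `MITOGEN_ROUTER_DEBUG=1` in the environment) can be scripted. Below is a small sketch that builds the command line and environment for either variant. The function name `workaround_invocation` and the playbook and inventory file names are placeholders invented for this example; only the `-vvv` flag and the `MITOGEN_ROUTER_DEBUG` variable come from the reports above.

```python
import os


def workaround_invocation(playbook, inventory, use_router_debug=True):
    """Build (argv, env) for ansible-playbook with one of the two workarounds.

    use_router_debug=True  -> set MITOGEN_ROUTER_DEBUG=1 (leaves a log file
                              on each target host, as noted above)
    use_router_debug=False -> pass -vvv instead
    """
    env = dict(os.environ)
    argv = ["ansible-playbook", "-i", inventory, playbook]
    if use_router_debug:
        env["MITOGEN_ROUTER_DEBUG"] = "1"
    else:
        argv.append("-vvv")
    return argv, env


# Example (file names are placeholders):
#   import subprocess
#   argv, env = workaround_invocation("site.yml", "inventory.ini")
#   subprocess.call(argv, env=env)
```

Both variants only hide the symptom; they do not explain why the connection fails without them.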

I would gladly help out with additional troubleshooting but I need some pointers on where to start.

Environment:

WSL/Ubuntu: Ubuntu 18.04.1 LTS
Windows 10 V. 1809, OS build 18363.592
Ansible: 2.9.4
Mitogen: 0.2.9

atoom avatar Feb 11 '20 08:02 atoom

Same thing (

konstantin-kornienko avatar Apr 17 '20 13:04 konstantin-kornienko

Same here: a single connection works fine (--limit to a single host); otherwise I get the same error.

Using WSL1 Debian Buster

kevinvalk avatar Apr 22 '20 10:04 kevinvalk

Could someone try latest master again? I don't have a WSL env to test with unfortunately :( I have noticed other unrelated tasks have failed though with different amounts of -v applied; perhaps it's a bigger issue than specifically WSL-related 🤔

s1113950 avatar Apr 23 '20 23:04 s1113950

I'm still seeing failures on master @ a5fe4a9f

ansible-playbook 2.9.6
  config file = /home/gchaix/repos/project/ansible/ansible.cfg
  configured module search path = [u'/home/gchaix/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /home/gchaix/.local/lib/python2.7/site-packages/ansible
  executable location = /home/gchaix/.local/bin/ansible-playbook
  python version = 2.7.17 (default, Apr 15 2020, 17:20:14) [GCC 7.5.0]
Using /home/gchaix/repos/project/ansible/ansible.cfg as config file
TASK [Gathering Facts] *******************************************************************************************************************************************************************
ERROR! [mux  734] 12:05:11.015470 E mitogen: <Stream ssh.stage-web2.bak #1050> crashed
Traceback (most recent call last):
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 3481, in _call
    func(self)
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 1719, in on_transmit
    self.protocol.on_transmit(broker)
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 2167, in on_transmit
    self._writer.on_transmit(broker)
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 1907, in on_transmit
    written = self._protocol.stream.transmit_side.write(buf)
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 2033, in write
    written, disconnected = io_op(os.write, self.fd, s)
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 553, in io_op
    return func(*args), None
OSError: [Errno 11] Resource temporarily unavailable
fatal: [stage-web2.bak]: UNREACHABLE! => {"changed": false, "msg": "Mitogen was disconnected from the remote environment while a call was in-progress. If you feel this is in error, please file a bug. Original error was: the respondent Context has disconnected", "unreachable": true}
ERROR! [mux  734] 12:05:11.303791 E mitogen: <Stream ssh.stage-web1.bak #b8d0> crashed
Traceback (most recent call last):
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 3481, in _call
    func(self)
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 1719, in on_transmit
    self.protocol.on_transmit(broker)
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 2167, in on_transmit
    self._writer.on_transmit(broker)
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 1907, in on_transmit
    written = self._protocol.stream.transmit_side.write(buf)
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 2033, in write
    written, disconnected = io_op(os.write, self.fd, s)
  File "/home/gchaix/repos/project/ansible/plugins/mitogen-head/mitogen/core.py", line 553, in io_op
    return func(*args), None
OSError: [Errno 11] Resource temporarily unavailable
fatal: [stage-web1.bak]: UNREACHABLE! => {"changed": false, "msg": "Mitogen was disconnected from the remote environment while a call was in-progress. If you feel this is in error, please file a bug. Original error was: the respondent Context has disconnected", "unreachable": true}
ok: [prod-util1.bak]

gchaix avatar Apr 27 '20 19:04 gchaix

Does anyone know if there's a way to get a WSL machine to test with? We use Azure Devops to test but afaik there's no WSL env we can enable

s1113950 avatar Apr 28 '20 18:04 s1113950

> Does anyone know if there's a way to get a WSL machine to test with? We use Azure Devops to test but afaik there's no WSL env we can enable

You can probably run the azure devops agent inside a WSL instance and use that as the agent pool in your devops pipeline.

arnemorten avatar Apr 29 '20 18:04 arnemorten

Also reproducible for me most of the time, it seems much more prone to doing it on "copy" tasks for some reason.

I'm surprised because it was all working fine a while ago, so I suspect WSL has updated or something.

If I can help with any debug details let me know and I will try.

rdghickman avatar Jul 01 '20 14:07 rdghickman

>> Does anyone know if there's a way to get a WSL machine to test with? We use Azure Devops to test but afaik there's no WSL env we can enable

> You can probably run the azure devops agent inside a WSL instance and use that as the agent pool in your devops pipeline.

We'd need a WSL instance for that right? 🤔 is there an OSS-supported test env (like Travis, Circle, Azure devops, etc) that offer WSL instances?

s1113950 avatar Jul 01 '20 20:07 s1113950

> Also reproducible for me most of the time, it seems much more prone to doing it on "copy" tasks for some reason.

> I'm surprised because it was all working fine a while ago, so I suspect WSL has updated or something.

> If I can help with any debug details let me know and I will try.

I wonder if WSL added a timeout on connection or something? 🤔 The error "the respondent Context has disconnected" indicates that the connection was broken somehow. Did it work for WSL1 but not WSL2?

s1113950 avatar Jul 01 '20 20:07 s1113950

I'm still on WSL1 and definitely seeing the problem. Sadly, I don't know of any test envs that provide WSL instances to test.

gchaix avatar Jul 01 '20 21:07 gchaix

Could it be due to an ssh timeout error maybe? I found https://www.reddit.com/r/bashonubuntuonwindows/comments/bj617c/how_to_keep_wsl_shell_open_when_ssh_session/ . Wild shot in the dark but if it used to work with the same code and now doesn't then maybe WSL changed their default ssh session connection time?

s1113950 avatar Jul 01 '20 21:07 s1113950

I'll dig through the linked post and do some experimenting, but on an initial look it doesn't seem to apply, as there is no delay at all between the success and the failures. One - and only one - random machine always succeeds and the others fail immediately. It feels more like it is trying to open a bunch of SSH connections in parallel, but only one is allowed and the rest are immediately rejected by the underlying subsystems (networking, maybe?). It's important to note that, for me at least, I'm not sure it ever worked properly. I don't think I tried connecting to an inventory with multiple hosts on WSL before encountering this problem.
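One way to check the "parallel connections are rejected" hypothesis outside of Ansible and Mitogen is to open plain TCP connections to the SSH port of several hosts at the same time and see whether any fail. The sketch below does only that; it does not perform an SSH handshake, so a clean result rules out the TCP layer but not SSH itself. The function names `probe` and `probe_all` are hypothetical, and the host names in the usage comment are simply examples taken from the logs earlier in this thread.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def probe(host, port=22, timeout=5.0):
    """Try one TCP connection; return (host, None) or (host, error text)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return host, None
    except OSError as exc:
        return host, str(exc)


def probe_all(hosts, port=22, timeout=5.0):
    """Probe all hosts concurrently, one thread per host."""
    hosts = list(hosts)
    with ThreadPoolExecutor(max_workers=max(1, len(hosts))) as pool:
        return list(pool.map(lambda h: probe(h, port, timeout), hosts))


# Example (host names as they appear in the logs above):
#   for host, err in probe_all(["stage-web1", "stage-web2"]):
#       print(host, "ok" if err is None else "FAILED: " + err)
```

If every probe succeeds while the playbook still fails, the networking layer is less likely to be the culprit than the local pipe or socket handling between the controller process and the ssh child.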

gchaix avatar Jul 01 '20 22:07 gchaix

Ok. I'm not too sure why the underlying subsystems would be rejecting the other connections 😞 maybe @dw knows? He fixed WSL stuff last time: https://github.com/dw/mitogen/commit/22bab87821a02ed8cb6b3eb4b52c766a8f5cfee7 and https://github.com/dw/mitogen/commit/56943d3141c95a25b376d4dcfe01741d22f78bdf . I do see other ssh-related WSL issues have been filed in the past: https://github.com/microsoft/WSL/issues/3503, not sure if relevant though.

s1113950 avatar Jul 01 '20 22:07 s1113950

Just as an additional point, I am seeing the failures and I am only targeting a single host. I agree it seems like a very quick failure.

rdghickman avatar Jul 08 '20 07:07 rdghickman

Anyone tried WSL2 yet with this?

rdghickman avatar Jul 31 '20 11:07 rdghickman

Just to chime in with a possible workaround, I was able to work around this by disabling the Windows Defender firewall. I'm not sure why that solves it. All prior steps in the playbook execute successfully. I can also confirm the LAN IP the playbook was run against is accessible with both the firewall on and off.

The task in the playbook is:

- name: Upload redacted package
  copy:
    dest: "/tmp/"
    src: "{{ latest_redacted_builds[ansible_distribution][ansible_distribution_major_version] }}"
    backup: yes
    owner: root
    group: root
  register: redacted_upload
  tags: [config, redacted-binary]

And the backtrace from the failed execution of the task is:

TASK [redacted: Upload redacted package] ****************************************************************************************************
ERROR! [mux  4321] 13:30:16.461182 E mitogen: <Stream ssh.192.168.122.236 #7c10> crashed
Traceback (most recent call last):
  File "/mnt/c/ansible/plugins/mitogen-0.2.10-rc.0/mitogen/core.py", line 3481, in _call
    func(self)
  File "/mnt/c/ansible/plugins/mitogen-0.2.10-rc.0/mitogen/core.py", line 1719, in on_transmit
    self.protocol.on_transmit(broker)
  File "/mnt/c/ansible/plugins/mitogen-0.2.10-rc.0/mitogen/core.py", line 2167, in on_transmit
    self._writer.on_transmit(broker)
  File "/mnt/c/ansible/plugins/mitogen-0.2.10-rc.0/mitogen/core.py", line 1907, in on_transmit
    written = self._protocol.stream.transmit_side.write(buf)
  File "/mnt/c/ansible/plugins/mitogen-0.2.10-rc.0/mitogen/core.py", line 2033, in write
    written, disconnected = io_op(os.write, self.fd, s)
  File "/mnt/c/ansible/plugins/mitogen-0.2.10-rc.0/mitogen/core.py", line 553, in io_op
    return func(*args), None
OSError: [Errno 11] Resource temporarily unavailable
fatal: [192.168.122.236]: UNREACHABLE! => {
    "changed": false,
    "unreachable": true
}

My platform is WSL1 with Ubuntu 18.04.3 LTS, on Windows 10 1904.985.

asantoni avatar Jun 03 '21 18:06 asantoni

Hello, same issue here on a more recent configuration with WSL 1 and Ubuntu 20.04. Tested with mitogen tag v2.10rc1 (also tested 0.2.9 unsuccessfully). An example of the error message: (screenshot: bugmitogen1). As for the others, the -vvv option works well, but without it Mitogen seems to pick a single host on which the Ansible tasks run. Hope it helps.

ginolegigot avatar Oct 25 '21 10:10 ginolegigot