mitogen 0.3.3 + ansible 2.12.8+: Broker has exitted
Hi,
I'm experiencing a strange issue when using ansible 2.12.8 and later with mitogen 0.3.3.
When running my (quite long-running) playbook on more than 8 hosts, mitogen exits on seemingly random tasks (e.g. hostname, systemd, service, template, make, …), but on all hosts at the same time, with:
Traceback (most recent call last):
  File "/home/myuser/tmp/ansible/lib/ansible/executor/task_executor.py", line 158, in run
    res = self._execute()
  File "/home/myuser/tmp/ansible/lib/ansible/executor/task_executor.py", line 605, in _execute
    result = self._handler.run(task_vars=variables)
  File "/home/myuser/playbooks/plugins/strategy/mitogen/ansible_mitogen/mixins.py", line 146, in run
    return super(ActionModuleMixin, self).run(tmp, task_vars)
  File "/home/myuser/tmp/ansible/lib/ansible/plugins/action/normal.py", line 47, in run
    result = merge_hash(result, self._execute_module(task_vars=task_vars, wrap_async=wrap_async))
  File "/home/myuser/playbooks/plugins/strategy/mitogen/ansible_mitogen/mixins.py", line 376, in _execute_module
    self._set_temp_file_args(module_args, wrap_async)
  File "/home/myuser/playbooks/plugins/strategy/mitogen/ansible_mitogen/mixins.py", line 355, in _set_temp_file_args
    self._connection.get_good_temp_dir()
  File "/home/myuser/playbooks/plugins/strategy/mitogen/ansible_mitogen/connection.py", line 832, in get_good_temp_dir
    self._connect()
  File "/home/myuser/playbooks/plugins/strategy/mitogen/ansible_mitogen/connection.py", line 854, in _connect
    self._connect_stack(stack)
  File "/home/myuser/playbooks/plugins/strategy/mitogen/ansible_mitogen/connection.py", line 801, in _connect_stack
    dct = mitogen.service.call(
  File "/home/myuser/playbooks/plugins/strategy/mitogen/mitogen/service.py", line 126, in call
    return call_context.call_service(service_name, method_name, **kwargs)
  File "/home/myuser/playbooks/plugins/strategy/mitogen/mitogen/core.py", line 2314, in call_service
    return recv.get().unpickle()
  File "/home/myuser/playbooks/plugins/strategy/mitogen/mitogen/core.py", line 1195, in get
    msg._throw_dead()
  File "/home/myuser/playbooks/plugins/strategy/mitogen/mitogen/core.py", line 935, in _throw_dead
    raise ChannelError(self.data.decode('utf-8', 'replace'))
mitogen.core.ChannelError: Broker has exitted
Running with 8 hosts or fewer, or using ansible 2.12.7 or earlier, works fine. Reducing ansible forks or MITOGEN_POOL_SIZE doesn't help.
I narrowed the breaking change in ansible down to https://github.com/ansible/ansible/commit/45185b03e20cb7a113a3ac7238e4a924ac1846a7; reverting this commit fixes the problem.
Any ideas of what could be the incompatibility here?
Facing the same issue. ansible version: 2.13.4, mitogen version: V0.3.4-beta.
Waiting for a fix.
Experiencing the same issue. Commenting out the line as suggested in the comment here seemed to fix it.
With 0af2ce8c30f81adaa254d3d0308a0ed4410a7b65 this close statement was reworked but that didn't fix it.
The error is slightly different, though:
ERROR! [task 411936] 09:13:05.331167 E mitogen: broker crashed
Traceback (most recent call last):
  File "/home/myuser/projects/3rdparty/mitogen/mitogen/core.py", line 3588, in _do_broker_main
    self._loop_once()
  File "/home/myuser/projects/3rdparty/mitogen/mitogen/core.py", line 3543, in _loop_once
    for side, func in self.poller.poll(timeout):
  File "/home/myuser/projects/3rdparty/mitogen/mitogen/core.py", line 2465, in _poll
    (rfds, wfds, _), _ = io_op(select.select,
                         ^^^^^^^^^^^^^^^^^^^^
  File "/home/myuser/projects/3rdparty/mitogen/mitogen/core.py", line 567, in io_op
    return func(*args), None
           ^^^^^^^^^^^
ValueError: filedescriptor out of range in select()
Unfortunately, the only way I'm aware of to mitigate this is to downgrade to ansible 2.12.7.
Found the issue. select() can only handle file descriptor numbers below FD_SETSIZE (1024), so we need to use poll() here, which mitogen already implements.
In https://github.com/mitogen-hq/mitogen/blob/master/ansible_mitogen/process.py#L282 the poller is reset to mitogen.core.Poller (the select()-based poller), which is counterproductive here.
Just replace the poller_class = line with pass (rather than removing the class entirely) and be happy.
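The "filedescriptor out of range in select()" error can be reproduced with the standard library alone. The sketch below is plain Python, not mitogen code, and assumes a Linux/glibc build where FD_SETSIZE is 1024; HIGH_FD is a hypothetical descriptor number that is never actually opened. CPython's select.select() rejects it before making the system call, whereas a select.poll() object takes descriptors as a list and has no such ceiling:

```python
import os
import select

# Hypothetical descriptor number above FD_SETSIZE (1024 on Linux/glibc).
# It is never opened: select.select() validates the number before the syscall.
HIGH_FD = 2000


def select_rejects(fd):
    """Return True if select.select() refuses this descriptor number."""
    try:
        select.select([fd], [], [], 0)
    except ValueError as exc:
        # CPython raises "filedescriptor out of range in select()"
        return "out of range" in str(exc)
    return False


def poll_readable(fds, timeout_ms=0):
    """Return the descriptors that poll(2) reports as readable.

    poll() takes an array of descriptors instead of a fixed-size bitmap,
    so descriptor numbers >= 1024 are not a problem for it.
    """
    poller = select.poll()
    for fd in fds:
        poller.register(fd, select.POLLIN)
    return [fd for fd, events in poller.poll(timeout_ms) if events & select.POLLIN]


# A pipe with one byte written, so its read end is readable.
r, w = os.pipe()
os.write(w, b"x")

print(select_rejects(HIGH_FD))        # True
print(poll_readable([r]) == [r])      # True
```

This is why a process that has accumulated many open descriptors (for example, many host connections) can trip the select()-based poller while a poll()- or epoll()-based one keeps working.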
This workaround did not work for me: when running on 50+ hosts the playbook just "freezes". Reverting to ansible 2.12.7 did not help either for now, so I still have to investigate further.
I'm not sure about all the details here, but ansible-mitogen uses CPU pinning onto the first two CPUs (you can see this when you run ansible with mitogen_linear and a large number of hosts and forks: only the first two CPUs are 100% busy).
The more hosts you run against, the more congested those CPUs become and the more everything slows down. That may explain the "freeze" behavior.
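To check whether a given process is actually pinned, one option is to read its CPU affinity mask. This is a minimal Linux-only sketch using os.sched_getaffinity; which PIDs to inspect (for example the ansible-playbook or mitogen helper processes shown by ps) is up to you, and the helper names are hypothetical:

```python
import os


def allowed_cpus(pid=0):
    """Sorted CPU ids the process may run on (pid 0 means the calling process).

    Linux-only: os.sched_getaffinity is not available on macOS or Windows.
    """
    return sorted(os.sched_getaffinity(pid))


def looks_pinned(pid=0):
    """True if the process is restricted to fewer CPUs than the machine reports."""
    return len(os.sched_getaffinity(pid)) < (os.cpu_count() or 1)


if __name__ == "__main__":
    # Inspect the current process; pass another PID to check a different one.
    print(allowed_cpus(), "pinned" if looks_pinned() else "not pinned")
```

A process reporting only [0, 1] while the machine has more CPUs would be consistent with the pinning described above. Note that containers or cgroups can also restrict affinity, so a restricted mask is not by itself proof that mitogen set it.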
I've worked around that problem by running the deployment in parallel from multiple hosts (GitHub Actions) with --limit, so that each runner runs the playbook for a single host.
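The setup above used separate GitHub Actions runners, one per host. As a rough local approximation, assuming ansible-playbook is on PATH, the same split can be done by launching one ansible-playbook process per host with --limit. The playbook and host names in the usage note are hypothetical. Unlike separate runners, all these processes share one machine's CPUs, so whether this relieves the contention described above is untested here:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def per_host_commands(playbook, hosts, extra_args=()):
    """Build one ansible-playbook command per host, each restricted with --limit."""
    return {
        host: ["ansible-playbook", playbook, "--limit", host, *extra_args]
        for host in hosts
    }


def run_per_host(playbook, hosts, max_parallel=4, extra_args=()):
    """Run the per-host commands concurrently and return {host: exit code}."""
    commands = per_host_commands(playbook, hosts, extra_args)

    def run(host):
        # Each call starts a separate ansible-playbook process for one host.
        return host, subprocess.run(commands[host]).returncode

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return dict(pool.map(run, commands))
```

For example, run_per_host("site.yml", ["web1", "web2", "web3"]) would start up to four ansible-playbook processes at a time and return each host's exit code, so failed hosts can be picked out with a simple filter on non-zero values.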