mitogen 0.3.3 + ansible 2.12.8+: Broker has exitted

Open philfry opened this issue 3 years ago • 6 comments

Hi,

I'm experiencing a strange issue when using ansible 2.12.8 and later with mitogen 0.3.3. When I run my (quite long-running) playbook against more than 8 hosts, mitogen exits on seemingly random tasks (hostname, systemd, service, template, make, …), but on all hosts at the same time, with:

Traceback (most recent call last):
  File "/home/myuser/tmp/ansible/lib/ansible/executor/task_executor.py", line 158, in run
    res = self._execute()
  File "/home/myuser/tmp/ansible/lib/ansible/executor/task_executor.py", line 605, in _execute
    result = self._handler.run(task_vars=variables)
  File "/home/myuser/playbooks/plugins/strategy/mitogen/ansible_mitogen/mixins.py", line 146, in run
    return super(ActionModuleMixin, self).run(tmp, task_vars)
  File "/home/myuser/tmp/ansible/lib/ansible/plugins/action/normal.py", line 47, in run
    result = merge_hash(result, self._execute_module(task_vars=task_vars, wrap_async=wrap_async))
  File "/home/myuser/playbooks/plugins/strategy/mitogen/ansible_mitogen/mixins.py", line 376, in _execute_module
    self._set_temp_file_args(module_args, wrap_async)
  File "/home/myuser/playbooks/plugins/strategy/mitogen/ansible_mitogen/mixins.py", line 355, in _set_temp_file_args
    self._connection.get_good_temp_dir()
  File "/home/myuser/playbooks/plugins/strategy/mitogen/ansible_mitogen/connection.py", line 832, in get_good_temp_dir
    self._connect()
  File "/home/myuser/playbooks/plugins/strategy/mitogen/ansible_mitogen/connection.py", line 854, in _connect
    self._connect_stack(stack)
  File "/home/myuser/playbooks/plugins/strategy/mitogen/ansible_mitogen/connection.py", line 801, in _connect_stack
    dct = mitogen.service.call(
  File "/home/myuser/playbooks/plugins/strategy/mitogen/mitogen/service.py", line 126, in call
    return call_context.call_service(service_name, method_name, **kwargs)
  File "/home/myuser/playbooks/plugins/strategy/mitogen/mitogen/core.py", line 2314, in call_service
    return recv.get().unpickle()
  File "/home/myuser/playbooks/plugins/strategy/mitogen/mitogen/core.py", line 1195, in get
    msg._throw_dead()
  File "/home/myuser/playbooks/plugins/strategy/mitogen/mitogen/core.py", line 935, in _throw_dead
    raise ChannelError(self.data.decode('utf-8', 'replace'))
mitogen.core.ChannelError: Broker has exitted

Running with 8 hosts or fewer, or using ansible 2.12.7 and below, works fine. Reducing ansible forks or MITOGEN_POOL_SIZE doesn't help.

I narrowed down the change in ansible that broke the playbook execution to https://github.com/ansible/ansible/commit/45185b03e20cb7a113a3ac7238e4a924ac1846a7; reverting this commit fixes the problem.

Any ideas of what could be the incompatibility here?

philfry avatar Sep 27 '22 08:09 philfry

Facing the same issue. Ansible version: 2.13.4, mitogen version: V0.3.4-beta.

Waiting for a fix.

matrixkloud avatar Sep 28 '22 17:09 matrixkloud

Experiencing the same issue. Commenting out the line as per the comment here seemed to fix it.

ryan-u410 avatar Oct 29 '22 01:10 ryan-u410

With 0af2ce8c30f81adaa254d3d0308a0ed4410a7b65 this close statement was reworked, but that didn't fix it.

The error is slightly different, though:

ERROR! [task 411936] 09:13:05.331167 E mitogen: broker crashed
Traceback (most recent call last):
  File "/home/myuser/projects/3rdparty/mitogen/mitogen/core.py", line 3588, in _do_broker_main
    self._loop_once()
  File "/home/myuser/projects/3rdparty/mitogen/mitogen/core.py", line 3543, in _loop_once
    for side, func in self.poller.poll(timeout):
  File "/home/myuser/projects/3rdparty/mitogen/mitogen/core.py", line 2465, in _poll
    (rfds, wfds, _), _ = io_op(select.select,
                         ^^^^^^^^^^^^^^^^^^^^
  File "/home/myuser/projects/3rdparty/mitogen/mitogen/core.py", line 567, in io_op
    return func(*args), None
           ^^^^^^^^^^^
ValueError: filedescriptor out of range in select()

Unfortunately, the only way I'm aware of to mitigate this is to downgrade to ansible 2.12.7.

philfry avatar Feb 27 '23 08:02 philfry

Found the issue. select() only handles file descriptor numbers below 1024, so we need to use poll() here, which is already implemented. In https://github.com/mitogen-hq/mitogen/blob/master/ansible_mitogen/process.py#L282 the poller is reset to mitogen.core.Poller, which is counterproductive here. Just ~remove the class~ replace the poller_class = line with pass and be happy.
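For anyone who wants to see the select() limit in isolation, here is a minimal standalone sketch. It does not use mitogen at all. The function name compare_select_and_poll is made up for illustration, and the sketch assumes a Unix system where the process may raise its open-file soft limit slightly. It duplicates a pipe onto a descriptor numbered 1024 or higher, then asks both select() and poll() about it. On Linux, select() is expected to fail with the same ValueError as in the traceback above, while poll() reports the descriptor as readable.

```python
import fcntl
import os
import resource
import select


def compare_select_and_poll(min_fd=1024):
    """Check how select() and poll() react to a descriptor numbered >= min_fd.

    Returns None when the process is not allowed to open such a descriptor.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    wanted = min_fd + 64
    raised = False
    if soft != resource.RLIM_INFINITY and soft < wanted:
        try:
            # Temporarily raise the soft limit so a high descriptor can exist.
            resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
            raised = True
        except (ValueError, OSError):
            return None

    read_fd, write_fd = os.pipe()
    high_fd = None
    try:
        # Duplicate the read end onto the lowest free descriptor >= min_fd.
        high_fd = fcntl.fcntl(read_fd, fcntl.F_DUPFD, min_fd)
        os.write(write_fd, b"x")  # make the pipe readable

        try:
            select.select([high_fd], [], [], 0)
            select_error = None
        except ValueError as exc:
            select_error = str(exc)

        poller = select.poll()
        poller.register(high_fd, select.POLLIN)
        poll_ready = any(
            fd == high_fd and events & select.POLLIN
            for fd, events in poller.poll(0)
        )
    except OSError:
        return None
    finally:
        for fd in (read_fd, write_fd, high_fd):
            if fd is not None:
                os.close(fd)
        if raised:
            # Put the original soft limit back.
            resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))

    return {"fd": high_fd, "select_error": select_error, "poll_ready": poll_ready}


if __name__ == "__main__":
    print(compare_select_and_poll())
```

This only demonstrates the operating-system behaviour; whether a given mitogen broker actually ends up with descriptors that high depends on how many connections it holds open.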

philfry avatar Mar 24 '23 11:03 philfry

This workaround did not work for me: when running on 50+ hosts the playbook just freezes. Reverting to ansible 2.12.7 did not help either, so for now I have to keep investigating.

jbg-sc avatar Jul 19 '23 09:07 jbg-sc

I'm not sure about all the details here, but ansible-mitogen pins its processes to the first two CPUs (you can see this when you run ansible with mitogen_linear and a large number of hosts and forks: only the first two CPUs are 100% busy).

The more hosts you run against, the more congested those CPUs become, and everything slows down. That may explain the 'freeze' behavior.

I've solved that problem by running the deployment in parallel from multiple hosts (GitHub Actions) with --limit, where each runner runs the playbook for a single host.
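A rough sketch of how the per-host split could be generated, in case it helps. The helper name per_host_commands, the playbook name deploy.yml and the host names are hypothetical; only the --limit option of ansible-playbook is taken from the setup described above. Each resulting command would be handed to a separate runner.

```python
def per_host_commands(hosts, playbook="deploy.yml"):
    """Build one ansible-playbook command line per host, each restricted with --limit."""
    return [["ansible-playbook", playbook, "--limit", host] for host in hosts]


if __name__ == "__main__":
    # Hypothetical inventory host names; print the command each runner would execute.
    for command in per_host_commands(["web1", "web2", "db1"]):
        print(" ".join(command))
```

Because every runner starts its own ansible process for a single host, no single controller has to hold connections for all hosts at once.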

amarao avatar Jul 20 '23 09:07 amarao