mitogen
mitogen copied to clipboard
Ansible 6 - latest commit 8cda5f5
-
Which version of Ansible are you running? ansible 6.2.0 ansible-core 2.13.3 Yes, I like living on the edge!
-
Is your version of Ansible patched in any way? No.
-
Are you running with any custom modules, or
module_utils
loaded? I don't think so! -
Have you tried the latest master version from Git? Yes, tried w/ 8cda5f5.
-
Do you have some idea of what the underlying problem may be? No.
-
Mention your host and target OS and versions Host - Fedora 36 Target - CentOS Stream 8
-
Mention your host and target Python versions Host - Python 3.10.6 Target - Python 3.6.8
-
If reporting a performance issue, mention the number of targets and a rough description of your workload (lots of copies, lots of tiny file edits, etc.)
12 hosts:
- 2 RDS instances - connecting w/
ansible_connection: local
- 8 very old MySQL 5.1 instances
- 2 MariaDB Galera Cluster servers (I'm only doing work on one server) On each host I maintain about 120 users and make sure about 60 users do not exist (they left the company). Module used is: community.mysql.mysql_user
- If reporting a crash or hang in Ansible, please rerun with -vvv and include 200 lines of output around the point of the error
[task 147682] 14:41:25.290745 D ansible_mitogen.process: will use multiplexer 0 (/tmp/mitogen_unix_w41ay0kt.sock) to connect to "services-mariadb-cluster-01.***"
[task 147682] 14:41:25.291051 D mitogen.unix: client: connecting to /tmp/mitogen_unix_w41ay0kt.sock
[task 147682] 14:41:25.291492 D mitogen.unix: client: local ID is 22408, remote is 0
[mux 146141] 14:41:25.291518 D mitogen.unix: listener: accepted connection from PID 147682: unix_client.147682
ERROR! [task 147682] 14:41:25.292316 E mitogen: broker crashed
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/mitogen/core.py", line 3588, in _do_broker_main
self._loop_once()
File "/usr/local/lib/python3.10/site-packages/mitogen/core.py", line 3543, in _loop_once
for side, func in self.poller.poll(timeout):
File "/usr/local/lib/python3.10/site-packages/mitogen/core.py", line 2465, in _poll
(rfds, wfds, _), _ = io_op(select.select,
File "/usr/local/lib/python3.10/site-packages/mitogen/core.py", line 567, in io_op
return func(*args), None
ValueError: filedescriptor out of range in select()
[task 147682] 14:41:25.293974 D mitogen: Broker(0f40): force disconnecting <Side of default fd 941>
[task 147682] 14:41:25.294198 D mitogen: Waker(fd=941/1032): disconnecting
[task 147682] 14:41:25.294686 D mitogen: Broker(0f40): force disconnecting <Side of unix_listener.146141 fd 1033>
[task 147682] 14:41:25.294942 D ansible_mitogen.mixins: _remove_tmp_path(None)
[task 147682] 14:41:25.296282 D mitogen: MitogenProtocol(unix_listener.146141): disconnecting
[task 147682] 14:41:25.296672 D mitogen: Router(Broker(0f40)): stats: 0 module requests in 0 ms, 0 sent (0 ms minify time), 0 negative responses. Sent 0.0 kb total, 0.0 kb avg.
[mux 146141] 14:41:25.296708 D mitogen: <Side of unix_client.147682 fd 79>: disconnected during read: [Errno 104] Connection reset by peer
[mux 146141] 14:41:25.296955 D mitogen: <Side of unix_client.147682 fd 79>: empty read, disconnecting
[mux 146141] 14:41:25.297294 D mitogen: MitogenProtocol(unix_client.147682): disconnecting
The full traceback is:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/ansible/executor/task_executor.py", line 118, in run
item_results = self._run_loop(items)
File "/usr/local/lib/python3.10/site-packages/ansible/executor/task_executor.py", line 353, in _run_loop
res = self._execute(variables=task_vars)
File "/usr/local/lib/python3.10/site-packages/ansible/executor/task_executor.py", line 635, in _execute
result = self._handler.run(task_vars=vars_copy)
File "/usr/local/lib/python3.10/site-packages/ansible_mitogen/mixins.py", line 146, in run
return super(ActionModuleMixin, self).run(tmp, task_vars)
File "/usr/local/lib/python3.10/site-packages/ansible/plugins/action/normal.py", line 47, in run
result = merge_hash(result, self._execute_module(task_vars=task_vars, wrap_async=wrap_async))
File "/usr/local/lib/python3.10/site-packages/ansible_mitogen/mixins.py", line 376, in _execute_module
self._set_temp_file_args(module_args, wrap_async)
File "/usr/local/lib/python3.10/site-packages/ansible_mitogen/mixins.py", line 355, in _set_temp_file_args
self._connection.get_good_temp_dir()
File "/usr/local/lib/python3.10/site-packages/ansible_mitogen/connection.py", line 832, in get_good_temp_dir
self._connect()
File "/usr/local/lib/python3.10/site-packages/ansible_mitogen/connection.py", line 854, in _connect
self._connect_stack(stack)
File "/usr/local/lib/python3.10/site-packages/ansible_mitogen/connection.py", line 801, in _connect_stack
dct = mitogen.service.call(
File "/usr/local/lib/python3.10/site-packages/mitogen/service.py", line 126, in call
return call_context.call_service(service_name, method_name, **kwargs)
File "/usr/local/lib/python3.10/site-packages/mitogen/core.py", line 2314, in call_service
return recv.get().unpickle()
File "/usr/local/lib/python3.10/site-packages/mitogen/core.py", line 1195, in get
msg._throw_dead()
File "/usr/local/lib/python3.10/site-packages/mitogen/core.py", line 935, in _throw_dead
raise ChannelError(self.data.decode('utf-8', 'replace'))
mitogen.core.ChannelError: the respondent Context has disconnected
fatal: [services-mariadb-cluster-01.***]: FAILED! => {
"msg": "Unexpected failure during module execution.",
"stdout": ""
}
[task 147683] 14:41:25.310180 D ansible_mitogen.process: will use multiplexer 0 (/tmp/mitogen_unix_w41ay0kt.sock) to connect to "payments-mariadb-cluster-01.***"
[task 147683] 14:41:25.310438 D mitogen.unix: client: connecting to /tmp/mitogen_unix_w41ay0kt.sock
[task 147683] 14:41:25.310824 D mitogen.unix: client: local ID is 22409, remote is 0
[mux 146141] 14:41:25.310849 D mitogen.unix: listener: accepted connection from PID 147683: unix_client.147683
ERROR! [task 147683] 14:41:25.311652 E mitogen: broker crashed
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/mitogen/core.py", line 3588, in _do_broker_main
self._loop_once()
File "/usr/local/lib/python3.10/site-packages/mitogen/core.py", line 3543, in _loop_once
for side, func in self.poller.poll(timeout):
File "/usr/local/lib/python3.10/site-packages/mitogen/core.py", line 2465, in _poll
(rfds, wfds, _), _ = io_op(select.select,
File "/usr/local/lib/python3.10/site-packages/mitogen/core.py", line 567, in io_op
return func(*args), None
ValueError: filedescriptor out of range in select()
[task 147683] 14:41:25.312819 D mitogen: Broker(0f40): force disconnecting <Side of default fd 940>
[task 147683] 14:41:25.312971 D mitogen: Waker(fd=940/1032): disconnecting
[task 147683] 14:41:25.313491 D mitogen: Broker(0f40): force disconnecting <Side of unix_listener.146141 fd 1033>
[task 147683] 14:41:25.313749 D ansible_mitogen.mixins: _remove_tmp_path(None)
[task 147683] 14:41:25.315459 D mitogen: MitogenProtocol(unix_listener.146141): disconnecting
[task 147683] 14:41:25.315938 D mitogen: Router(Broker(0f40)): stats: 0 module requests in 0 ms, 0 sent (0 ms minify time), 0 negative responses. Sent 0.0 kb total, 0.0 kb avg.
[mux 146141] 14:41:25.315946 D mitogen: <Side of unix_client.147683 fd 79>: disconnected during read: [Errno 104] Connection reset by peer
[mux 146141] 14:41:25.316169 D mitogen: <Side of unix_client.147683 fd 79>: empty read, disconnecting
[mux 146141] 14:41:25.316507 D mitogen: MitogenProtocol(unix_client.147683): disconnecting
The full traceback is:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/ansible/executor/task_executor.py", line 118, in run
item_results = self._run_loop(items)
File "/usr/local/lib/python3.10/site-packages/ansible/executor/task_executor.py", line 353, in _run_loop
res = self._execute(variables=task_vars)
File "/usr/local/lib/python3.10/site-packages/ansible/executor/task_executor.py", line 635, in _execute
result = self._handler.run(task_vars=vars_copy)
File "/usr/local/lib/python3.10/site-packages/ansible_mitogen/mixins.py", line 146, in run
return super(ActionModuleMixin, self).run(tmp, task_vars)
File "/usr/local/lib/python3.10/site-packages/ansible/plugins/action/normal.py", line 47, in run
result = merge_hash(result, self._execute_module(task_vars=task_vars, wrap_async=wrap_async))
File "/usr/local/lib/python3.10/site-packages/ansible_mitogen/mixins.py", line 376, in _execute_module
self._set_temp_file_args(module_args, wrap_async)
File "/usr/local/lib/python3.10/site-packages/ansible_mitogen/mixins.py", line 355, in _set_temp_file_args
self._connection.get_good_temp_dir()
File "/usr/local/lib/python3.10/site-packages/ansible_mitogen/connection.py", line 832, in get_good_temp_dir
self._connect()
File "/usr/local/lib/python3.10/site-packages/ansible_mitogen/connection.py", line 854, in _connect
self._connect_stack(stack)
File "/usr/local/lib/python3.10/site-packages/ansible_mitogen/connection.py", line 801, in _connect_stack
dct = mitogen.service.call(
File "/usr/local/lib/python3.10/site-packages/mitogen/service.py", line 126, in call
return call_context.call_service(service_name, method_name, **kwargs)
File "/usr/local/lib/python3.10/site-packages/mitogen/core.py", line 2314, in call_service
return recv.get().unpickle()
File "/usr/local/lib/python3.10/site-packages/mitogen/core.py", line 1195, in get
msg._throw_dead()
File "/usr/local/lib/python3.10/site-packages/mitogen/core.py", line 935, in _throw_dead
raise ChannelError(self.data.decode('utf-8', 'replace'))
mitogen.core.ChannelError: the respondent Context has disconnected
fatal: [payments-mariadb-cluster-01.***]: FAILED! => {
"msg": "Unexpected failure during module execution.",
"stdout": ""
}
- If reporting any kind of problem with Ansible, please include the Ansible version along with output of "ansible-config dump --only-changed".
CACHE_PLUGIN(***/infrastructure/ansible.cfg) = jsonfile
CACHE_PLUGIN_CONNECTION(***/infrastructure/ansible.cfg) = ~/.ansible/cache
CACHE_PLUGIN_TIMEOUT(***/infrastructure/ansible.cfg) = 3600
CALLBACKS_ENABLED(***/infrastructure/ansible.cfg) = ['timer', 'profile_tasks', 'profile_roles']
DEFAULT_ACTION_PLUGIN_PATH(env: ANSIBLE_ACTION_PLUGINS) = ['/usr/local/lib/python3.9/site-packages/ara/plugins/action']
DEFAULT_ASK_PASS(***/infrastructure/ansible.cfg) = False
DEFAULT_CALLBACK_PLUGIN_PATH(env: ANSIBLE_CALLBACK_PLUGINS) = ['/usr/local/lib/python3.9/site-packages/ara/plugins/callback']
DEFAULT_EXECUTABLE(***/infrastructure/ansible.cfg) = /bin/bash
DEFAULT_FORCE_HANDLERS(***/infrastructure/ansible.cfg) = True
DEFAULT_FORKS(***/infrastructure/ansible.cfg) = 15
DEFAULT_GATHERING(***/infrastructure/ansible.cfg) = smart
DEFAULT_GATHER_SUBSET(***/infrastructure/ansible.cfg) = ['all']
DEFAULT_HOST_LIST(***/infrastructure/ansible.cfg) = ['***/infrastructure/envs']
DEFAULT_LOG_PATH(***/infrastructure/ansible.cfg) = ~/.ansible/ansible.log
DEFAULT_LOOKUP_PLUGIN_PATH(env: ANSIBLE_LOOKUP_PLUGINS) = ['/usr/local/lib/python3.9/site-packages/ara/plugins/lookup']
DEFAULT_MANAGED_STR(***/infrastructure/ansible.cfg) = Ansible managed! DON'T CHANGE THIS FILE BY HAND! You were warned!
DEFAULT_ROLES_PATH(***/infrastructure/ansible.cfg) = ['***/infrastructure/roles']
DEFAULT_STRATEGY(env: ANSIBLE_STRATEGY) = mitogen_linear
DEFAULT_STRATEGY_PLUGIN_PATH(env: ANSIBLE_STRATEGY_PLUGINS) = ['/usr/local/lib/python3.10/site-packages/ansible_mitogen']
DEFAULT_TIMEOUT(***/infrastructure/ansible.cfg) = 30
DEPRECATION_WARNINGS(***/infrastructure/ansible.cfg) = True
HOST_KEY_CHECKING(***/infrastructure/ansible.cfg) = False
INVENTORY_ENABLED(***/infrastructure/ansible.cfg) = ['yaml', 'aws_ec2', 'ini']
RETRY_FILES_ENABLED(***/infrastructure/ansible.cfg) = False
SHOW_CUSTOM_STATS(***/infrastructure/ansible.cfg) = True
Would really love this as mitogen
is lowering my playbook time from over 600 seconds to under 120 seconds!
Facing exactly the same issue here.
I did some research and I am not 100% sure but it seems this probably was faced by ansible
and ncclient
as well:
https://github.com/ncclient/ncclient/issues/254
It seems mitogen
is using select
which other libraries migrated away from;
https://github.com/ncclient/ncclient/pull/260
Change from using the current select.select to test for file descriptor (socket/channel) activity to using selectors.DefaultSelector or selectors2.DefaultSelector on Python 2.7. The concrete selctor are the KqueueSelector on macOS and the EpollSelector on linux (Ubuntu 16.0.4) for both Python 2.7 and Python 3.7.
https://github.com/ansible/ansible/pull/20934
Select suffers from many problems including running out of file descriptors with no means to tell ahead of time. Use a backport (bundled if necessary) version of the selectors module from python3.5 in order to use EPoll, Kqueue, or other, more modern OS feature for polling if it's available.
I did some more research and also tried the mitogen_task_isolation: fork
option, but that does not change anything.
Got into this error again.. Turns out right before the process goes dead, there are a lot of open file descriptors for facts that are removed in the mean time: (NOTE: I did this just before the process quite, it already has 985 file desciptors open at that point)
ansible-p 145367 jorrick 976u REG 253,0 263358 3673368 ansible/.facts/tmpxms6crx6 (deleted)
ansible-p 145367 jorrick 977u REG 253,0 287974 3673369 ansible/.facts/tmp1pua4fp7 (deleted)
ansible-p 145367 jorrick 978u REG 253,0 287974 3673372 ansible/.facts/tmplsyw30qe (deleted)
ansible-p 145367 jorrick 979u REG 253,0 263358 3673374 ansible/.facts/tmphsfrxo4v (deleted)
ansible-p 145367 jorrick 980u REG 253,0 287974 3673375 ansible/.facts/tmpafdivups (deleted)
ansible-p 145367 jorrick 981u REG 253,0 275365 3673376 ansible/.facts/tmpx9vxqcb4 (deleted)
ansible-p 145367 jorrick 983u REG 253,0 275365 3673382 ansible/.facts/tmpnaai0c_k (deleted)
ansible-p 145367 jorrick 984u REG 253,0 263358 3673380 ansible/.facts/tmpome7lo_z (deleted)
ansible-p 145367 jorrick 985u REG 253,0 263358 3673377 ansible/.facts/tmpv6pqxrp2 (deleted)
As expected, this indeed hits the 1024 limit that I mentioned above due to the way select
is implemented in this package.
Turns out, we had the following caching configuration is ansible.cfg
:
fact_caching = jsonfile
fact_caching_connection = ./.facts
I changed this to the following, after which it worked fine.
fact_caching = memory
;fact_caching_connection = ./.facts
I will check in the jsonfile
whether there is some way we can tune the open file descriptors.
Notes to self
-
select
: stdlib module,-
select.select()
: near universally available on POSIX-likes,O(max(fds))
, cannot use anyfd > FD_SETSIZE
a compile time C #define usually == 1024 -
select.poll()
: broadly available,O(len(fds))
-
select.epoll()
: Linux only, Python 2.6+,O(1)
IIF using edge triggered, otherwiseO(len(fds))
-
select.kqueue()
: BSD (and derivates) only, Python 2.6+
-
-
selectors
: stdlib module, Python 3.4+, high level abstractions overselect.*
functions-
selectors._can_use()
check for determiningselectors.DefaultSelector
is more thorough than Mitogens...SUPPORTED
checks based ongetattr()
-
-
selectors2
: backport ofselectors
, Python 2.6+, archived, last release 2020 -
Mitogen Poller() variants
-
mitogen.core.Poller()
: Usesselect.select()
-
mitogen.parent.PollPoller()
: Usesselect.poll()
-
mitogen.parent.KqueuePoller()
: Usesselect.kqueue()
-
mitogen.parent.EpollPoller()
: Usesselect.epoll()
-
mitogen.*.*Poller.SUPPORTED
: class attribute, true if OS/Python version supports it. -
mitogen.parent.PREFERRED_POLLER
: best availablePoller()
by hard coded heuristic, measured at import time.
-
-
Mitogen
poller_class
uses-
mitogen.core.Broker.poller_class
: defaultmitogen.core.Poller()
, not replaced -
mitogen.core.Latch.poller_class
: defaultmitogen.core.Poller()
replaced during import ofmitogen.parent
-
mitogen.master.Broker.poller_class
: defaultmitogen.parent.PREFERRED_POLLER
-
ansible_mitogen.process.Broker.poller_class
: defaultmitogen.core.Poller()
-
mitogen.parent._upgrade_broker()
usesPREFERRED_POLLER
directly`
-
-
mitogen.select.Select()
: conceptually similar toselect.select()
et al, but acts over Mitogen concepts not file descriptors. Uses amitogen.core.Latch
internally, hence one of thePoller()
implementations.
Probably duplicate of/duplicated by #967. Two PRs attempting to fix these: #984, #991.
I'm targetting this problem for Mitogen 0.3.7. I would like to check some things, to prevent it re-occuring and improve my understanding
- @eRadical @lberruti @philfry do you have a Minimal Reproducing Example, or a playbook that could be a starting point? I would like to have a regression test along with any fix.
- @lberruti your PR #984 uses
mitogen.parent.PollPoller
as the replacement, withmitogen.core.Poller
as fallback. Did you have reasons to chose this one? As opposed tomitogen.parent.PREFERRED_POLLER
. - @philfry your PR #991 removes any override, effectively using
mitogen.parent.PREFERRED_POLLER
. Did you pick that for specific reasons?
Sorry for the long wait. Thank you each for investigating in the mean time.
1. @eRadical @lberruti @philfry do you have a Minimal Reproducing Example, or a playbook that could be a starting point? I would like to have a regression test along with any fix.
Unfortunately not. This happened on larger playbooks on many machines being deployed at once.
3. @philfry your PR [process/Broker: do not reset poller_class. Fixes #967 #991](https://github.com/mitogen-hq/mitogen/pull/991) removes any override, effectively using `mitogen.parent.PREFERRED_POLLER`. Did you pick that for specific reasons?
This was because a limit of 1024 fds, see https://github.com/mitogen-hq/mitogen/issues/967#issuecomment-1482640922
All I can say is that this workaround worked for my collegues and me since that PR with all ansible versions from 2.12.8 to 2.16.4 :smiley:
1. @eRadical @lberruti @philfry do you have a Minimal Reproducing Example, or a playbook that could be a starting point? I would like to have a regression test along with any fix.
unfortunately not, sorry
2. @lberruti your PR [Fix: filedescriptor out of range in select() #984](https://github.com/mitogen-hq/mitogen/pull/984) uses `mitogen.parent.PollPoller` as the replacement, with `mitogen.core.Poller` as fallback. Did you have reasons to chose this one? As opposed to `mitogen.parent.PREFERRED_POLLER`.
You're right: I think it's better to use mitogen.parent.PREFERRED_POLLER as the replacement
Was able reproduce this with 40 hosts with ansible-core 2.15.10/ansible 8.7.0 and ansible-core 2.16.6/ansible 9.4.0
TASK [common : Discard ansible cache] *******************************************************************************************************************************************
ok: [adp-srv01.arakis.hk]
ok: [u.arakis.hk]
ok: [h.arakis.hk]
ERROR! [task 27755] 11:40:31.374736 E mitogen: broker crashed
Traceback (most recent call last):
File "/home/domas/.local/lib64/python3.12/site-packages/mitogen/core.py", line 3758, in _do_broker_main
self._loop_once()
File "/home/domas/.local/lib64/python3.12/site-packages/mitogen/core.py", line 3713, in _loop_once
for side, func in self.poller.poll(timeout):
File "/home/domas/.local/lib64/python3.12/site-packages/mitogen/core.py", line 2635, in _poll
(rfds, wfds, _), _ = io_op(select.select,
^^^^^^^^^^^^^^^^^^^^
File "/home/domas/.local/lib64/python3.12/site-packages/mitogen/core.py", line 599, in io_op
return func(*args), None
^^^^^^^^^^^
ValueError: filedescriptor out of range in select()
ok: [adv-dev-shared-srv01.arakis.hk]
ERROR! [task 27759] 11:40:31.726027 E mitogen: broker crashed
Traceback (most recent call last):
File "/home/domas/.local/lib64/python3.12/site-packages/mitogen/core.py", line 3758, in _do_broker_main
self._loop_once()
File "/home/domas/.local/lib64/python3.12/site-packages/mitogen/core.py", line 3713, in _loop_once
for side, func in self.poller.poll(timeout):
File "/home/domas/.local/lib64/python3.12/site-packages/mitogen/core.py", line 2635, in _poll
(rfds, wfds, _), _ = io_op(select.select,
^^^^^^^^^^^^^^^^^^^^
File "/home/domas/.local/lib64/python3.12/site-packages/mitogen/core.py", line 599, in io_op
return func(*args), None
^^^^^^^^^^^
This patch works with ansible 2.15.10 and 2.16.5 https://github.com/philfry/mitogen/commit/a1af885244eb685ca43e30e7cbe76e0b846e2c0b
Just to add that I locally applied https://github.com/mitogen-hq/mitogen/pull/984/files and so far so good on my staging/test instance
https://github.com/mitogen-hq/mitogen/blob/3f7a0f74a5d4102fd4d6560a6e12fa445eb1aff3/mitogen/parent.py#L902-L906
Digging through old tickets and commits PollPoller is slightly better for "small FD counts" (I don't know how many counts as small), and EpollPoller or KqueuePoller are better at large FD counts (again, I don't have a number).
On Mitogen 0.2.x and Python 2.4/2.5 PREFERRED_POLLER can only be PollPoller or Poller. Python 2.7 shipped with older OSX (10.4) also ended up PREFERRED_POLLER==Poller.
On Mitogen 0.3.x (minimum Python 2.7) in 2024 PREFERRED_POLLER will probably always be EpollPoller or KqueuePoller.
PREFERRED_POLLER is slight misnomer. It's more like most feature-full poller available.
Report of a Python 3.8 on mac OS (version unspecified) lacking select.poll()
https://github.com/ClearcodeHQ/mirakuru/issues/373
Works for me! Thank you for your work @moreati!