atexit wrapper orphans temp files with kubernetes.core.k8s ansible module
Environment: Rocky Linux 9.6
- python3-3.9.21-2.el9_6.2.x86_64
- python3-mitogen-0.3.22-1.noarch
- ansible-core-2.14.18-1.el9.x86_64
- python3-kubernetes-26.1.0-3.el9.noarch
The Python kubernetes client library creates temp files to hold certificates used to communicate with the Kubernetes API server, and registers an atexit handler, kubernetes.config.kube_config._cleanup_temp_files, to remove those files on exit. It appears that the Mitogen AtExitWrapper is masking that cleanup, so these files accumulate over time when kubernetes.core.k8s tasks are run. Setting mitogen_task_isolation: fork does not resolve the issue, but switching to strategy: linear does clean up the files properly.
Patching _atexit__register in ansible_mitogen/runner.py with if func == shutil.rmtree or func == kubernetes.config.kube_config._cleanup_temp_files: appears to clean up the files (a sketch of the change follows), but this feels rather more fragile than is reasonable.
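For concreteness, the shape of that patch (the surrounding AtExitWrapper code is paraphrased from memory and the attribute names are illustrative; only the extended condition comes from this report):

import shutil
import kubernetes.config.kube_config  # fragile: kubernetes must be importable wherever runner.py runs

class AtExitWrapper(object):
    # ... constructor saves the real atexit.register in self.original and
    # collects deferred handlers in self.deferred ...

    def _atexit__register(self, func, *args, **kwargs):
        # Besides shutil.rmtree, also defer the kubernetes cleanup handler so
        # run_callbacks() fires it after the module finishes, instead of
        # leaving it registered with the real atexit in a long-lived process.
        if func == shutil.rmtree or \
                func == kubernetes.config.kube_config._cleanup_temp_files:
            self.deferred.append((func, args, kwargs))
            return
        self.original['register'](func, *args, **kwargs)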
Reproduction:
# code: language=ansible
---
- name: Testing
  hosts: localhost
  tasks:
    - name: Get coredns deployment
      kubernetes.core.k8s_info:
        api_version: apps/v1
        kind: Deployment
        name: coredns
        namespace: kube-system
[dmurnane@localhost ~]$ ls /tmp/tmp*
ls: cannot access '/tmp/tmp*': No such file or directory
[dmurnane@localhost ~]$ ansible-playbook k8s-test.yml
PLAY [Testing] ****************************************************************************************************************************************************************************************************
TASK [Gathering Facts] ********************************************************************************************************************************************************************************************
ok: [localhost]
TASK [Get coredns deployment] *************************************************************************************************************************************************************************************
ok: [localhost]
PLAY RECAP ********************************************************************************************************************************************************************************************************
localhost : ok=2 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
[dmurnane@localhost ~]$ ls /tmp/tmp*
/tmp/tmpte9emays /tmp/tmputdhak8k /tmp/tmpwtgoybri
Update: retested with mitogen 0.3.31 and observed the same behavior.
@dmurnane
- Was there a time that these files got cleaned up? I.e. has this only started occurring recently?
- What version of the Kubernetes collection are you running?
Questions/notes to self
- Do temp files get created by controller (perhaps in action), or by target (module)? Both are on localhost here.
- Is tests/ansible/integration/runner/atexit.yml running?
- Design opportunity? How might Ansible/Mitogen enable this cleanup without assuming atexit is the mechanism?
- Semantics difference?
  - atexit.register(): "After all exit handlers have had a chance to run, the last exception to be raised is re-raised." https://docs.python.org/3/library/atexit.html#atexit.register
  - ansible_mitogen.runner.AtExitWrapper.run_callbacks() appears to log then swallow all exceptions.
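For comparison, a minimal standalone illustration of the stock atexit behaviour quoted above (standard library only, nothing Mitogen-specific):

import atexit

def failing_cleanup():
    raise RuntimeError('cleanup failed')

def quiet_cleanup():
    print('quiet cleanup ran')

atexit.register(failing_cleanup)
atexit.register(quiet_cleanup)

# At normal interpreter exit the handlers run in reverse registration order:
# quiet_cleanup prints, failing_cleanup's traceback is printed, and per the
# documentation quoted above the last exception raised is re-raised after all
# handlers have run, so the failure is not silently swallowed.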
- I think this has been happening for as long as we've been using the kubernetes module, but there were some other environment changes around that time which made it a little hard to pin down once we noticed it.
- The kubernetes.core collection is 4.0.0, since we're on ansible-core 2.14 and 5.0.0+ requires 2.15 or newer; I have not actually tried a newer version. kubernetes.core.k8s and kubernetes.core.k8s_info show the same behavior; the helm* modules do not, because they don't actually use the Python kubernetes library.
Flow in the module appears to be:
- k8s_info module calls get_api_client https://github.com/ansible-collections/kubernetes.core/blob/main/plugins/modules/k8s_info.py#L222
- get_api_client calls kubernetes.config.load_kube_config https://github.com/ansible-collections/kubernetes.core/blob/main/plugins/module_utils/k8s/client.py#L118
- load_kube_config saves the cert objects in temp files and registers an atexit handler (sketched below) https://github.com/kubernetes-client/python/blob/master/kubernetes/base/config/kube_config.py#L72
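For reference, the temp file handling in the linked kube_config.py looks roughly like this (abridged and paraphrased here, not copied verbatim):

import atexit
import os
import tempfile

_temp_files = {}

def _cleanup_temp_files():
    global _temp_files
    for path in _temp_files.values():
        try:
            os.remove(path)
        except OSError:
            pass
    _temp_files = {}

def _create_temp_file_with_content(content, temp_file_path=None):
    # The cleanup handler is registered with atexit the first time a temp
    # file is created; nothing else ever removes these files.
    if len(_temp_files) == 0:
        atexit.register(_cleanup_temp_files)
    if str(content) in _temp_files:  # reuse files for repeated content
        return _temp_files[str(content)]
    _, name = tempfile.mkstemp(dir=temp_file_path)
    _temp_files[str(content)] = name
    with open(name, 'wb') as fd:
        fd.write(content.encode() if isinstance(content, str) else content)
    return name

The key point is that removal relies entirely on that atexit handler firing in whatever interpreter created the files.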
I ran some tests with multiple hosts, results below:
Run with multiple hosts:
---
- name: Testing
  hosts: all
  tasks:
    - name: Get coredns deployment
      kubernetes.core.k8s_info:
        api_version: apps/v1
        kind: Deployment
        name: coredns
        namespace: kube-system
[dmurnane@host1 ~]$ ansible-playbook test.yaml
PLAY [Testing] ***********************************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] ***************************************************************************************************************************************************************************************************************************
ok: [host1.local]
ok: [host2.local]
TASK [Get coredns deployment] ********************************************************************************************************************************************************************************************************************
ok: [host1.local]
ok: [host2.local]
PLAY RECAP ***************************************************************************************************************************************************************************************************************************************
host1.local : ok=2 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
host2.local : ok=2 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
[dmurnane@host1 ~]$ ansible all -m shell -a 'ls /tmp/tmp*'
host1.local | CHANGED | rc=0 >>
/tmp/tmphlqn0ofy
/tmp/tmpofeqcrjv
/tmp/tmps2qk05dh
host2.local | CHANGED | rc=0 >>
/tmp/tmp_dtmd9vn
/tmp/tmpi1_wz3g6
/tmp/tmpndcjjo96
Multiple hosts, delegated to the remote node:
---
- name: Testing
  hosts: all
  tasks:
    - name: Get coredns deployment
      kubernetes.core.k8s_info:
        api_version: apps/v1
        kind: Deployment
        name: coredns
        namespace: kube-system
      delegate_to: "{{ groups['all'][1] }}"
[dmurnane@host1 ~]$ ansible-playbook test.yaml
PLAY [Testing] ***********************************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] ***************************************************************************************************************************************************************************************************************************
ok: [host1.local]
ok: [host2.local]
TASK [Get coredns deployment] ********************************************************************************************************************************************************************************************************************
ok: [host1.local -> host2.local]
ok: [host2.local]
[WARNING]: Removed restricted key from module data: ansible_facts
PLAY RECAP ***************************************************************************************************************************************************************************************************************************************
host1.local : ok=2 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
host2.local : ok=2 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
[dmurnane@host1 ~]$ ansible all -m shell -a 'ls /tmp/tmp*'
host2.local | CHANGED | rc=0 >>
/tmp/tmpjj9_z9ch
/tmp/tmpxl7p64dq
/tmp/tmpxt09hg5b
host1.local | FAILED | rc=2 >>
ls: cannot access '/tmp/tmp*': No such file or directorynon-zero return code
Multiple hosts, delegated to the remote node, run_once:
---
- name: Testing
  hosts: all
  tasks:
    - name: Get coredns deployment
      kubernetes.core.k8s_info:
        api_version: apps/v1
        kind: Deployment
        name: coredns
        namespace: kube-system
      delegate_to: "{{ groups['all'][1] }}"
      run_once: true
[dmurnane@host1 ~]$ ansible-playbook test.yaml
PLAY [Testing] ***********************************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] ***************************************************************************************************************************************************************************************************************************
ok: [host1.local]
ok: [host2.local]
TASK [Get coredns deployment] ********************************************************************************************************************************************************************************************************************
ok: [host1.local -> host2.local]
[WARNING]: Removed restricted key from module data: ansible_facts
PLAY RECAP ***************************************************************************************************************************************************************************************************************************************
host1.local : ok=2 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
host2.local : ok=1 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
[dmurnane@host1 ~]$ ansible all -m shell -a 'ls /tmp/tmp*'
host2.local | CHANGED | rc=0 >>
/tmp/tmp3io_zarw
/tmp/tmpmyvvqnj3
/tmp/tmppyz9iruo
host1.local | FAILED | rc=2 >>
ls: cannot access '/tmp/tmp*': No such file or directorynon-zero return code
- Design opportunity? How might Ansible/Mitogen enable this cleanup without assuming atexit is the mechanism?
Ansible may already have this in Module.add_cleanup_file() and Module.cleanup().
Available since Ansible 1.7 https://github.com/ansible/ansible/commit/df877f2e79f3b5ddceb84dea6ee0dcd881e7c830
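A hypothetical illustration of that mechanism (not existing kubernetes.core code; the module and its parameters are made up, but add_cleanup_file() is the Ansible API referenced above, and files registered with it are expected to be removed on the module's own exit path rather than at interpreter shutdown):

#!/usr/bin/python
# Hypothetical module: write a cert to a temp file and hand the path to
# Ansible's cleanup list instead of relying on an atexit handler ever firing.
import tempfile
from ansible.module_utils.basic import AnsibleModule

def main():
    module = AnsibleModule(argument_spec=dict(
        cert_content=dict(type='str', required=True, no_log=True),
    ))
    fd, path = tempfile.mkstemp()
    with open(fd, 'w') as fh:
        fh.write(module.params['cert_content'])
    # Removed when the module exits via exit_json()/fail_json(), with no
    # dependency on the interpreter's atexit machinery.
    module.add_cleanup_file(path)
    module.exit_json(changed=False, cert_file=path)

if __name__ == '__main__':
    main()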
Patching _atexit__register in ansible_mitogen/runner.py with if func == shutil.rmtree or func == kubernetes.config.kube_config._cleanup_temp_files: appears to clean up the files but this feels rather more fragile than is reasonable.
kubernetes.config.kube_config._cleanup_temp_files is in the kubernetes-client Python library, not the kubernetes.core Ansible collection.
The atexit.register() is a few lines down, in kubernetes.config.kube_config._create_temp_file_with_content().
The problem has some design/constraint parallels with #1255, #1333 and #1342.
- Should there be an allow list? A deny list?
- Should the list(s) be configurable? By API? By ...? (strawman sketch below)
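One strawman for such a list, to avoid hard-coding imports of third-party libraries in the runner (everything here is hypothetical, including the names): match deferred handlers by qualified name and source the set from configuration.

import shutil

# Hypothetical: qualified names of atexit handlers to defer and run via
# run_callbacks() after each module; in practice this set would come from
# Ansible/Mitogen configuration rather than being hard-coded.
DEFERRED_ATEXIT_HANDLERS = {
    'shutil.rmtree',
    'kubernetes.config.kube_config._cleanup_temp_files',
}

def _qualified_name(func):
    return '%s.%s' % (getattr(func, '__module__', '?'),
                      getattr(func, '__qualname__', repr(func)))

def should_defer(func):
    # Comparing names avoids importing kubernetes (or anything else) inside
    # ansible_mitogen/runner.py just to compare function objects.
    return func is shutil.rmtree or _qualified_name(func) in DEFERRED_ATEXIT_HANDLERS

This sidesteps the import fragility, though it still requires knowing the handler names in advance.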
Speculating:
- Does Mitogen need to persist the process that the Ansible module gets run in? Why?
- Could Mitogen run the module in a per-task subprocess, spawned by a Mitogen child on the target that persists across tasks?
Re: the "Multiple hosts, delegated to the remote node, run_once" results above:
There may be a more fundamental bug here. The shell invocation to look for lingering files isn't done as part of the playbook run; it's an entirely new invocation of ansible. Any processes from the playbook tasks should be long gone and their atexit handlers should have run, regardless of Mitogen's different process lifetimes on the target. The files are still there, so either the atexit handlers weren't run at all, even at the end of the playbook, or they failed during execution.
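One way to test that directly would be a throwaway probe module (hypothetical, not from this report): register a trivial atexit handler while running under the Mitogen strategy, then check whether the marker file ever appears on the target.

#!/usr/bin/python
# Hypothetical probe module "atexit_probe": mimics kube_config by registering
# a plain atexit handler, then reports where the marker file would be written.
import atexit
import os
import tempfile
from ansible.module_utils.basic import AnsibleModule

MARKER = os.path.join(tempfile.gettempdir(), 'atexit_probe_ran')

def _write_marker():
    with open(MARKER, 'w') as fh:
        fh.write('atexit handler ran\n')

def main():
    module = AnsibleModule(argument_spec={})
    atexit.register(_write_marker)
    module.exit_json(changed=False, marker=MARKER)

if __name__ == '__main__':
    main()

Under strategy: linear the marker should appear as soon as the task finishes; if it never appears under Mitogen, even after the playbook ends and the persistent target processes exit, that would confirm the pass-through atexit handlers are not being run at interpreter shutdown.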