
atexit wrapper orphans temp files with kubernetes.core.k8s ansible module

Open dmurnane opened this issue 1 month ago • 7 comments

Environment:

  • Rocky Linux 9.6
  • python3-3.9.21-2.el9_6.2.x86_64
  • python3-mitogen-0.3.22-1.noarch
  • ansible-core-2.14.18-1.el9.x86_64
  • python3-kubernetes-26.1.0-3.el9.noarch

The Python kubernetes module creates temp files to hold certificates used to communicate with the Kubernetes API server. It registers an atexit handler for kubernetes.config.kube_config._cleanup_temp_files to remove those files on exit. It appears that the Mitogen AtExitWrapper is masking that cleanup, so these files accumulate over time when kubernetes.core.k8s tasks are run. Setting mitogen_task_isolation: fork does not resolve the issue, but switching to strategy: linear does clean up the files properly.

Patching _atexit__register in ansible_mitogen/runner.py with if func == shutil.rmtree or func == kubernetes.config.kube_config._cleanup_temp_files: appears to clean up the files but this feels rather more fragile than is reasonable.
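
For reference, the shape of that workaround is roughly the following. This is only a sketch: the comparison line mirrors the patch described above, but the class structure, the deferred list, and the run_callbacks() body are assumptions about AtExitWrapper's internals rather than code copied from runner.py, and a real patch would also need to guard the kubernetes import since the library may be absent on some targets.

import shutil

try:
    from kubernetes.config import kube_config
except ImportError:  # kubernetes may not be installed everywhere
    kube_config = None


class AtExitWrapperSketch(object):
    # Hypothetical stand-in for ansible_mitogen.runner.AtExitWrapper; only
    # the comparison in _atexit__register reflects the patch described above.

    def __init__(self):
        self.deferred = []

    def _atexit__register(self, func, *targs, **kwargs):
        # Workaround: allow the kubernetes client's temp-file cleanup through
        # alongside shutil.rmtree, so it gets run when callbacks are flushed.
        if func == shutil.rmtree or (
                kube_config is not None
                and func == kube_config._cleanup_temp_files):
            self.deferred.append((func, targs, kwargs))

    def run_callbacks(self):
        # Reportedly the real wrapper logs then swallows exceptions here.
        while self.deferred:
            func, targs, kwargs = self.deferred.pop()
            try:
                func(*targs, **kwargs)
            except Exception:
                pass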

Reproduction:

# code: language=ansible
---
- name: Testing
  hosts: localhost
  tasks:
    - name: Get coredns deployment
      kubernetes.core.k8s_info:
        api_version: apps/v1
        kind: Deployment
        name: coredns
        namespace: kube-system
[dmurnane@localhost ~]$ ls /tmp/tmp*
ls: cannot access '/tmp/tmp*': No such file or directory
[dmurnane@localhost ~]$ ansible-playbook k8s-test.yml

PLAY [Testing] ****************************************************************************************************************************************************************************************************

TASK [Gathering Facts] ********************************************************************************************************************************************************************************************
ok: [localhost]

TASK [Get coredns deployment] *************************************************************************************************************************************************************************************
ok: [localhost]

PLAY RECAP ********************************************************************************************************************************************************************************************************
localhost                  : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
[dmurnane@localhost ~]$ ls /tmp/tmp*
/tmp/tmpte9emays  /tmp/tmputdhak8k  /tmp/tmpwtgoybri

Update: retested with mitogen 0.3.31 and observed the same behavior.

dmurnane commented Nov 13 '25 17:11

@dmurnane

  1. Was there a time that these files got cleaned up? I.e. has this only started occurring recently?
  2. What version of the Kubernetes collection are you running?

Questions/notes to self

  • Do the temp files get created by the controller (perhaps in the action plugin), or by the target (the module)? Both are on localhost here

  • Is tests/ansible/integration/runner/atexit.yml running?

  • Design opportunity? How might Ansible/Mitogen enable this cleanup without assuming atexit is the mechanism?

  • Semantics difference?

    After all exit handlers have had a chance to run, the last exception to be raised is re-raised. https://docs.python.org/3/library/atexit.html#atexit.register

    vs ansible_mitogen.runner.AtExitWrapper.run_callbacks(), which appears to log then swallow all exceptions (illustrated below)
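
A minimal illustration of that difference (not Mitogen code; run_callbacks here is a stand-in mimicking the log-then-swallow behaviour described above):

import logging

def failing_cleanup():
    raise RuntimeError('cleanup failed')

def later_cleanup():
    print('later callbacks still run')

def run_callbacks(callbacks):
    # Log-then-swallow: later callbacks still run, but nothing is re-raised,
    # so a failed cleanup is invisible to the caller.
    for func, args, kwargs in callbacks:
        try:
            func(*args, **kwargs)
        except Exception:
            logging.exception('callback failed')

run_callbacks([(failing_cleanup, (), {}), (later_cleanup, (), {})])
# CPython's atexit would also run both handlers, but would re-raise the
# RuntimeError after all of them had been given a chance to run.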

moreati commented Nov 20 '25 15:11

  1. I think this has been happening as long as we've been using the kubernetes module, but there were some other environment changes around the same time that made it a little hard to pin down once we noticed it.
  2. The kubernetes collection is 4.0.0, since we're on ansible 2.14 and 5.0.0+ requires 2.15 or newer; I have not actually tried anything newer. kubernetes.core.k8s and kubernetes.core.k8s_info show the same behavior; the helm* modules do not, because they don't actually use the python kubernetes library.

Flow in the module appears to be:

  1. k8s_info module calls get_api_client https://github.com/ansible-collections/kubernetes.core/blob/main/plugins/modules/k8s_info.py#L222
  2. get_api_client calls kubernetes.config.load_kube_config https://github.com/ansible-collections/kubernetes.core/blob/main/plugins/module_utils/k8s/client.py#L118
  3. load_kube_config saves the cert objects in temp files and registers an atexit handler https://github.com/kubernetes-client/python/blob/master/kubernetes/base/config/kube_config.py#L72 (paraphrased in the sketch below)
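
The registration step in (3) looks roughly like this (paraphrased from the linked kube_config.py; the real code keys its cache on a hash of the content and handles a few more cases):

import atexit
import os
import tempfile

_temp_files = {}

def _cleanup_temp_files():
    global _temp_files
    for temp_file in _temp_files.values():
        try:
            os.remove(temp_file)
        except OSError:
            pass
    _temp_files = {}

def _create_temp_file_with_content(content):
    # The atexit handler is registered the first time a temp file is created.
    # Per the report above, Mitogen's AtExitWrapper intercepts this
    # registration, so _cleanup_temp_files never fires and the files pile up.
    if len(_temp_files) == 0:
        atexit.register(_cleanup_temp_files)
    fd, name = tempfile.mkstemp()
    os.close(fd)
    with open(name, 'wb') as fh:
        fh.write(content if isinstance(content, bytes) else content.encode())
    _temp_files[content] = name
    return name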

I ran some tests with multiple hosts, results below:

Run with multiple hosts:

---
- name: Testing
  hosts: all
  tasks:
    - name: Get coredns deployment
      kubernetes.core.k8s_info:
        api_version: apps/v1
        kind: Deployment
        name: coredns
        namespace: kube-system
[dmurnane@host1 ~]$ ansible-playbook test.yaml

PLAY [Testing] ***********************************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] ***************************************************************************************************************************************************************************************************************************
ok: [host1.local]
ok: [host2.local]

TASK [Get coredns deployment] ********************************************************************************************************************************************************************************************************************
ok: [host1.local]
ok: [host2.local]

PLAY RECAP ***************************************************************************************************************************************************************************************************************************************
host1.local  : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
host2.local  : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

[dmurnane@host1 ~]$ ansible all -m shell -a 'ls /tmp/tmp*'
host1.local | CHANGED | rc=0 >>
/tmp/tmphlqn0ofy
/tmp/tmpofeqcrjv
/tmp/tmps2qk05dh
host2.local | CHANGED | rc=0 >>
/tmp/tmp_dtmd9vn
/tmp/tmpi1_wz3g6
/tmp/tmpndcjjo96

Multiple hosts, delegated to the remote node:

---
- name: Testing
  hosts: all
  tasks:
    - name: Get coredns deployment
      kubernetes.core.k8s_info:
        api_version: apps/v1
        kind: Deployment
        name: coredns
        namespace: kube-system
      delegate_to: "{{ groups['all'][1] }}"
[dmurnane@host1 ~]$ ansible-playbook test.yaml

PLAY [Testing] ***********************************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] ***************************************************************************************************************************************************************************************************************************
ok: [host1.local]
ok: [host2.local]

TASK [Get coredns deployment] ********************************************************************************************************************************************************************************************************************
ok: [host1.local -> host2.local]
ok: [host2.local]
[WARNING]: Removed restricted key from module data: ansible_facts

PLAY RECAP ***************************************************************************************************************************************************************************************************************************************
host1.local  : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
host2.local  : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

[dmurnane@host1 ~]$ ansible all -m shell -a 'ls /tmp/tmp*'
host2.local | CHANGED | rc=0 >>
/tmp/tmpjj9_z9ch
/tmp/tmpxl7p64dq
/tmp/tmpxt09hg5b
host1.local | FAILED | rc=2 >>
ls: cannot access '/tmp/tmp*': No such file or directory
non-zero return code

Multiple hosts, delegated to the remote node, run_once:

---
- name: Testing
  hosts: all
  tasks:
    - name: Get coredns deployment
      kubernetes.core.k8s_info:
        api_version: apps/v1
        kind: Deployment
        name: coredns
        namespace: kube-system
      delegate_to: "{{ groups['all'][1] }}"
      run_once: true
[dmurnane@host1 ~]$ ansible-playbook test.yaml

PLAY [Testing] ***********************************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] ***************************************************************************************************************************************************************************************************************************
ok: [host1.local]
ok: [host2.local]

TASK [Get coredns deployment] ********************************************************************************************************************************************************************************************************************
ok: [host1.local -> host2.local]
[WARNING]: Removed restricted key from module data: ansible_facts

PLAY RECAP ***************************************************************************************************************************************************************************************************************************************
host1.local  : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
host2.local  : ok=1    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

[dmurnane@host1 ~]$ ansible all -m shell -a 'ls /tmp/tmp*'
host2.local | CHANGED | rc=0 >>
/tmp/tmp3io_zarw
/tmp/tmpmyvvqnj3
/tmp/tmppyz9iruo
host1.local | FAILED | rc=2 >>
ls: cannot access '/tmp/tmp*': No such file or directory
non-zero return code

dmurnane commented Nov 20 '25 16:11

  • Design opportunity? How might Ansible/Mitogen enable this cleanup without assuming atexit is the mechanism?

Ansible may already have this in Module.add_cleanup_file() and Module.cleanup().

Available since Ansible 1.7 https://github.com/ansible/ansible/commit/df877f2e79f3b5ddceb84dea6ee0dcd881e7c830
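
A minimal sketch of what that could look like from a module's point of view (hypothetical module; it assumes add_cleanup_file() queues the path for removal when exit_json()/fail_json() runs, and that the kubernetes client would expose its temp file paths for the collection to register):

import os
import tempfile

from ansible.module_utils.basic import AnsibleModule


def main():
    module = AnsibleModule(argument_spec=dict())
    fd, cert_path = tempfile.mkstemp(prefix='cert-')
    os.close(fd)
    # Queued for deletion by Ansible itself when the module exits, so the
    # cleanup no longer depends on an atexit handler ever firing.
    module.add_cleanup_file(cert_path)
    module.exit_json(changed=False, cert_path=cert_path)


if __name__ == '__main__':
    main()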

moreati commented Nov 24 '25 10:11

Patching _atexit__register in ansible_mitogen/runner.py with if func == shutil.rmtree or func == kubernetes.config.kube_config._cleanup_temp_files: appears to clean up the files but this feels rather more fragile than is reasonable.

kubernetes.config.kube_config._cleanup_temp_files is in the kubernetes-client Python library, not the kubernetes.core Ansible collection.

The atexit.register() is a few lines down, in kubernetes.config.kube_config._create_temp_file_with_content().

moreati commented Nov 24 '25 10:11

Problem has some design/constraint parallels with #1255, #1333 and #1342

  1. Should there be an allow list? A deny list?
  2. Should the list(s) be configurable? By API? By ...?

moreati commented Nov 24 '25 10:11

Speculating:

  • Does Mitogen need to persist the process that the Ansible module gets run in? Why?
  • Could Mitogen run the module in a per-task subprocess, spawned by a Mitogen child on the target that persists across tasks?

moreati commented Dec 03 '25 10:12

Multiple hosts, delegated to the remote node, run_once: ...

[dmurnane@host1 ~]$ ansible-playbook test.yaml
...

[dmurnane@host1 ~]$ ansible all -m shell -a 'ls /tmp/tmp*'
host2.local | CHANGED | rc=0 >>
/tmp/tmp3io_zarw
/tmp/tmpmyvvqnj3
/tmp/tmppyz9iruo
host1.local | FAILED | rc=2 >>
ls: cannot access '/tmp/tmp*': No such file or directory
non-zero return code

There may be a more fundamental bug here. The shell invocation to look for lingering files isn't done as part of the playbook run; it's an entirely new invocation of ansible. Any processes from the playbook tasks should be long gone, and their atexit handlers should have fired, regardless of Mitogen's different process lifetimes on the target. The files are still there, so either the atexit handlers weren't run at all - even at the end of the playbook - or they failed during execution.
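
One way to test that empirically (a sketch; atexit_probe is a hypothetical module, not part of any collection): run a module that does nothing except register an atexit handler which touches a marker file, then check for the marker after the play. If it never appears under the mitogen strategy but does under linear, the handlers registered inside the module runner are never being invoked at all.

import atexit

from ansible.module_utils.basic import AnsibleModule


def mark_fired():
    # If this file never appears on the target, neither the interpreter's own
    # atexit machinery nor Mitogen's deferred callbacks ever ran the handler.
    open('/tmp/atexit_probe_fired', 'w').close()


def main():
    module = AnsibleModule(argument_spec=dict())
    atexit.register(mark_fired)
    module.exit_json(changed=False)


if __name__ == '__main__':
    main()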

moreati commented Dec 03 '25 12:12