
Race condition for kubeadm_certificate_key when you have more than one master

Open DirkTheDaring opened this issue 1 year ago • 14 comments

The kubeadm_certificate_key variable is defined with another lazily resolved lookup pattern:

kubeadm_certificate_key: {{ lookup('password', credentials_dir + '/kubeadm_certificate_key.creds length=64 chars=hexdigits') | lower }}

If you have MORE than one master/control-plane node, this creates a race condition: each host's lookup tries to lock "kubeadm_certificate_key.creds", which leads to an exception. I fixed this in my environment 8 months ago with a patch using "run_once" and have been running with it ever since. The patch is at the end of this report.
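For context, here is a minimal reproduction sketch of the concurrent lookup described above (a hypothetical standalone playbook, not part of kubespray; it assumes an inventory group kube_control_plane with several hosts and a writable credentials directory on the controller). With the default forks setting, every control-plane host templates the variable at roughly the same time, and each evaluation runs the password lookup against the same .creds file on the Ansible controller:

# repro.yml - hypothetical reproduction sketch, not shipped with kubespray
- hosts: kube_control_plane
  gather_facts: false
  vars:
    credentials_dir: "{{ playbook_dir }}/credentials"
    # Lazily templated: the password lookup only runs when the variable is referenced
    kubeadm_certificate_key: "{{ lookup('password', credentials_dir + '/kubeadm_certificate_key.creds length=64 chars=hexdigits') | lower }}"
  tasks:
    - name: Reference the variable on every host, forcing concurrent lookups on the controller
      debug:
        msg: "{{ kubeadm_certificate_key }}"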

Environment: on-premise, machines running in KVM.

OS: Fedora CoreOS - always latest

  • Version of Ansible: 2.14.4

  • Version of Python: 3.11.4

Kubespray version (commit): 717183ecc

 PATH: "{{ bin_dir }}:{{ ansible_env.PATH }}"

notify: Master | restart kubelet

+  # The following variable contains another resolve pattern
+  # kubeadm_certificate_key:  {{ lookup('password', credentials_dir + '/kubeadm_certificate_key.creds length=64 chars=hexdigits') | lower }}
+  # if you have MORE than one master/control-plane, then this creates a race condition
+  # where the lookup tries to lock the "kubeadm_certificate_key.creds", which leads to
+  # an exception.
+  #
+  # Avoid the race condition by resolving the lookup only once, before the "Set kubeadm certificate key" step,
+  # because the lookup points at a static filename, so its result does not change.
+
+- name: set kubeadm_certificate_key_resolved to avoid race condition
+  set_fact:
+    kubeadm_certificate_key_resolved: "{{ kubeadm_certificate_key }}"
+  run_once: true
+
 - name: Set kubeadm certificate key
   set_fact:
     kubeadm_certificate_key: "{{ item | regex_search('--certificate-key ([^ ]+)', '\\1') | first }}"
   with_items: "{{ hostvars[groups['kube_control_plane'][0]]['kubeadm_init'].stdout_lines | default([]) }}"
   when:
-    - kubeadm_certificate_key is not defined
+    - kubeadm_certificate_key_resolved|length == 0
     - (item | trim) is match('.*--certificate-key.*')

 - name: Create hardcoded kubeadm token for joining nodes with 24h expiration (if defined)
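The same idea as a self-contained sketch, for readers who prefer to see the workaround outside the diff context (hypothetical playbook; variable names mirror the patch above). The lazily templated variable is resolved once, on a single host, so only one password lookup ever touches the credentials file; since run_once combined with set_fact makes the resulting fact available to every host in the play, later tasks can use the resolved value without triggering the lookup again:

# workaround-sketch.yml - illustrative only, not the actual kubespray task file
- hosts: kube_control_plane
  gather_facts: false
  vars:
    credentials_dir: "{{ playbook_dir }}/credentials"
    kubeadm_certificate_key: "{{ lookup('password', credentials_dir + '/kubeadm_certificate_key.creds length=64 chars=hexdigits') | lower }}"
  tasks:
    - name: Resolve the lookup exactly once to avoid lock contention on the .creds file
      set_fact:
        kubeadm_certificate_key_resolved: "{{ kubeadm_certificate_key }}"
      run_once: true

    - name: Use the resolved value on every host without re-running the lookup
      debug:
        msg: "{{ kubeadm_certificate_key_resolved }}"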

DirkTheDaring (Jul 28 '23 14:07)

Thanks very much, @DirkTheDaring.

Would you please post the error message of the exception? And, if possible, a PR for the issue would be welcome :-)

yankay (Aug 02 '23 03:08)

It actually solves https://github.com/kubernetes-sigs/kubespray/issues/9916, unlike the workaround that was provided there. What the function call in Ansible does is try to create a lockfile. Instead of following the classical approach (in the Ansible Python code) of waiting until the lock is released, the Ansible function runs into an exception (which actually defeats the intention of a lock, right?). So if, for example, 3 masters have to be created, the first one goes through (takes the lock) while the second or third (depending on who is first) throws the exception. It took me over a year to analyze this, because it happens randomly: sometimes, due to timing, it simply works, and other times it fails on every run - very nasty. Why not fix the Ansible code? There seem to be other dependencies there, so a fix in Ansible might even break more things. So the actual fix in kubespray is to not run this 3 times (== 3 masters) in parallel, as the lookup result does not change between those runs. See the fix above.

The error message is exactly this (and the FileExistsError comes from an existing lockfile): An unhandled exception occurred while templating '{{ lookup('password', credentials_dir + '/kubeadm_certificate_key.creds length=64 chars=hexdigits') | lower }}'. Error was a <class 'ansible.errors.AnsibleError'>, original message: An unhandled exception occurred while running the lookup plugin 'password'. Error was a <class 'FileExistsError'>
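To make the failure mode concrete, here is a generic sketch of the exclusive-create locking pattern (illustrative only, not taken from Ansible's actual plugin code): the variant that raises FileExistsError on the first collision versus one that waits for the lock holder:

# lock_sketch.py - generic illustration of the pattern, not Ansible's source
import os
import time

def take_lock_or_fail(lockfile: str) -> int:
    # O_CREAT | O_EXCL fails with FileExistsError if another process already
    # created the lock file -- the same exception type reported above.
    return os.open(lockfile, os.O_CREAT | os.O_EXCL | os.O_WRONLY)

def take_lock_with_retry(lockfile: str, timeout: float = 10.0) -> int:
    # A more forgiving variant: poll until the current holder removes the
    # lock file, and only give up after a timeout.
    deadline = time.monotonic() + timeout
    while True:
        try:
            return os.open(lockfile, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise
            time.sleep(0.1)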

DirkTheDaring (Aug 02 '23 06:08)

Hello, same issue for me.

smutel (Oct 09 '23 08:10)

The PR is working fine.

smutel (Oct 09 '23 08:10)

This looks like a bug in Ansible rather than in kubespray, and one which, additionally, has already been solved: https://github.com/ansible/ansible/commit/0971a342d85b77bc0aa54c73fd0f8cccc0d6b1f0

The fix in ansible seems ok to me, so it's weird you're hitting this at all... :thinking:

Do you have a setup where you can reliably reproduce it? I'm not able to.

VannTen (Nov 06 '23 16:11)

This fix is from 2018, and I have been hitting this problem for at least 3 years. I've tried everything - I can trigger this in 70% of all cases. The damn thing about a race condition is that it isn't easy to spot without spending a lot of time, which I will not put into this anymore (I've wasted enough on it).

DirkTheDaring (Nov 08 '23 23:11)

This fix is from 2018

Yeah, I know. What I mean is that it looks like an Ansible regression. Ideally we'd have a fix in Ansible and a workaround in kubespray until the fix makes its way into a released version.

VannTen (Nov 09 '23 08:11)

My patch is the workaround. So how do you want to continue?

DirkTheDaring (Nov 10 '23 15:11)

Looks like the regression could have been introduced by https://github.com/ansible/ansible/commit/0fd88717c953b92ed8a50495d55e630eb5d59166. I'll file an issue in Ansible for this, as it appears there isn't one yet.

evan-a-a (Jan 21 '24 18:01)

If you can, please link it here once it's filed, so we can track it and remove any workaround once it's fixed and we upgrade to a version of Ansible that includes the fix. :pray:

VannTen (Jan 22 '24 13:01)

I tried to reproduce this issue on 10 consecutive installs and never got this error. Using VM instances from Outscale, Rocky 9.3, kubespray 2.24.1, Ansible 8.5.0 (core 2.15.9), Python 3.9.18, vars from the sample inventory, 3 control plane nodes and 1 node.

From what I gathered of this, first there was issue #9916, which seems to be the original one. Then this issue was opened and a fix was proposed by the author in two PRs (#10394 closed in favor of #10523).

Since this issue seems to be pretty random and the proposed fix doesn’t deeply change anything (imo) but does solve the issue for some users, shouldn’t we move forward and merge #10523 with an additional FIXME note, as proposed in this comment?

nicolas-goudry (Mar 29 '24 16:03)

Not opposed to it, but I would prefer that there be an upstream issue we could point to in the FIXME. Otherwise, context gets lost very quickly, and you never know whether removing the workaround is safe enough.

VannTen (Mar 29 '24 17:03)

Hello,

I can confirm that I have encountered this bug "reliably" (i.e. each time I try to run kubespray to install a multi-master cluster). I also confirm that the patch works and fixes the issue.

I have only tested on the same set of machines (6 VMs), but on this set of machines I have a 100% failure rate without the patch and a 100% success rate with it. I have tried something like 15 times without and 5 times with. I have also reset the machines to zero at least 5 times to replay the whole sequence from scratch after a successful install, to see if it was a random event. It does not seem random on my setup.

Versions:

  • kubespray at #4baa2c870 (tried different changesets with the same result)
  • ansible [core 2.16.5]
  • python version = 3.10.12
  • jinja version = 3.1.2

Hope this info helps somewhat. Cheers, Florent

faide (May 06 '24 09:05)

More info: on the 6 machines, I asked for 3 masters and 3 workers each time.

faide (May 06 '24 09:05)