
fix: use super-admin.conf for kube-vip on first CP

sathieu opened this pull request 1 year ago • 12 comments

What type of PR is this?

/kind bug

What this PR does / why we need it:

Which issue(s) this PR fixes: Fixes #11229

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

sathieu avatar May 28 '24 16:05 sathieu

Hi @sathieu. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar May 28 '24 16:05 k8s-ci-robot

Hi @sathieu , I tried your PR and I get the following error. I think the reason is that "Kube-vip | Write static pod" is run before "Set fact first_kube_control_plane".

ansible.errors.AnsibleUndefinedVariable: 'first_kube_control_plane' is undefined. 'first_kube_control_plane' is undefined                                                            
fatal: [cp-1]: FAILED! => {                                                                                                                                                      
    "changed": false,                                                                                                                                                                
    "msg": "AnsibleUndefinedVariable: 'first_kube_control_plane' is undefined. 'first_kube_control_plane' is undefined"                                                              
}  

andrewfung729 avatar May 29 '24 05:05 andrewfung729

/ok-to-test

yankay avatar May 29 '24 09:05 yankay

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sathieu Once this PR has been reviewed and has the lgtm label, please assign floryut for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar May 30 '24 17:05 k8s-ci-robot

FYI, I tried the patch and hit the same issue of 'first_kube_control_plane' not defined. I did this "hack" and was able to get this to work in a bare-metal, off-prem configuration:

diff --git a/roles/kubernetes/node/tasks/loadbalancer/kube-vip.yml b/roles/kubernetes/node/tasks/loadbalancer/kube-vip.yml
index f7b04a624..b5acdac8c 100644
--- a/roles/kubernetes/node/tasks/loadbalancer/kube-vip.yml
+++ b/roles/kubernetes/node/tasks/loadbalancer/kube-vip.yml
@@ -6,6 +6,10 @@
     - kube_proxy_mode == 'ipvs' and not kube_proxy_strict_arp
     - kube_vip_arp_enabled

+- name: Kube-vip | Check if first control plane
+  set_fact:
+    is_first_control_plane: "{{ inventory_hostname == groups['kube_control_plane'] | first }}"
+
 - name: Kube-vip | Write static pod
   template:
     src: manifests/kube-vip.manifest.j2
diff --git a/roles/kubernetes/node/templates/manifests/kube-vip.manifest.j2 b/roles/kubernetes/node/templates/manifests/kube-vip.manifest.j2
index 11a971e93..7b59bca4c 100644
--- a/roles/kubernetes/node/templates/manifests/kube-vip.manifest.j2
+++ b/roles/kubernetes/node/templates/manifests/kube-vip.manifest.j2
@@ -119,6 +119,6 @@ spec:
   hostNetwork: true
   volumes:
   - hostPath:
-      path: /etc/kubernetes/admin.conf
+      path: /etc/kubernetes/{% if is_first_control_plane %}super-{% endif %}admin.conf
     name: kubeconfig
 status: {}

pmichali avatar Jun 01 '24 20:06 pmichali

> FYI, I tried the patch and hit the same issue of 'first_kube_control_plane' not defined. I did this "hack" and was able to get this to work in a bare-metal, off-prem configuration:

You are not testing the latest version, see https://github.com/kubernetes-sigs/kubespray/commit/0d8323789aba03e04dc8afaf6d06be9bdc87d600.

sathieu avatar Jun 03 '24 09:06 sathieu

> FYI, I tried the patch and hit the same issue of 'first_kube_control_plane' not defined. I did this "hack" and was able to get this to work in a bare-metal, off-prem configuration:

> You are not testing the latest version, see 0d83237.

0d83237 is based off of 4b82e90dc. My diff is based off of the newer 351393e32 commit on the master branch. As mentioned by @andrewfung729, first_kube_control_plane is not defined when kube-vip.manifest.j2 is processed. My change defines the fact right before kube-vip.manifest.j2 is processed; I just used another name, is_first_control_plane.

pmichali avatar Jun 03 '24 12:06 pmichali

CI is failing for a probably unrelated reason; can someone take a look?

I think fixing this issue is easy, but adding a proper test is harder (because we don't have an external IP available).

sathieu avatar Jun 03 '24 13:06 sathieu

Now failing with:

No such file or directory - qemu-img

(unrelated to this PR)

sathieu avatar Jun 06 '24 10:06 sathieu

I tried the fix on a 2-control-plane cluster with kubespray v2.25.0 and it's failing for the second master:

TASK [kubernetes/control-plane : Wait for k8s apiserver] *****************************************************************************************
ok: [first_cp]
fatal: [second_cp]: FAILED! => {"changed": false, "elapsed": 181, "msg": "Timeout when waiting for lb-apiserver.kubernetes.local:6443"}

Am I missing anything?

EDIT: sorry, I was using the wrong interface. It worked fine and the playbook finished successfully.

opethema avatar Jun 20 '24 16:06 opethema

@sathieu FYI, I tried the latest patch on a 3 control plane/4 worker node on-prem k8s 1.29.5 cluster using Kubespray (2 week old pull of master branch, post 2.25.0) and it worked fine! Thanks for working on this.

pmichali avatar Jun 21 '24 12:06 pmichali

I've removed the CI part as it doesn't work (we need a valid VIP).

Please review and merge 🙏

sathieu avatar Jun 27 '24 12:06 sathieu

Can someone review and hopefully merge this PR?

@yankay @ErikJiang @floryut @tu1h @KubeKyrie (some of the reviewers of past kube-vip PRs) 🙏

sathieu avatar Jul 03 '24 10:07 sathieu

Hi, for the first installation it works fine, but when I want to test removal of the first control-plane node, I change the order of the nodes in the inventory file and rerun cluster.yml. It ran without errors, but the kube-vip container on the new first master started crashing with

panic: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory

The problem was that super-admin.conf gets created only on the first run on the first master; when I changed the order of the masters, the new first master's kube-vip manifest looked for super-admin.conf too and failed. When I changed it back to admin.conf, it started working again.

I added a file-check task before the "Kube-vip | Write static pod" task and changed the template accordingly; now it also works when the first control-plane node changes in the inventory order.

diff --git a/roles/kubernetes/node/tasks/loadbalancer/kube-vip.yml b/roles/kubernetes/node/tasks/loadbalancer/kube-vip.yml
index 7e3471593..4e3b3478a 100644
--- a/roles/kubernetes/node/tasks/loadbalancer/kube-vip.yml
+++ b/roles/kubernetes/node/tasks/loadbalancer/kube-vip.yml
@@ -6,6 +6,13 @@
     - kube_proxy_mode == 'ipvs' and not kube_proxy_strict_arp
     - kube_vip_arp_enabled

+- name: Kube-vip | Check if super-admin.conf exists
+  stat:
+    path: "{{ kube_config_dir }}/super-admin.conf"
+  failed_when: false
+  changed_when: false
+  register: stat_kube_vip_super_admin
+
 - name: Kube-vip | Write static pod
   template:
     src: manifests/kube-vip.manifest.j2
diff --git a/roles/kubernetes/node/templates/manifests/kube-vip.manifest.j2 b/roles/kubernetes/node/templates/manifests/kube-vip.manifest.j2
index 11a971e93..8276435a1 100644
--- a/roles/kubernetes/node/templates/manifests/kube-vip.manifest.j2
+++ b/roles/kubernetes/node/templates/manifests/kube-vip.manifest.j2
@@ -119,6 +119,6 @@ spec:
   hostNetwork: true
   volumes:
   - hostPath:
-      path: /etc/kubernetes/admin.conf
+      path: /etc/kubernetes/{% if inventory_hostname == (groups['kube_control_plane'] | first) and stat_kube_vip_super_admin.stat.exists and stat_kube_vip_super_admin.stat.isreg %}super-{% endif %}admin.conf
     name: kubeconfig
 status: {}

I hope you can add this to the PR before it is merged.

Seljuke avatar Aug 05 '24 16:08 Seljuke

I updated the PR, inspired by @Seljuke's comment https://github.com/kubernetes-sigs/kubespray/pull/11242#issuecomment-2269450109.

Can someone review it? This is a blocker for us...

sathieu avatar Aug 06 '24 15:08 sathieu

@sathieu I found a bug in my solution. When running cluster setup for the first time on fresh nodes, the kube-vip static pod manifest is created before kubeadm runs, so super-admin.conf can't be found and the if condition in the Jinja template always falls back to admin.conf. We have to check whether kubeadm has run before and apply the logic based on that. Here is a solution covering this case too:

diff --git a/roles/kubernetes/node/templates/manifests/kube-vip.manifest.j2 b/roles/kubernetes/node/templates/manifests/kube-vip.manifest.j2
index 11a971e93..b34588552 100644
--- a/roles/kubernetes/node/templates/manifests/kube-vip.manifest.j2
+++ b/roles/kubernetes/node/templates/manifests/kube-vip.manifest.j2
@@ -119,6 +119,6 @@ spec:
   hostNetwork: true
   volumes:
   - hostPath:
-      path: /etc/kubernetes/admin.conf
+      path: /etc/kubernetes/{% if inventory_hostname == (groups['kube_control_plane'] | first) and ((stat_kube_vip_super_admin.stat.exists and stat_kube_vip_super_admin.stat.isreg) or (not kubeadm_already_run.stat.exists )) %}super-{% endif %}admin.conf
     name: kubeconfig
diff --git a/roles/kubernetes/node/tasks/loadbalancer/kube-vip.yml b/roles/kubernetes/node/tasks/loadbalancer/kube-vip.yml
index 7e3471593..392ee4e88 100644
--- a/roles/kubernetes/node/tasks/loadbalancer/kube-vip.yml
+++ b/roles/kubernetes/node/tasks/loadbalancer/kube-vip.yml
@@ -6,6 +6,21 @@
     - kube_proxy_mode == 'ipvs' and not kube_proxy_strict_arp
     - kube_vip_arp_enabled

+- name: Kube-vip | Check if super-admin.conf exists
+  stat:
+    path: "{{ kube_config_dir }}/super-admin.conf"
+  failed_when: false
+  changed_when: false
+  register: stat_kube_vip_super_admin
+
+- name: Kube-vip | Check if kubeadm has already run
+  stat:
+    path: "/var/lib/kubelet/config.yaml"
+    get_attributes: no
+    get_checksum: no
+    get_mime: no
+  register: kubeadm_already_run
+
 - name: Kube-vip | Write static pod
   template:
     src: manifests/kube-vip.manifest.j2

Seljuke avatar Aug 06 '24 23:08 Seljuke

@Seljuke Could you create a PR with your changes? You have done more testing than me.

Maybe the condition should be kubeadm not already run and super-admin.conf exists (because when kubeadm has run, admin.conf should work)?

sathieu avatar Aug 07 '24 06:08 sathieu
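As a sketch of that suggested simplification (not the merged implementation; the exact condition was still under discussion, and the registered variable name follows the stat task in the diffs above):

```yaml
# Hypothetical sketch only: select super-admin.conf on the first control
# plane while kubeadm has not yet run (i.e. during initial bootstrap).
# kubeadm_already_run is the variable registered by the stat task in the
# diff above; once kubeadm has run, plain admin.conf should work.
- hostPath:
    path: "/etc/kubernetes/{% if inventory_hostname == (groups['kube_control_plane'] | first) and not kubeadm_already_run.stat.exists %}super-{% endif %}admin.conf"
  name: kubeconfig
```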

> @Seljuke Could you create a PR with your changes? You have done more testing than me.

> Maybe the condition should be kubeadm not already run and super-admin.conf exists (because when kubeadm has run, admin.conf should work)?

PR #11422 created. super-admin.conf gets created in the kubeadm init phase kubeconfig super-admin phase, and for admin.conf to work, RBAC must be fully configured, which won't happen until the first control plane starts.

Seljuke avatar Aug 07 '24 10:08 Seljuke
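For context, the phase Seljuke refers to can also be invoked standalone (a sketch; assumes kubeadm v1.29+ on a control-plane node, run as root):

```
# Regenerates only the super-admin kubeconfig, writing
# /etc/kubernetes/super-admin.conf; a sub-phase of
# "kubeadm init phase kubeconfig".
kubeadm init phase kubeconfig super-admin
```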

Closing in favor of https://github.com/kubernetes-sigs/kubespray/pull/11422. Thanks @Seljuke

sathieu avatar Aug 07 '24 13:08 sathieu