upgrade-health-check Job fails on a single control plane node cluster after drain

Open ilia1243 opened this issue 10 months ago • 11 comments

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version): v1.30.0

Environment:

  • Kubernetes version (use kubectl version): v1.30.0
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release): Ubuntu 22.04.1 LTS
  • Kernel (e.g. uname -a): 5.15.0-50-generic
  • Container runtime (CRI) (e.g. containerd, cri-o): containerd=1.6.12-0ubuntu1~22.04.3
  • Container networking plugin (CNI) (e.g. Calico, Cilium): calico
  • Others:

What happened?

  1. Install a single control plane node cluster v1.29.1
  2. Drain the only node
  3. kubeadm upgrade apply v1.30.0 fails with
[ERROR CreateJob]: Job "upgrade-health-check-lvr8s" in the namespace "kube-system" did not complete in 15s: client rate limiter Wait returned an error: context deadline exceeded

It seems that previously the pod was Pending as well, but this was ignored, because the job was successfully deleted in the defer and the return value was overridden with nil.

// The deferred delete always overwrites lastError, so an earlier
// "job did not complete" error is replaced with nil once the delete succeeds.
defer func() {
    lastError = deleteHealthCheckJob(client, ns, jobName)
}()

https://github.com/kubernetes/kubernetes/blob/v1.29.1/cmd/kubeadm/app/phases/upgrade/health.go#L151
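
For comparison, a minimal sketch (not the actual kubeadm fix) of a deferred cleanup that keeps the first error instead of silently overwriting it:

// Sketch only: surface the cleanup error only if nothing failed before it,
// so a "job did not complete" error is not masked by a successful delete.
defer func() {
    if err := deleteHealthCheckJob(client, ns, jobName); err != nil && lastError == nil {
        lastError = err
    }
}()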

Similar issue #2035

What you expected to happen?

There might be no need to create the job.

How to reproduce it (as minimally and precisely as possible)?

See What happened?

ilia1243 avatar Apr 23 '24 16:04 ilia1243

thanks for testing @ilia1243

this is a tricky problem, but either way there is a regression in 1.30 that we need to fix.

the only problem here seems to be with the CreateJob logic.

k drain node
sudo kubeadm upgrade apply -f v1.30.0 --ignore-preflight-errors=CreateJob

^ this completes the upgrade of a single node CP and addons are applied correctly. but the CreateJob check will always fail.

one option is to skip this check if there is a single CP node in the cluster. WDYT?

cc @SataQiu @pacoxu @carlory

neolit123 avatar Apr 23 '24 17:04 neolit123

one option is to skip this check if there is a single CP node in the cluster.

another option (for me less preferred) is to make the CreateJob health check return a warning instead of an error. it will always show a warning on single node CP cluster.

neolit123 avatar Apr 23 '24 17:04 neolit123

+1 for skip

I need to check it again. I think I failed for another reason: I did not install a CNI for the control plane and the pod failed due to the missing CNI (not sure if this is a general use case; in that case the job should run with hostNetwork). I will do some tests tomorrow.

pacoxu avatar Apr 23 '24 17:04 pacoxu

My suggestion is to print a warning and skip Job creation when there are no nodes to schedule. WDYT?

SataQiu avatar Apr 24 '24 01:04 SataQiu

My suggestion is to print a warning and skip Job creation when there are no nodes to schedule. WDYT?

IIUC, the only way to test if a job pod can schedule somewhere is to try to create the same job? the problem is that this preflight check's purpose is exactly that: to check whether the cluster accepts workloads.

i don't even remember why we added it, but now we need to fix it right away. perhaps later we can discuss removing it.
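
for reference, a rough client-go sketch of the create-a-Job-and-poll pattern this check is built around (simplified, not the kubeadm implementation; the helper name is made up):

package upgradehealth

import (
    "context"
    "time"

    batchv1 "k8s.io/api/batch/v1"
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/wait"
    "k8s.io/client-go/kubernetes"
)

// waitForJobComplete polls until the named Job reports the Complete condition
// or the timeout expires. On a drained single-node cluster the pod stays
// Pending, so the Complete condition is never reached and the check fails.
func waitForJobComplete(ctx context.Context, client kubernetes.Interface, ns, name string, timeout time.Duration) error {
    return wait.PollUntilContextTimeout(ctx, time.Second, timeout, true,
        func(ctx context.Context) (bool, error) {
            job, err := client.BatchV1().Jobs(ns).Get(ctx, name, metav1.GetOptions{})
            if err != nil {
                return false, nil // keep polling through transient API errors
            }
            for _, c := range job.Status.Conditions {
                if c.Type == batchv1.JobComplete && c.Status == corev1.ConditionTrue {
                    return true, nil
                }
            }
            return false, nil
        })
}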

neolit123 avatar Apr 24 '24 04:04 neolit123

My suggestion is to print a warning and skip Job creation when there are no nodes to schedule. WDYT?

we could look at the Unschedulable taint on nodes, which means they were cordoned.

but listing all nodes on every kubeadm upgrade command would be very expensive in large clusters with many nodes.

so i am starting to think we should just convert this check to a preflight warning.
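
for context, a rough client-go sketch of what a cordon check could look like if scoped to control-plane nodes only, which keeps the list call cheap (hypothetical helper, not the kubeadm code):

package upgradehealth

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// allControlPlaneNodesCordoned lists only the control-plane nodes (via label
// selector, so the call stays small even in large clusters) and reports
// whether every one of them is cordoned.
func allControlPlaneNodesCordoned(ctx context.Context, client kubernetes.Interface) (bool, error) {
    nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{
        LabelSelector: "node-role.kubernetes.io/control-plane",
    })
    if err != nil {
        return false, err
    }
    for _, n := range nodes.Items {
        if !n.Spec.Unschedulable {
            return false, nil
        }
    }
    return len(nodes.Items) > 0, nil
}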

neolit123 avatar Apr 24 '24 04:04 neolit123

I'm not sure whether this is the expected patch for this issue?

i.e. add a new toleration to the job, like

{key=node.kubernetes.io/unschedulable, effect:NoSchedule} 
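
expressed as a Go sketch against the API types (hypothetical helper name, not the actual patch), that toleration would look roughly like:

package upgradehealth

import (
    batchv1 "k8s.io/api/batch/v1"
    corev1 "k8s.io/api/core/v1"
)

// addUnschedulableToleration would let the health-check Job's pod tolerate
// the taint that cordon/drain places on a node.
func addUnschedulableToleration(job *batchv1.Job) {
    job.Spec.Template.Spec.Tolerations = append(
        job.Spec.Template.Spec.Tolerations,
        corev1.Toleration{
            Key:      "node.kubernetes.io/unschedulable",
            Operator: corev1.TolerationOpExists,
            Effect:   corev1.TaintEffectNoSchedule,
        },
    )
}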

FYI:

  • https://github.com/kubernetes/kubernetes/blob/e59eceec480e1e181e38bc29e2c01652ec3c671c/cmd/kubeadm/app/phases/upgrade/health.go#L126
  • https://github.com/kubernetes/kubernetes/blob/e59eceec480e1e181e38bc29e2c01652ec3c671c/pkg/scheduler/framework/plugins/nodeunschedulable/node_unschedulable.go#L32

Or just convert this check to a preflight warning?

carlory avatar Apr 24 '24 10:04 carlory

Or just convert this check to a preflight warning?

i have a WIP PR for this.

I'm not sure whether it is an expected patch for this issue?

i don't know... ideally a node should be drained before upgrading kubelet. so if we allow pods to schedule after the node is drained with the {key=node.kubernetes.io/unschedulable, effect:NoSchedule} hack, we are breaking this rule. i don't even know if it will work.

we do upgrade coredns and kube-proxy for a single-node cluster while the node is drained with kubeadm upgrade apply, but we ignore daemon sets anyway, and the coredns pods will remain Pending if the node is not schedulable. so technically, for the addons, we don't schedule new pods IIUC.

neolit123 avatar Apr 24 '24 10:04 neolit123

i have a WIP PR for this.

please see https://github.com/kubernetes/kubernetes/pull/124503 and my comments there.

neolit123 avatar Apr 24 '24 10:04 neolit123

@carlory came up with a good idea for how to catch the scenario: https://github.com/kubernetes/kubernetes/pull/124503#discussion_r1577693197. the PR is updated.

more reviews are appreciated.

neolit123 avatar Apr 25 '24 09:04 neolit123

the fix will be included in 1.30.1: https://github.com/kubernetes/kubernetes/pull/124570

neolit123 avatar Apr 26 '24 17:04 neolit123

1.30.1 is out with the fix

neolit123 avatar May 17 '24 13:05 neolit123