upgrade-health-check Job fails on a single control plane node cluster after drain
Is this a BUG REPORT or FEATURE REQUEST?
BUG REPORT
Versions
kubeadm version (use kubeadm version): v1.30.0
Environment:
- Kubernetes version (use kubectl version): v1.30.0
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release): Ubuntu 22.04.1 LTS
- Kernel (e.g. uname -a): 5.15.0-50-generic
- Container runtime (CRI) (e.g. containerd, cri-o): containerd=1.6.12-0ubuntu1~22.04.3
- Container networking plugin (CNI) (e.g. Calico, Cilium): calico
- Others:
What happened?
- Install a single control plane node cluster v1.29.1
- Drain the only node
- kubeadm upgrade apply v1.30.0 fails with:
  [ERROR CreateJob]: Job "upgrade-health-check-lvr8s" in the namespace "kube-system" did not complete in 15s: client rate limiter Wait returned an error: context deadline exceeded
It seems that previously the pod was Pending as well, but this was ignored, because the job was successfully deleted in the defer and the return value was overridden with nil:
defer func() {
	lastError = deleteHealthCheckJob(client, ns, jobName)
}()
https://github.com/kubernetes/kubernetes/blob/v1.29.1/cmd/kubeadm/app/phases/upgrade/health.go#L151
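For illustration, here is a self-contained Go sketch (not the kubeadm code itself; the names are made up) of how a deferred assignment to a named return value swallows the earlier error:

package main

import (
	"errors"
	"fmt"
)

// checkJob mimics the pattern above: the body returns a "pod is Pending"
// error, but the deferred cleanup overwrites the named return lastError
// with its own result (nil on success), so the caller never sees the error.
func checkJob() (lastError error) {
	defer func() {
		// stands in for deleteHealthCheckJob succeeding
		lastError = nil
	}()
	return errors.New("job pod is still Pending")
}

func main() {
	fmt.Println(checkJob()) // prints <nil>; the Pending error is lost
}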
Similar issue #2035
What you expected to happen?
There might be no need to create the job.
How to reproduce it (as minimally and precisely as possible)?
See What happened?
thanks for testing @ilia1243
this is a tricky problem, but either way there is a regression in 1.30 that we need to fix.
the only problem here seems to be with the CreateJob logic.
k drain node
sudo kubeadm upgrade apply -f v1.30.0 --ignore-preflight-errors=CreateJob
^ this completes the upgrade of a single-node CP and the addons are applied correctly, but the CreateJob check will always fail.
one option is to skip this check if there is a single CP node in the cluster. WDYT?
cc @SataQiu @pacoxu @carlory
one option is to skip this check if there is a single CP node in the cluster.
another option (for me less preferred) is to make the CreateJob health check return a warning instead of an error; it will always show a warning on a single-node CP cluster.
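For reference, a rough sketch of the literal reading of the "skip when there is a single CP node" option, counting nodes labeled node-role.kubernetes.io/control-plane (shouldSkipCreateJobCheck is a hypothetical helper assuming a client-go clientset, not the actual kubeadm change):

package healthcheck

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// shouldSkipCreateJobCheck returns true when the cluster has a single control
// plane node; in the reported scenario, draining that (only) node leaves
// nowhere for the health check Job's pod to schedule.
func shouldSkipCreateJobCheck(ctx context.Context, client kubernetes.Interface) (bool, error) {
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{
		LabelSelector: "node-role.kubernetes.io/control-plane",
	})
	if err != nil {
		return false, err
	}
	return len(nodes.Items) == 1, nil
}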
+1 for skip
I need to check it again. I think I failed for another reason: I did not install a CNI for the control plane and the pod failed due to no CNI (not sure if this is a general use case; in that case the job should run in hostNetwork). I will do some tests tomorrow.
My suggestion is to print a warning and skip Job creation when there are no nodes to schedule. WDYT?
> My suggestion is to print a warning and skip Job creation when there are no nodes to schedule. WDYT?
IIUC, the only way to test if a job pod can schedule somewhere is to try to create the same job? the problem is that this preflight check's purpose is exactly that - to check if the cluster accepts workloads.
i don't even remember why we added it, but now we need to fix it right away. perhaps later we can discuss removing it.
we could look at the Unschedulable taint on nodes, which means they were cordoned.
but listing all nodes on every kubeadm upgrade command will be very expensive in large clusters with many nodes.
so i am starting to think we should just convert this check to a preflight warning.
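A minimal sketch of what the "look at Unschedulable taints" idea above amounts to, assuming a client-go clientset (anySchedulableNode is a hypothetical helper, not existing kubeadm code); the full node list is what makes it costly at scale:

package healthcheck

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// anySchedulableNode returns true if at least one node is not cordoned,
// i.e. its spec does not mark it Unschedulable (cordoned nodes also carry
// the node.kubernetes.io/unschedulable taint).
func anySchedulableNode(ctx context.Context, client kubernetes.Interface) (bool, error) {
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err
	}
	for _, node := range nodes.Items {
		if !node.Spec.Unschedulable {
			return true, nil
		}
	}
	return false, nil
}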
I'm not sure whether this is the expected patch for this issue, i.e. adding a new toleration to the job like
{key: node.kubernetes.io/unschedulable, effect: NoSchedule}
FYI:
- https://github.com/kubernetes/kubernetes/blob/e59eceec480e1e181e38bc29e2c01652ec3c671c/cmd/kubeadm/app/phases/upgrade/health.go#L126
- https://github.com/kubernetes/kubernetes/blob/e59eceec480e1e181e38bc29e2c01652ec3c671c/pkg/scheduler/framework/plugins/nodeunschedulable/node_unschedulable.go#L32
Or just convert this check to a preflight warning?
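For concreteness, a sketch of what the suggested toleration could look like on the health check Job's pod template (addUnschedulableToleration is a hypothetical helper, not what the linked health.go currently does):

package healthcheck

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// addUnschedulableToleration appends the node.kubernetes.io/unschedulable
// toleration to the Job's pod template so the pod could land on a cordoned
// node; whether that is desirable is exactly what is debated below.
func addUnschedulableToleration(job *batchv1.Job) {
	job.Spec.Template.Spec.Tolerations = append(job.Spec.Template.Spec.Tolerations,
		corev1.Toleration{
			Key:      "node.kubernetes.io/unschedulable",
			Operator: corev1.TolerationOpExists,
			Effect:   corev1.TaintEffectNoSchedule,
		})
}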
> Or just convert this check to a preflight warning?
i have a WIP PR for this.
> I'm not sure whether this is the expected patch for this issue
i don't know... ideally a node should be drained before upgrading the kubelet.
so if we allow pods to schedule after the node is drained with the {key: node.kubernetes.io/unschedulable, effect: NoSchedule} hack, we are breaking this rule. i don't even know if it will work.
we do upgrade coredns and kube-proxy for a single-node cluster while the node is drained with kubeadm upgrade apply, but we do ignore daemon sets anyway and the coredns pods will remain Pending if the node is not schedulable. so technically for the addons we don't schedule new pods IIUC.
i have a WIP PR for this: please see https://github.com/kubernetes/kubernetes/pull/124503 and my comments there.
@carlory came up with a good idea for how to catch the scenario: https://github.com/kubernetes/kubernetes/pull/124503#discussion_r1577693197 - the PR has been updated.
more reviews are appreciated.
the fix will be added to 1.30.1: https://github.com/kubernetes/kubernetes/pull/124570
1.30.1 is out with the fix