Join of worker node fails for specific k8s versions

Open tehcyx opened this issue 2 years ago • 8 comments

Unsure if this is a kind problem at all or more a kubeadm problem or a known problem with those specific k8s versions. But here goes.

What happened:

ERROR: failed to create cluster: failed to join node with kubeadm: command "docker exec --privileged kind-worker kubeadm join --config /kind/kubeadm.conf --skip-phases=preflight --v=6" failed with error: exit status 1

What you expected to happen: Cluster to be created successfully

How to reproduce it (as minimally and precisely as possible): Create a cluster with this config:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  image: kindest/node:v1.24.0
- role: worker
  image: kindest/node:v1.23.4
- role: worker
  image: kindest/node:v1.24.0
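
For completeness, this is roughly how I drive the repro and collect logs when the join fails (the file and directory names here are just examples):

# --retain keeps the failed nodes around so the kubeadm join failure can be inspected
kind create cluster --config kind-multi-version.yaml --retain
# export kubeadm/kubelet/containerd logs from each node, then clean up
kind export logs ./kind-logs
kind delete cluster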

Anything else we need to know?: It works if I use a different worker image and have the other nodes fall back to the standard node image from kind v0.16.0:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
  image: kindest/node:v1.24.6@sha256:97e8d00bc37a7598a0b32d1fabd155a96355c49fa0d4d4790aab0f161bf31be1
- role: worker

Environment: From what I saw, this fails as early as kind v0.15.0 and is still present in v0.16.0.

  • kind version: (use kind version): kind v0.16.0
  • Kubernetes version: (use kubectl version): v1.25.0
  • Docker version: (use docker info): docker desktop v4.12.0
  • OS (e.g. from /etc/os-release): Windows, as well as Github Actions runners

tehcyx avatar Sep 26 '22 19:09 tehcyx

1.23.4 is a very old node image from the v0.12.0 kind release; please use an image from the current release (or build your own with the current release to get arbitrary k8s patch versions).

https://kind.sigs.k8s.io/docs/user/quick-start/#creating-a-cluster

Prebuilt images are hosted at kindest/node, but to find images suitable for a given release you should currently check the release notes for your kind version (check with kind version), where you'll find a complete listing of images created for that kind release.
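
For example, something like this (the v1.24.6 digest is the one from the config above; the file name is just an example):

# check which kind release you're running; its release notes list the node images (with digests) built for that release
kind version

# then pin nodes to images from that listing:
cat > kind-pinned.yaml <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
  image: kindest/node:v1.24.6@sha256:97e8d00bc37a7598a0b32d1fabd155a96355c49fa0d4d4790aab0f161bf31be1
EOF
kind create cluster --config kind-pinned.yaml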

BenTheElder avatar Sep 26 '22 19:09 BenTheElder

Understood, I'm already using newer images; I just found it weird that one of my old tests was failing and thought it might be worth reporting.

tehcyx avatar Sep 26 '22 19:09 tehcyx

ACK, thanks. Please let me know if you see this with current patch versions; it's definitely possible there's a k8s bug. We also changed some things around v0.13 related to cgroups management that might cause issues, and we'll probably have to stop supporting old images again in the future 😅

BenTheElder avatar Sep 26 '22 19:09 BenTheElder

I actually found it was still working in v0.14 as well; v0.15 must be the first version where it failed.

tehcyx avatar Sep 26 '22 19:09 tehcyx

But does it fail with current node images? There have been a lot of PRs to Kubernetes 1.23 / 1.24 since 1.23.4 / 1.24.0.

BenTheElder avatar Sep 26 '22 20:09 BenTheElder

Nope, like I said in the initial report, e.g. this config works:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
  image: kindest/node:v1.24.6@sha256:97e8d00bc37a7598a0b32d1fabd155a96355c49fa0d4d4790aab0f161bf31be1
- role: worker

which would pull in v0.16.0 node images for control plane and workers, except the hardcoded one.

Unless you mean the current 1.23/1.24 images, in which case I'll have to try that out right now.

tehcyx avatar Sep 26 '22 20:09 tehcyx

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  image: kindest/node:v1.24.6@sha256:97e8d00bc37a7598a0b32d1fabd155a96355c49fa0d4d4790aab0f161bf31be1
- role: worker
  image: kindest/node:v1.23.12@sha256:9402cf1330bbd3a0d097d2033fa489b2abe40d479cc5ef47d0b6a6960613148a
- role: worker
  image: kindest/node:v1.24.6@sha256:97e8d00bc37a7598a0b32d1fabd155a96355c49fa0d4d4790aab0f161bf31be1

This config works without problems.

tehcyx avatar Sep 26 '22 20:09 tehcyx

Thanks -- this makes me think there's a Kubernetes bug that was fixed between 1.23.4 ... 1.23.12 or 1.24.0 ... 1.24.6. It's possible instead that images from past kind releases have a bug related to this, but I can't think of a relevant change.

BenTheElder avatar Sep 28 '22 02:09 BenTheElder

@tehcyx I faced a similar problem while spawning a cluster with more than three workers. kind worked perfectly with fewer than three workers in my case. I tried with 1.23, 1.24, and 1.25, and even tried building the node image from the latest k8s source, but the problem persisted.

After inspecting the kubelet logs, I found the issue to be "too many open files" errors raised by inotify.

I later found this in the known issues as well. Can you try increasing the limits as described at the following link: https://kind.sigs.k8s.io/docs/user/known-issues/#pod-errors-due-to-too-many-open-files
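
For reference, the fix from that page is to raise the host's inotify limits, roughly like this (the values are the ones suggested on the linked page; adjust as needed):

# raise the limits for the current boot
sudo sysctl fs.inotify.max_user_watches=524288
sudo sysctl fs.inotify.max_user_instances=512

# on many distros you can persist them across reboots in /etc/sysctl.conf
echo "fs.inotify.max_user_watches = 524288" | sudo tee -a /etc/sysctl.conf
echo "fs.inotify.max_user_instances = 512" | sudo tee -a /etc/sysctl.conf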

aroradaman avatar Oct 30 '22 20:10 aroradaman

Yeah, based on issue reports here I have reason to suspect recent Kubernetes releases increased the number of inotify watches used on a typical kind cluster, but I haven't had time to investigate this (also, it's not clear that would be considered a bug in Kubernetes). Unfortunately, since that limit is not namespaced we do not touch it, but it's a common issue with multi-node KIND clusters in general.
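
If you want to check what your host is set to (these limits are host-wide and shared by every kind node, which is why multi-node clusters tend to hit them first):

cat /proc/sys/fs/inotify/max_user_watches
cat /proc/sys/fs/inotify/max_user_instances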

BenTheElder avatar Nov 08 '22 06:11 BenTheElder

I think this was some Kubernetes bug, and there's not much more for us to do here for now.

BenTheElder avatar Apr 18 '23 04:04 BenTheElder