cloud-provider-aws

AWS cloud controller manager is unable to manage the nodes in the cluster

karty-s opened this issue · 5 comments

What happened: We are running a Kubernetes 1.26 cluster on AWS, provisioned with kubeadm. We wanted to upgrade our clusters to 1.28 (1.26 -> 1.27 -> 1.28), and per the upgrade notes we migrated from the in-tree AWS cloud provider to the external AWS cloud provider. Following the upgrade process, we deployed new 1.27 nodes along with the AWS cloud controller manager in the cluster, and then scaled down the 1.26 nodes.

What you expected to happen: The scaled-down 1.26 etcd and worker nodes are removed from the cluster as expected, but the 1.26 control-plane nodes still show up in the node list even after their EC2 instances have been terminated. For example:

NAME                            STATUS                     ROLES                  AGE     VERSION
ip-.ec2.internal   Ready,SchedulingDisabled   control-plane,master   96m     v1.26.7
ip-.ec2.internal   Ready                      etcd                   11m     v1.27.13
ip-.ec2.internal   Ready                      etcd                   9m10s   v1.27.13
ip-.ec2.internal   Ready                      control-plane,master   5m59s   v1.27.13
ip-.ec2.internal   Ready,SchedulingDisabled   control-plane,master   95m     v1.26.7
ip-.ec2.internal    Ready                      node                   6m12s   v1.27.13
ip-.ec2.internal    Ready                      etcd                   14m     v1.27.13
ip-.ec2.internal    Ready                      control-plane,master   6m1s    v1.27.13
ip-.ec2.internal    Ready                      node                   6m9s    v1.27.13
ip-.ec2.internal    Ready                      node                   6m14s   v1.27.13
ip-.ec2.internal    Ready                      node                   6m15s   v1.27.13
ip-.ec2.internal    Ready,SchedulingDisabled   control-plane,master   96m     v1.26.7
ip-.ec2.internal    Ready                      node                   6m15s   v1.27.13
ip-.ec2.internal    Ready                      node                   6m15s   v1.27.13
ip-.ec2.internal    Ready                      control-plane,master   5m43s   v1.27.13
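For context on the deletion behavior seen above: the node lifecycle controller removes a Node object when the EC2 instance referenced by its spec.providerID can no longer be found in the cloud provider. The sketch below (a simplified Python illustration with hypothetical helper names, not the controller's actual Go code) shows that decision, assuming the standard AWS providerID format aws:///<availability-zone>/<instance-id>:

```python
def instance_id_from_provider_id(provider_id):
    """Extract the EC2 instance ID from a node's spec.providerID.

    AWS providerIDs look like: aws:///us-east-1a/i-0abc123def456
    (hypothetical parser for illustration only)
    """
    if not provider_id.startswith("aws://"):
        raise ValueError("not an AWS providerID: %s" % provider_id)
    return provider_id.rsplit("/", 1)[-1]

def should_delete_node(provider_id, live_instance_ids):
    """Mirror of the lifecycle decision: the controller deletes the Node
    when its instance is no longer present in the cloud provider."""
    return instance_id_from_provider_id(provider_id) not in live_instance_ids

# Example: one instance still running, one already terminated.
live = {"i-0abc123def456"}
print(should_delete_node("aws:///us-east-1a/i-0abc123def456", live))  # False
print(should_delete_node("aws:///us-east-1a/i-0dead000beef00", live))  # True
```

If the lingering 1.26 control-plane Nodes were registered without a providerID (the in-tree provider and the external one handle this differently), the controller may be unable to map them to an instance at all, which would explain why they are never cleaned up.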

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?: We are seeing this error in the cloud controller manager pod logs:

I0516 08:13:24.811572       1 node_lifecycle_controller.go:164] deleting node since it is no longer present in cloud provider: ip-10-230-13-35.ec2.internal
I0516 08:13:24.812083       1 event.go:307] "Event occurred" object="ip-10-230-13-35.ec2.internal" fieldPath="" kind="Node" apiVersion="" type="Normal" reason="DeletingNode" message="Deleting node ip-10-230-13-35.ec2.internal because it does not exist in the cloud provider"

We have set the hostname according to the prerequisites, but we still see this behavior.
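The prerequisite referred to here is that each node registers under its EC2 private DNS name (e.g. ip-10-230-13-35.ec2.internal, as in the logs above). A quick sanity check of that naming convention can be sketched as follows; note this regex only covers the us-east-1 ec2.internal suffix seen in our logs, and other regions use an ip-<dashed-ip>.<region>.compute.internal form instead:

```python
import re

# Matches us-east-1 EC2 private DNS names, e.g. ip-10-230-13-35.ec2.internal.
# Sketch only: other regions use a different suffix.
PRIVATE_DNS_RE = re.compile(r"^ip-(\d{1,3}-){3}\d{1,3}\.ec2\.internal$")

def looks_like_private_dns(node_name):
    """Return True if the node name follows the us-east-1 private DNS form."""
    return bool(PRIVATE_DNS_RE.match(node_name))

print(looks_like_private_dns("ip-10-230-13-35.ec2.internal"))  # True
print(looks_like_private_dns("master-1"))                      # False
```

A node whose name does not follow this form cannot be matched back to its EC2 instance by the cloud controller manager, which is the usual reason for the "does not exist in the cloud provider" deletion events.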

Environment: kubeadm

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.7", GitCommit:"84e1fc493a47446df2e155e70fca768d2653a398", GitTreeState:"clean", BuildDate:"2023-07-19T12:23:27Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
  • Cloud provider or hardware configuration: AWS

  • OS (e.g. from /etc/os-release):

NAME="Flatcar Container Linux by Kinvolk"
ID=flatcar
ID_LIKE=coreos
VERSION=3374.2.4
VERSION_ID=3374.2.4
BUILD_ID=2023-02-15-1824
SYSEXT_LEVEL=1.0
PRETTY_NAME="Flatcar Container Linux by Kinvolk 3374.2.4 (Oklo)"
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

/kind bug

karty-s · May 16 '24 13:05