aws-node-termination-handler pod is stuck in pending right after "kops rolling-update cluster --yes"
/kind bug
1. What kops version are you running? The command kops version will display this information.
1.30.1
2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.
v1.29.3
3. What cloud provider are you using? AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
$ kops upgrade cluster --name XXX --kubernetes-version 1.29.9 --yes
$ kops --name XXX update cluster --yes --admin
$ kops --name XXX rolling-update cluster --yes
5. What happened after the commands executed? The cluster did not pass validation at the very beginning of the rolling update:
$ kops rolling-update cluster --yes --name XXX
Detected single-control-plane cluster; won't detach before draining
NAME STATUS NEEDUPDATE READY MIN TARGET MAX NODES
control-plane-us-west-2c NeedsUpdate 1 0 1 1 1 1
nodes-us-west-2c NeedsUpdate 4 0 4 4 4 4
I1002 15:03:05.336312 37988 instancegroups.go:507] Validating the cluster.
I1002 15:03:29.806323 37988 instancegroups.go:566] Cluster did not pass validation, will retry in "30s": system-cluster-critical pod "aws-node-termination-handler-577f866468-mmlx7" is pending.
I1002 15:04:22.511826 37988 instancegroups.go:566] Cluster did not pass validation, will retry in "30s": system-cluster-critical pod "aws-node-termination-handler-577f866468-mmlx7" is pending.
[...]
I1002 15:18:58.830547 37988 instancegroups.go:563] Cluster did not pass validation within deadline: system-cluster-critical pod "aws-node-termination-handler-577f866468-mmlx7" is pending.
E1002 15:18:58.830585 37988 instancegroups.go:512] Cluster did not validate within 15m0s
Error: control-plane node not healthy after update, stopping rolling-update: "error validating cluster: cluster did not validate within a duration of \"15m0s\""
When I looked into why the pod was pending, I found the following in "kubectl describe pod aws-node-termination-handler-577f866468-mmlx7":
0/5 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/5 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 4 Preemption is not helpful for scheduling.
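My reading of this message (an assumption on my part, not confirmed from the manifests) is that the handler pod requests a host port and is pinned to control-plane nodes by node affinity, so on this single-control-plane cluster the replacement replica cannot be scheduled while the old replica still holds the port. The strategy and affinity can be checked with standard kubectl (Deployment name inferred from the pod names above):
$ kubectl -n kube-system get deploy aws-node-termination-handler -o jsonpath='{.spec.strategy}{"\n"}'
$ kubectl -n kube-system get deploy aws-node-termination-handler -o jsonpath='{.spec.template.spec.affinity}{"\n"}'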
There is another aws-node-termination-handler pod running at the moment (the old one):
$ kubectl -n kube-system get pods -l k8s-app=aws-node-termination-handler
NAME READY STATUS RESTARTS AGE
aws-node-termination-handler-577f866468-mmlx7 0/1 Pending 0 69m
aws-node-termination-handler-6c9c8d7948-fxsrl 1/1 Running 1338 (4h1m ago) 133d
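The differing hash suffixes indicate the two pods belong to two ReplicaSets of the same Deployment, i.e. a rolling update that never completed, which can be confirmed with the same label selector:
$ kubectl -n kube-system get rs -l k8s-app=aws-node-termination-handler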
6. What did you expect to happen?
I expected the cluster to be upgraded to Kubernetes 1.29.9.
7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2023-07-05T02:16:44Z"
  generation: 9
  name: YYYY
spec:
  api:
    dns: {}
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://XXXX/YYYY
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: control-plane-us-west-2c
      name: c
    manager:
      backupRetentionDays: 90
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: control-plane-us-west-2c
      name: c
    manager:
      backupRetentionDays: 90
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - X.X.X.X/24
  kubernetesVersion: 1.29.9
  masterPublicName: api.YYYY
  networkCIDR: 172.22.0.0/16
  networking:
    calico: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - X.X.X.X/24
  subnets:
  - cidr: 172.22.32.0/19
    name: us-west-2c
    type: Public
    zone: us-west-2c
  topology:
    dns:
      type: Public
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2023-07-05T02:16:48Z"
  generation: 5
  labels:
    kops.k8s.io/cluster: YYYY
  name: control-plane-us-west-2c
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20240607
  instanceMetadata:
    httpPutResponseHopLimit: 3
    httpTokens: required
  machineType: t3a.medium
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - us-west-2c
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2023-07-05T02:16:49Z"
  generation: 7
  labels:
    kops.k8s.io/cluster: YYYY
  name: nodes-us-west-2c
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20240607
  instanceMetadata:
    httpPutResponseHopLimit: 1
    httpTokens: required
  machineType: t3a.xlarge
  maxSize: 4
  minSize: 4
  role: Node
  subnets:
  - us-west-2c
8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.
Please see the validation log above.
9. Anything else we need to know? I would now like to know how to recover from this situation and how to get rid of the aws-node-termination-handler-577f866468-mmlx7 pod, which is left in the Pending state.
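If the root cause is indeed a host-port conflict on the single control-plane node, one workaround I can think of (untested, and assuming the Deployment controller then lets the pending replica bind the port) would be to delete the old running pod and retry the rolling update:
$ kubectl -n kube-system delete pod aws-node-termination-handler-6c9c8d7948-fxsrl
$ kops rolling-update cluster --yes --name XXX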