aws-node-termination-handler pod is stuck in pending right after "kops rolling-update cluster --yes"
/kind bug
1. What kops version are you running? The command kops version will display this information.
1.30.1
2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.
v1.29.3
3. What cloud provider are you using? AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
$ kops upgrade cluster --name XXX --kubernetes-version 1.29.9 --yes
$ kops --name XXX update cluster --yes --admin
$ kops --name XXX rolling-update cluster --yes
5. What happened after the commands executed? The cluster did not pass validation at the very beginning of the rolling update:
$ kops rolling-update cluster --yes --name XXX
Detected single-control-plane cluster; won't detach before draining
NAME STATUS NEEDUPDATE READY MIN TARGET MAX NODES
control-plane-us-west-2c NeedsUpdate 1 0 1 1 1 1
nodes-us-west-2c NeedsUpdate 4 0 4 4 4 4
I1002 15:03:05.336312 37988 instancegroups.go:507] Validating the cluster.
I1002 15:03:29.806323 37988 instancegroups.go:566] Cluster did not pass validation, will retry in "30s": system-cluster-critical pod "aws-node-termination-handler-577f866468-mmlx7" is pending.
I1002 15:04:22.511826 37988 instancegroups.go:566] Cluster did not pass validation, will retry in "30s": system-cluster-critical pod "aws-node-termination-handler-577f866468-mmlx7" is pending.
[...]
I1002 15:18:58.830547 37988 instancegroups.go:563] Cluster did not pass validation within deadline: system-cluster-critical pod "aws-node-termination-handler-577f866468-mmlx7" is pending.
E1002 15:18:58.830585 37988 instancegroups.go:512] Cluster did not validate within 15m0s
Error: control-plane node not healthy after update, stopping rolling-update: "error validating cluster: cluster did not validate within a duration of \"15m0s\""
When I looked into why the pod was pending, I found the following in "kubectl describe pod aws-node-termination-handler-577f866468-mmlx7":
0/5 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/5 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 4 Preemption is not helpful for scheduling.
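My reading of this message (an assumption on my part, not confirmed from the manifests) is that the handler pod requests a host port and is pinned to control-plane nodes by node affinity, so on this single-control-plane cluster the replacement replica cannot be scheduled while the old replica still holds the port. The strategy and affinity can be checked with standard kubectl (Deployment name inferred from the pod names above):
$ kubectl -n kube-system get deploy aws-node-termination-handler -o jsonpath='{.spec.strategy}{"\n"}'
$ kubectl -n kube-system get deploy aws-node-termination-handler -o jsonpath='{.spec.template.spec.affinity}{"\n"}'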
There is another aws-node-termination-handler pod running at the moment (the old one):
$ kubectl -n kube-system get pods -l k8s-app=aws-node-termination-handler
NAME READY STATUS RESTARTS AGE
aws-node-termination-handler-577f866468-mmlx7 0/1 Pending 0 69m
aws-node-termination-handler-6c9c8d7948-fxsrl 1/1 Running 1338 (4h1m ago) 133d
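The differing hash suffixes indicate the two pods belong to two ReplicaSets of the same Deployment, i.e. a rolling update that never completed, which can be confirmed with the same label selector:
$ kubectl -n kube-system get rs -l k8s-app=aws-node-termination-handler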
6. What did you expect to happen?
I expected the cluster to be upgraded to Kubernetes 1.29.9.
7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2023-07-05T02:16:44Z"
  generation: 9
  name: YYYY
spec:
  api:
    dns: {}
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://XXXX/YYYY
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: control-plane-us-west-2c
      name: c
    manager:
      backupRetentionDays: 90
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: control-plane-us-west-2c
      name: c
    manager:
      backupRetentionDays: 90
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - X.X.X.X/24
  kubernetesVersion: 1.29.9
  masterPublicName: api.YYYY
  networkCIDR: 172.22.0.0/16
  networking:
    calico: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - X.X.X.X/24
  subnets:
  - cidr: 172.22.32.0/19
    name: us-west-2c
    type: Public
    zone: us-west-2c
  topology:
    dns:
      type: Public
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2023-07-05T02:16:48Z"
  generation: 5
  labels:
    kops.k8s.io/cluster: YYYY
  name: control-plane-us-west-2c
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20240607
  instanceMetadata:
    httpPutResponseHopLimit: 3
    httpTokens: required
  machineType: t3a.medium
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - us-west-2c
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2023-07-05T02:16:49Z"
  generation: 7
  labels:
    kops.k8s.io/cluster: YYYY
  name: nodes-us-west-2c
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20240607
  instanceMetadata:
    httpPutResponseHopLimit: 1
    httpTokens: required
  machineType: t3a.xlarge
  maxSize: 4
  minSize: 4
  role: Node
  subnets:
  - us-west-2c
8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.
Please see the validation log above.
9. Anything else we need to know? I would now like to know how to recover from this situation and how to get rid of the aws-node-termination-handler-577f866468-mmlx7 pod, which is left in the Pending state.
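If the root cause is indeed a host-port conflict on the single control-plane node, one workaround I can think of (untested, and assuming the Deployment controller then lets the pending replica bind the port) would be to delete the old running pod and retry the rolling update:
$ kubectl -n kube-system delete pod aws-node-termination-handler-6c9c8d7948-fxsrl
$ kops rolling-update cluster --yes --name XXX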