
After rebooting the master and the nodes the cluster is unavailable

IgalSc opened this issue 3 years ago · 0 comments

/kind bug

1. What kops version are you running?

Version 1.23.2

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.0", GitCommit:"4ce5a8954017644c5420bae81d72b09b735c21f0", GitTreeState:"clean", BuildDate:"2022-05-03T13:46:05Z", GoVersion:"go1.18.1", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.8", GitCommit:"5575935422cc1cf5169dfc8847cb587aa47bac5a", GitTreeState:"clean", BuildDate:"2021-06-16T12:53:07Z", GoVersion:"go1.15.13", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using? AWS

4. What commands did you run? What is the simplest way to reproduce this issue? We had a failure in communication with the cluster, so I rebooted the master and the nodes manually through the AWS console.
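(For reference, a scripted equivalent of that manual console reboot, assuming the AWS CLI is configured; the instance IDs here are the ones appearing in the validation output in this report — substitute your own:)

```shell
# Reboot the affected instances in one call instead of clicking
# through the console (requires ec2:RebootInstances permission).
aws ec2 reboot-instances \
  --instance-ids i-029b1bfc1c758ee50 i-0695138ee08611f03 i-0bc04e7e662f76c07
```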

5. What happened after the commands executed? Cluster became unavailable

kops validate cluster
Validating cluster zonetv.dev1.k8s.local

INSTANCE GROUPS
NAME                    ROLE    MACHINETYPE     MIN     MAX     SUBNETS
master-us-east-1a       Master  t3a.small       1       1       us-east-1a
nodes                   Node    t3.medium       3       20      us-east-1a

NODE STATUS
NAME                            ROLE    READY
ip-172-30-5-231.ec2.internal    master  True

VALIDATION ERRORS
KIND    NAME                                                    MESSAGE
Machine i-029b1bfc1c758ee50                                     machine "i-029b1bfc1c758ee50" has not yet joined cluster
Machine i-0695138ee08611f03                                     machine "i-0695138ee08611f03" has not yet joined cluster
Machine i-0bc04e7e662f76c07                                     machine "i-0bc04e7e662f76c07" has not yet joined cluster
Pod     kube-system/coredns-6d467c5d59-4qsfs                    system-cluster-critical pod "coredns-6d467c5d59-4qsfs" is pending
Pod     kube-system/coredns-6d467c5d59-tftrx                    system-cluster-critical pod "coredns-6d467c5d59-tftrx" is pending
Pod     kube-system/coredns-autoscaler-5c7694cfcc-wj98s         system-cluster-critical pod "coredns-autoscaler-5c7694cfcc-wj98s" is pending
Pod     kube-system/kops-controller-fwzj8                       system-cluster-critical pod "kops-controller-fwzj8" is not ready (kops-controller)
Pod     kube-system/kube-dns-67689f84b-qm9d2                    system-cluster-critical pod "kube-dns-67689f84b-qm9d2" is pending
Pod     kube-system/kube-dns-67689f84b-xddcn                    system-cluster-critical pod "kube-dns-67689f84b-xddcn" is pending
Pod     kube-system/kube-dns-autoscaler-5b55dbf76d-d46fl        system-cluster-critical pod "kube-dns-autoscaler-5b55dbf76d-d46fl" is pending

Validation Failed

6. What did you expect to happen? I expected the cluster to recover, as it has on a couple of similar occasions.

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2020-04-08T09:55:55Z"
  generation: 5
  name: zonetv.dev1.k8s.local
spec:
  additionalNetworkCIDRs:
  - 2600:a:b:c::/56
  api:
    loadBalancer:
      class: Classic
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://zonetv-imesh-kops-state-store/zonetv.dev1.k8s.local
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.20.8
  masterInternalName: api.internal.zonetv.dev1.k8s.local
  masterPublicName: api.zonetv.dev1.k8s.local
  networkCIDR: 172.30.0.0/16
  networkID: vpc-b60475d3
  networking:
    kubenet: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 172.30.5.0/24
    name: us-east-1a
    type: Public
    zone: us-east-1a
  - cidr: 2600:a:b:d::/64
    name: us-east-1a1
    type: Public
    zone: us-east-1a
  topology:
    dns:
      type: Public
    masters: public
    nodes: public

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-04-08T09:55:55Z"
  generation: 9
  labels:
    kops.k8s.io/cluster: zonetv.dev1.k8s.local
  name: master-us-east-1a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210503
  machineType: t3a.small
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-east-1a
  role: Master
  subnets:
  - us-east-1a

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-04-08T09:55:55Z"
  generation: 7
  labels:
    kops.k8s.io/cluster: zonetv.dev1.k8s.local
  name: nodes
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210503
  machineType: t3.medium
  maxSize: 20
  minSize: 3
  nodeLabels:
    kops.k8s.io/instancegroup: nodes
  role: Node
  subnets:
  - us-east-1a

9. Anything else do we need to know? kops-controller crashes

kubectl get pods -n kube-system
NAME                                                   READY   STATUS             RESTARTS   AGE
coredns-6d467c5d59-4qsfs                               0/1     Pending            0          17h
coredns-6d467c5d59-tftrx                               0/1     Pending            0          17h
coredns-autoscaler-5c7694cfcc-wj98s                    0/1     Pending            0          17h
dns-controller-5954d849fc-2mgzz                        1/1     Running            0          15h
etcd-manager-events-ip-172-30-5-231.ec2.internal       1/1     Running            0          15h
etcd-manager-main-ip-172-30-5-231.ec2.internal         1/1     Running            0          15h
kops-controller-fwzj8                                  0/1     CrashLoopBackOff   185        15h
kube-apiserver-ip-172-30-5-231.ec2.internal            2/2     Running            0          15h
kube-controller-manager-ip-172-30-5-231.ec2.internal   1/1     Running            0          15h
kube-dns-67689f84b-qm9d2                               0/3     Pending            0          17h
kube-dns-67689f84b-xddcn                               0/3     Pending            0          17h
kube-dns-autoscaler-5b55dbf76d-d46fl                   0/1     Pending            0          17h
kube-proxy-ip-172-30-5-231.ec2.internal                1/1     Running            0          15h
kube-scheduler-ip-172-30-5-231.ec2.internal            1/1     Running            0          15h
metrics-server-78f4f48675-vzjmz                        0/1     Pending            0          17h
kubectl logs kops-controller-fwzj8 -n kube-system
E1007 16:28:06.089598       1 logr.go:265] setup "msg"="unable to start server" "error"="reading \"kubernetes-ca\" certificate: open /etc/kubernetes/kops-controller/pki/kubernetes-ca.crt: no such file or directory"
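The error says kops-controller cannot read its CA certificate from disk. A minimal sketch for confirming which files are actually missing on the control-plane node (the directory path is taken from the error message above; checking for a `.key` alongside the `.crt` is an assumption, since the error itself only mentions `kubernetes-ca.crt`):

```shell
# Check for the CA material kops-controller failed to read.
# Run on the control-plane node (e.g. over SSH).
PKI_DIR="/etc/kubernetes/kops-controller/pki"
missing=""
for f in kubernetes-ca.crt kubernetes-ca.key; do
  # Record any expected file that is absent from the PKI directory.
  [ -e "$PKI_DIR/$f" ] || missing="$missing $f"
done
echo "missing:$missing"
```

If files are missing there, the keypair normally also exists in the kops state store under the cluster's `configBase` (from the manifest above), e.g. `aws s3 ls s3://zonetv-imesh-kops-state-store/zonetv.dev1.k8s.local/pki/` — though the exact layout under `pki/` varies by kops version.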

I tried following the advice from https://github.com/kubernetes/kops/issues/10704: applying the k8s-1.16.yaml file from the S3 state store, followed by running kops delete instance instance-id --yes --cloudonly, but with no success.

IgalSc · Oct 07 '22 16:10