After rebooting the master and the nodes, the cluster is unavailable
/kind bug
**1. What kops version are you running?**

```
Version 1.23.2
```
**2. What Kubernetes version are you running?** `kubectl version` will print the version if a cluster is running, or provide the Kubernetes version specified as a kops flag.

```
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.0", GitCommit:"4ce5a8954017644c5420bae81d72b09b735c21f0", GitTreeState:"clean", BuildDate:"2022-05-03T13:46:05Z", GoVersion:"go1.18.1", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.8", GitCommit:"5575935422cc1cf5169dfc8847cb587aa47bac5a", GitTreeState:"clean", BuildDate:"2021-06-16T12:53:07Z", GoVersion:"go1.15.13", Compiler:"gc", Platform:"linux/amd64"}
```
**3. What cloud provider are you using?**

AWS

**4. What commands did you run? What is the simplest way to reproduce this issue?**

We had a failure in communication with the cluster, so I rebooted the master and the nodes manually through the AWS console.
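For reference, the manual console reboot is roughly equivalent to the following AWS CLI calls. This is a sketch: the master instance ID below is a placeholder (it does not appear in this report); the three worker IDs are the instances listed in the validation errors further down.

```shell
# Sketch: reboot the control-plane and worker instances via the AWS CLI
# instead of the console. The master ID is a placeholder.

# Reboot the master instance (placeholder ID)
aws ec2 reboot-instances --instance-ids i-0123456789abcdef0

# Reboot the worker nodes (IDs taken from the validation output below)
aws ec2 reboot-instances --instance-ids \
  i-029b1bfc1c758ee50 i-0695138ee08611f03 i-0bc04e7e662f76c07
```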
**5. What happened after the commands executed?**

The cluster became unavailable.

```
kops validate cluster

Validating cluster zonetv.dev1.k8s.local

INSTANCE GROUPS
NAME               ROLE    MACHINETYPE  MIN  MAX  SUBNETS
master-us-east-1a  Master  t3a.small    1    1    us-east-1a
nodes              Node    t3.medium    3    20   us-east-1a

NODE STATUS
NAME                          ROLE    READY
ip-172-30-5-231.ec2.internal  master  True

VALIDATION ERRORS
KIND     NAME                                                  MESSAGE
Machine  i-029b1bfc1c758ee50                                   machine "i-029b1bfc1c758ee50" has not yet joined cluster
Machine  i-0695138ee08611f03                                   machine "i-0695138ee08611f03" has not yet joined cluster
Machine  i-0bc04e7e662f76c07                                   machine "i-0bc04e7e662f76c07" has not yet joined cluster
Pod      kube-system/coredns-6d467c5d59-4qsfs                  system-cluster-critical pod "coredns-6d467c5d59-4qsfs" is pending
Pod      kube-system/coredns-6d467c5d59-tftrx                  system-cluster-critical pod "coredns-6d467c5d59-tftrx" is pending
Pod      kube-system/coredns-autoscaler-5c7694cfcc-wj98s       system-cluster-critical pod "coredns-autoscaler-5c7694cfcc-wj98s" is pending
Pod      kube-system/kops-controller-fwzj8                     system-cluster-critical pod "kops-controller-fwzj8" is not ready (kops-controller)
Pod      kube-system/kube-dns-67689f84b-qm9d2                  system-cluster-critical pod "kube-dns-67689f84b-qm9d2" is pending
Pod      kube-system/kube-dns-67689f84b-xddcn                  system-cluster-critical pod "kube-dns-67689f84b-xddcn" is pending
Pod      kube-system/kube-dns-autoscaler-5b55dbf76d-d46fl      system-cluster-critical pod "kube-dns-autoscaler-5b55dbf76d-d46fl" is pending

Validation Failed
```
**6. What did you expect to happen?**

The cluster recovers, as it did on a couple of similar occasions.
**7. Please provide your cluster manifest.** Execute `kops get --name my.example.com -o yaml` to display your cluster manifest. You may want to remove your cluster name and other sensitive information.
```yaml
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2020-04-08T09:55:55Z"
  generation: 5
  name: zonetv.dev1.k8s.local
spec:
  additionalNetworkCIDRs:
  - 2600:a:b:c::/56
  api:
    loadBalancer:
      class: Classic
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://zonetv-imesh-kops-state-store/zonetv.dev1.k8s.local
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.20.8
  masterInternalName: api.internal.zonetv.dev1.k8s.local
  masterPublicName: api.zonetv.dev1.k8s.local
  networkCIDR: 172.30.0.0/16
  networkID: vpc-b60475d3
  networking:
    kubenet: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 172.30.5.0/24
    name: us-east-1a
    type: Public
    zone: us-east-1a
  - cidr: 2600:a:b:d::/64
    name: us-east-1a1
    type: Public
    zone: us-east-1a
  topology:
    dns:
      type: Public
    masters: public
    nodes: public
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-04-08T09:55:55Z"
  generation: 9
  labels:
    kops.k8s.io/cluster: zonetv.dev1.k8s.local
  name: master-us-east-1a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210503
  machineType: t3a.small
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-east-1a
  role: Master
  subnets:
  - us-east-1a
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-04-08T09:55:55Z"
  generation: 7
  labels:
    kops.k8s.io/cluster: zonetv.dev1.k8s.local
  name: nodes
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210503
  machineType: t3.medium
  maxSize: 20
  minSize: 3
  nodeLabels:
    kops.k8s.io/instancegroup: nodes
  role: Node
  subnets:
  - us-east-1a
```
**9. Anything else do we need to know?**

The kops-controller pod crashes:
```
kubectl get pods -n kube-system

NAME                                                   READY   STATUS             RESTARTS   AGE
coredns-6d467c5d59-4qsfs                               0/1     Pending            0          17h
coredns-6d467c5d59-tftrx                               0/1     Pending            0          17h
coredns-autoscaler-5c7694cfcc-wj98s                    0/1     Pending            0          17h
dns-controller-5954d849fc-2mgzz                        1/1     Running            0          15h
etcd-manager-events-ip-172-30-5-231.ec2.internal       1/1     Running            0          15h
etcd-manager-main-ip-172-30-5-231.ec2.internal         1/1     Running            0          15h
kops-controller-fwzj8                                  0/1     CrashLoopBackOff   185        15h
kube-apiserver-ip-172-30-5-231.ec2.internal            2/2     Running            0          15h
kube-controller-manager-ip-172-30-5-231.ec2.internal   1/1     Running            0          15h
kube-dns-67689f84b-qm9d2                               0/3     Pending            0          17h
kube-dns-67689f84b-xddcn                               0/3     Pending            0          17h
kube-dns-autoscaler-5b55dbf76d-d46fl                   0/1     Pending            0          17h
kube-proxy-ip-172-30-5-231.ec2.internal                1/1     Running            0          15h
kube-scheduler-ip-172-30-5-231.ec2.internal            1/1     Running            0          15h
metrics-server-78f4f48675-vzjmz                        0/1     Pending            0          17h
```
```
kubectl logs kops-controller-fwzj8 -n kube-system

E1007 16:28:06.089598       1 logr.go:265] setup "msg"="unable to start server" "error"="reading \"kubernetes-ca\" certificate: open /etc/kubernetes/kops-controller/pki/kubernetes-ca.crt: no such file or directory"
```
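The error indicates kops-controller cannot find a `kubernetes-ca` certificate at the path it expects. A possible way to confirm what is actually on disk (a diagnostic sketch, assuming SSH access to the master; the path comes from the log line above):

```shell
# Sketch: inspect the PKI directory that kops-controller complains about.
# Run on the master node; may require root.
ls -l /etc/kubernetes/kops-controller/pki/

# The log expects a file named kubernetes-ca.crt; if the directory only
# contains differently named CA files, the controller cannot start.
```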
I tried following the advice from https://github.com/kubernetes/kops/issues/10704: applying the `k8s-1.16.yaml` file from the S3 state store, followed by running `kops delete instance instance-id --yes --cloudonly`, but with no success.
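For completeness, the recovery attempt described above corresponds roughly to the following commands. This is a sketch: the bucket path is taken from the manifest's `configBase`, but the exact S3 key of the `k8s-1.16.yaml` addon manifest is not stated in this report, so `<addon-dir>` is a placeholder, and the instance ID shown is one of the machines from the validation errors.

```shell
# Sketch of the attempted recovery steps (per kubernetes/kops#10704).
# <addon-dir> is a placeholder for the actual addon directory in the state store.

# 1. Fetch and apply the addon manifest from the kops state store
aws s3 cp s3://zonetv-imesh-kops-state-store/zonetv.dev1.k8s.local/addons/<addon-dir>/k8s-1.16.yaml ./k8s-1.16.yaml
kubectl apply -f ./k8s-1.16.yaml

# 2. Delete a stuck instance so the autoscaling group replaces it
kops delete instance i-029b1bfc1c758ee50 --yes --cloudonly
```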