EC2 instances are not showing in the kubectl node list
/kind bug
1. What kops version are you running? The command kops version will display this information.
Version 1.11.1
2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.
1.11.8
3. What cloud provider are you using? AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
export KOPS_STATE_STORE=s3://....
export KOPS_CLUSTER_NAME=stage.cluster.example.com
export KOPS_RUN_OBSOLETE_VERSION=true
kops update cluster
5. What happened after the commands executed?
Initially I was not able to see the EC2 instances in the list of nodes when running kubectl get nodes.
I then decided to test changing the minSize and maxSize parameters in the kops configuration and to run kops update cluster. After running this command, I got this error:
W0602 17:16:48.350262 1505682 launchconfiguration.go:197] unable to resolve image: "kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2021-02-05": could not find Image for "kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2021-02-05"
Then I updated the image to ami-0c259a97cbf621daf (since I read that kope.io images are no longer available). This is Ubuntu 18.04, which I found at https://cloud-images.ubuntu.com/locator/ec2/.
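For reference, this is roughly how the image was changed (a sketch; the instance group name nodes is an assumption, other groups were edited the same way):
# Sketch: point an instance group at the new AMI (group name assumed)
kops edit ig nodes --name $KOPS_CLUSTER_NAME
# in the editor, set:
#   spec:
#     image: ami-0c259a97cbf621daf
kops update cluster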
After that, the cluster was updated and I got the following at the end:
kops has set your kubectl context to stage.cluster.example.com
Cluster changes have been applied to the cloud.
Changes may require instances to restart: kops rolling-update cluster
After that, when I run kubectl get nodes, I do not see those nodes (the EC2 instances are created as part of the AutoScalingGroup).
6. What did you expect to happen?
I expect to see those nodes when I run kubectl get nodes (the EC2 instances are created as part of the AutoScalingGroup).
7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.
8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.
9. Anything else we need to know?
I suggest you take a look at https://kops.sigs.k8s.io/operations/troubleshoot/ and maybe find out what the issue is from the instance logs.
@hakman Thanks for the reply. I am not able to ssh from the bastion to the nodes at this point, after running the rolling-update command :/
This is what I can see when I run kops validate cluster:
$ kops validate cluster
Validating cluster staging.k8s.example.com
unexpected error during validation: error listing nodes: Get https://api.staging.k8s.example.com/api/v1/nodes: EOF
Also, I can see this
$ curl https://api.staging.k8s.example.com/api/v1/nodes
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to api.staging.k8s.example.com:443
So it seems like this got broken during the rolling update.
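A raw TLS handshake check against the API endpoint might narrow this down (a sketch; plain openssl, nothing kops-specific):
# Check whether the API server completes a TLS handshake at all
openssl s_client -connect api.staging.k8s.example.com:443 \
  -servername api.staging.k8s.example.com </dev/null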
@bobanda87 You chose an Ubuntu 18.04 image that was experimental at the time of kOps 1.11. You may have more luck with Debian Stretch from https://wiki.debian.org/Cloud/AmazonEC2Image/Stretch.
@hakman Thanks for the reply! My image previously was kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17 (exactly as you said, Debian Stretch), but this source is not available anymore. Can you recommend a specific image that could replace it? (eu-west-1 region, if that matters)
That image was a kope.io image based on Debian Stretch, not the official Debian one.
I cannot recommend any image; I can only suggest you look at https://wiki.debian.org/Cloud/AmazonEC2Image/Stretch. kOps 1.11 is ancient and I doubt there are many other people still using it.
@hakman Thanks for the suggestion! I have tried several AMIs from that list, but they did not give me anything different from what I had at the starting point (e.g. ami-01f43da22ee0fbf95).
Can you help me figure out the next kops version that would not be complicated to update to? Are there any step-by-step guides for it?
Do you know where I can find working AMIs for the kops version that you suggest I upgrade to? Will it be compatible with kubectl version 1.11?
This is what I see when I validate the cluster
$ kops validate cluster
Validating cluster staging.k8s.example.com
INSTANCE GROUPS
NAME ROLE MACHINETYPE MIN MAX SUBNETS
bastions Bastion t2.micro 1 1 utility-eu-west-1a,utility-eu-west-1b,utility-eu-west-1c
ci Node t3.large 1 6 eu-west-1a,eu-west-1b,eu-west-1c
master-eu-west-1a Master t3.medium 1 1 eu-west-1a
master-eu-west-1b Master t3.medium 1 1 eu-west-1b
master-eu-west-1c Master t3.medium 1 1 eu-west-1c
nodes Node t3.large 1 8 eu-west-1a,eu-west-1b,eu-west-1c
NODE STATUS
NAME ROLE READY
ip-172-20-118-39.eu-west-1.compute.internal master True
ip-172-20-48-246.eu-west-1.compute.internal master True
ip-172-20-70-32.eu-west-1.compute.internal master True
VALIDATION ERRORS
KIND NAME MESSAGE
Machine i-02794359d48b221d9 machine "i-02794359d48b221d9" has not yet joined cluster
Machine i-03da8ef9491cf4c68 machine "i-03da8ef9491cf4c68" has not yet joined cluster
Pod kube-system/calico-node-88j2n kube-system pod "calico-node-88j2n" is not healthy
Pod kube-system/tiller-deploy-8cdf857fb-67s6h kube-system pod "tiller-deploy-8cdf857fb-67s6h" is not healthy
Validation Failed
Try setting the image as 379101102735/debian-stretch-hvm-x86_64-gp2-2022-07-01-66430.
If that doesn't work, the only helpful info is to ssh in and check the kops-configuration service log.
There is no easy upgrade to a newer version of kOps after 5 years, sorry :).
kOps 1.19 is the newest version of kOps that supports k8s 1.11 and 1.12. But you're probably better off standing up a new cluster and moving your workloads.
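For reference, that owner/name alias can be resolved to a concrete AMI in eu-west-1 with something like this (a sketch; assumes the AWS CLI is configured):
# Resolve the suggested owner/name image alias to an actual AMI ID in eu-west-1
aws ec2 describe-images --owners 379101102735 \
  --filters "Name=name,Values=debian-stretch-hvm-x86_64-gp2-2022-07-01-66430" \
  --region eu-west-1 --query 'Images[].{Id:ImageId,Name:Name}'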
So, it looks pretty similar with the 379101102735/debian-stretch-hvm-x86_64-gp2-2022-07-01-66430 AMI, unfortunately :(
$ kops rolling-update cluster --yes
NAME STATUS NEEDUPDATE READY MIN MAX NODES
bastions Ready 0 1 1 1 0
ci NeedsUpdate 1 0 1 6 0
master-eu-west-1a Ready 0 1 1 1 1
master-eu-west-1b Ready 0 1 1 1 1
master-eu-west-1c Ready 0 1 1 1 1
nodes Ready 0 1 1 8 0
W0606 20:13:59.578007 1694641 instancegroups.go:175] Skipping drain of instance "i-02794359d48b221d9", because it is not registered in kubernetes
W0606 20:13:59.578053 1694641 instancegroups.go:183] no kubernetes Node associated with i-02794359d48b221d9, skipping node deletion
I0606 20:13:59.578075 1694641 instancegroups.go:301] Stopping instance "i-02794359d48b221d9", in group "ci.staging.k8s.example.com" (this may take a while).
I0606 20:13:59.888383 1694641 instancegroups.go:198] waiting for 4m0s after terminating instance
I0606 20:17:59.888527 1694641 instancegroups.go:209] Validating the cluster.
I0606 20:18:01.708379 1694641 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-0decd0372bdf0529e" has not yet joined cluster.
I0606 20:18:32.555451 1694641 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-0decd0372bdf0529e" has not yet joined cluster.
I0606 20:19:02.845418 1694641 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-0decd0372bdf0529e" has not yet joined cluster.
I0606 20:19:32.599475 1694641 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-0decd0372bdf0529e" has not yet joined cluster.
I0606 20:20:03.294402 1694641 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-0decd0372bdf0529e" has not yet joined cluster.
I0606 20:20:32.624184 1694641 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-0235e366bde060290" has not yet joined cluster.
I0606 20:21:02.605659 1694641 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-0235e366bde060290" has not yet joined cluster.
I0606 20:21:32.899162 1694641 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-0235e366bde060290" has not yet joined cluster.
I0606 20:22:02.576949 1694641 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-0ebdd3f99f7aee0d2" has not yet joined cluster.
I0606 20:22:32.757390 1694641 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-0235e366bde060290" has not yet joined cluster.
E0606 20:23:01.708545 1694641 instancegroups.go:214] Cluster did not validate within 5m0s
error validating cluster after removing a node: cluster did not validate within a duration of "5m0s"
$ kops validate cluster
Validating cluster staging.k8s.example.com
INSTANCE GROUPS
NAME ROLE MACHINETYPE MIN MAX SUBNETS
bastions Bastion t2.micro 1 1 utility-eu-west-1a,utility-eu-west-1b,utility-eu-west-1c
ci Node t3.large 1 6 eu-west-1a,eu-west-1b,eu-west-1c
master-eu-west-1a Master t3.medium 1 1 eu-west-1a
master-eu-west-1b Master t3.medium 1 1 eu-west-1b
master-eu-west-1c Master t3.medium 1 1 eu-west-1c
nodes Node t3.large 1 8 eu-west-1a,eu-west-1b,eu-west-1c
NODE STATUS
NAME ROLE READY
ip-172-20-118-39.eu-west-1.compute.internal master True
ip-172-20-48-246.eu-west-1.compute.internal master True
ip-172-20-70-32.eu-west-1.compute.internal master True
VALIDATION ERRORS
KIND NAME MESSAGE
Machine i-0235e366bde060290 machine "i-0235e366bde060290" has not yet joined cluster
Machine i-0ebdd3f99f7aee0d2 machine "i-0ebdd3f99f7aee0d2" has not yet joined cluster
Pod kube-system/tiller-deploy-8cdf857fb-67s6h kube-system pod "tiller-deploy-8cdf857fb-67s6h" is not healthy
Validation Failed
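One more thing I could check is the EC2 console output for one of the instances that has not joined (a sketch; assumes the AWS CLI has access to the account):
# Pull the serial console output for an instance that never joined the cluster
aws ec2 get-console-output --instance-id i-0235e366bde060290 \
  --region eu-west-1 --output text | tail -n 100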
the only helpful info is to ssh in and check the kops-configuration service log.
Can you be a bit more specific about where to ssh and how to check this? (Apologies if it is a dumb question, but I inherited this setup and am not fully familiar with kops :) )
You would ssh into the instance that isn't joining.
https://kops.sigs.k8s.io/operations/troubleshoot/
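Roughly like this (a sketch; the bastion hostname and the admin user are assumptions based on a typical kops Debian setup):
# Hop through the bastion to the private IP of the instance that isn't joining
ssh -A admin@bastion.staging.k8s.example.com
ssh admin@<private-ip-of-the-instance>
# On the node, check why nodeup failed to configure it
sudo journalctl -u kops-configuration --no-pager | tail -n 200
sudo journalctl -u kubelet --no-pager | tail -n 200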
This is what I can see for the tiller-deploy pod:
Name: tiller-deploy-8cdf857fb-68ckj
Namespace: kube-system
Priority: 0
PriorityClassName: <none>
Node: ip-172-20-70-32.eu-west-1.compute.internal/172.20.70.32
Start Time: Wed, 07 Jun 2023 13:07:36 +0300
Labels: app=helm
name=tiller
pod-template-hash=478941396
Annotations: <none>
Status: Pending
IP: 100.105.54.165
Controlled By: ReplicaSet/tiller-deploy-8cdf857fb
Init Containers:
init-tiller:
Container ID: docker://2b7a3ad25aa43a3759254cd54d755494f8967aa6873fb359821094354dcfea11
Image: eu.gcr.io/kyma-project/alpine-net:0.2.74
Image ID: docker-pullable://eu.gcr.io/kyma-project/alpine-net@sha256:febf714b1d6ff93331406ed972c038e61e393137d923ba0f1d7cdbd1a16fda24
Port: <none>
Host Port: <none>
Command:
sh
-c
until nc -zv kube-dns.kube-system.svc.cluster.local 53; do echo waiting for k8s readiness; sleep 2; done;
State: Running
Started: Wed, 07 Jun 2023 13:20:02 +0300
Ready: False
Restart Count: 1
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from tiller-token-6bksf (ro)
Containers:
tiller:
Container ID:
Image: ghcr.io/helm/tiller:v2.13.0
Image ID:
Port: 44134/TCP
Host Port: 0/TCP
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Liveness: http-get http://:44135/liveness delay=1s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:44135/readiness delay=1s timeout=1s period=10s #success=1 #failure=3
Environment:
TILLER_NAMESPACE: kube-system
TILLER_HISTORY_MAX: 20
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from tiller-token-6bksf (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
tiller-token-6bksf:
Type: Secret (a volume populated by a Secret)
SecretName: tiller-token-6bksf
Optional: false
QoS Class: BestEffort
Node-Selectors: nodePool=master
Tolerations: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 16m default-scheduler Successfully assigned kube-system/tiller-deploy-8cdf857fb-68ckj to ip-172-20-70-32.eu-west-1.compute.internal
Normal Pulling 15m kubelet, ip-172-20-70-32.eu-west-1.compute.internal pulling image "eu.gcr.io/kyma-project/alpine-net:0.2.74"
Normal Pulled 15m kubelet, ip-172-20-70-32.eu-west-1.compute.internal Successfully pulled image "eu.gcr.io/kyma-project/alpine-net:0.2.74"
Normal Created 15m kubelet, ip-172-20-70-32.eu-west-1.compute.internal Created container
Normal Started 15m kubelet, ip-172-20-70-32.eu-west-1.compute.internal Started container
Normal SandboxChanged 3m kubelet, ip-172-20-70-32.eu-west-1.compute.internal Pod sandbox changed, it will be killed and re-created.
Normal Pulled 3m kubelet, ip-172-20-70-32.eu-west-1.compute.internal Container image "eu.gcr.io/kyma-project/alpine-net:0.2.74" already present on machine
Normal Created 3m kubelet, ip-172-20-70-32.eu-west-1.compute.internal Created container
Normal Started 3m kubelet, ip-172-20-70-32.eu-west-1.compute.internal Started container
It seems that pod tiller-deploy-8cdf857fb-68ckj is waiting for pod kube-dns-6b4f4b544c-7zkkl, which shows the following:
Name: kube-dns-6b4f4b544c-7zkkl
Namespace: kube-system
Priority: 0
PriorityClassName: <none>
Node: <none>
Labels: k8s-app=kube-dns
pod-template-hash=2609061007
Annotations: prometheus.io/port=10055
prometheus.io/scrape=true
scheduler.alpha.kubernetes.io/critical-pod=
scheduler.alpha.kubernetes.io/tolerations=[{"key":"CriticalAddonsOnly", "operator":"Exists"}]
Status: Pending
IP:
Controlled By: ReplicaSet/kube-dns-6b4f4b544c
Containers:
kubedns:
Image: k8s.gcr.io/k8s-dns-kube-dns-amd64:1.14.10
Ports: 10053/UDP, 10053/TCP, 10055/TCP
Host Ports: 0/UDP, 0/TCP, 0/TCP
Args:
--config-dir=/kube-dns-config
--dns-port=10053
--domain=cluster.local.
--v=2
Limits:
memory: 170Mi
Requests:
cpu: 100m
memory: 70Mi
Liveness: http-get http://:10054/healthcheck/kubedns delay=60s timeout=5s period=10s #success=1 #failure=5
Readiness: http-get http://:8081/readiness delay=3s timeout=5s period=10s #success=1 #failure=3
Environment:
PROMETHEUS_PORT: 10055
Mounts:
/kube-dns-config from kube-dns-config (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-dns-token-wh8cf (ro)
dnsmasq:
Image: k8s.gcr.io/k8s-dns-dnsmasq-nanny-amd64:1.14.10
Ports: 53/UDP, 53/TCP
Host Ports: 0/UDP, 0/TCP
Args:
-v=2
-logtostderr
-configDir=/etc/k8s/dns/dnsmasq-nanny
-restartDnsmasq=true
--
-k
--cache-size=1000
--dns-forward-max=150
--no-negcache
--log-facility=-
--server=/cluster.local/127.0.0.1#10053
--server=/in-addr.arpa/127.0.0.1#10053
--server=/in6.arpa/127.0.0.1#10053
Requests:
cpu: 150m
memory: 20Mi
Liveness: http-get http://:10054/healthcheck/dnsmasq delay=60s timeout=5s period=10s #success=1 #failure=5
Environment: <none>
Mounts:
/etc/k8s/dns/dnsmasq-nanny from kube-dns-config (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-dns-token-wh8cf (ro)
sidecar:
Image: k8s.gcr.io/k8s-dns-sidecar-amd64:1.14.10
Port: 10054/TCP
Host Port: 0/TCP
Args:
--v=2
--logtostderr
--probe=kubedns,127.0.0.1:10053,kubernetes.default.svc.cluster.local,5,A
--probe=dnsmasq,127.0.0.1:53,kubernetes.default.svc.cluster.local,5,A
Requests:
cpu: 10m
memory: 20Mi
Liveness: http-get http://:10054/metrics delay=60s timeout=5s period=10s #success=1 #failure=5
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-dns-token-wh8cf (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
kube-dns-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: kube-dns
Optional: true
kube-dns-token-wh8cf:
Type: Secret (a volume populated by a Secret)
SecretName: kube-dns-token-wh8cf
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 16m (x9 over 22m) default-scheduler 0/3 nodes are available: 3 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling 16m default-scheduler 0/3 nodes are available: 3 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling 16m (x4 over 16m) default-scheduler 0/3 nodes are available: 1 node(s) were not ready, 1 node(s) were out of disk space, 2 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling 15m default-scheduler 0/3 nodes are available: 1 node(s) were not ready, 1 node(s) were out of disk space, 2 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling 10m (x7 over 15m) default-scheduler 0/3 nodes are available: 3 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling 8m (x2 over 8m) default-scheduler 0/3 nodes are available: 1 node(s) were not ready, 1 node(s) were out of disk space, 2 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling 4m (x9 over 10m) default-scheduler 0/3 nodes are available: 3 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling 10s (x7 over 4m) default-scheduler 0/3 nodes are available: 3 node(s) had taints that the pod didn't tolerate.
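If it helps, this is how I plan to check what is keeping kube-dns off the masters (a sketch; just standard kubectl, nothing specific to this setup):
# Show the taints on each node that kube-dns would need to tolerate
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
# Inspect the conditions (disk pressure, readiness) on one of the masters
kubectl describe node ip-172-20-70-32.eu-west-1.compute.internal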
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.