
EC2 instances are not showing in the node list for kubectl

Open bobanda87 opened this issue 2 years ago • 17 comments

/kind bug

1. What kops version are you running? The command kops version will display this information. Version 1.11.1

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag. 1.11.8

3. What cloud provider are you using? AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

export KOPS_STATE_STORE=s3://....
export KOPS_CLUSTER_NAME=stage.cluster.example.com
export KOPS_RUN_OBSOLETE_VERSION=true
kops update cluster

5. What happened after the commands executed? Initially I was not able to see EC2 instances in the list of nodes when I ran kubectl get nodes.

After that I decided to test changing the minSize and maxSize parameters in the kops configuration and to run kops update cluster. After running this command, I got this error:

W0602 17:16:48.350262 1505682 launchconfiguration.go:197] unable to resolve image: "kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2021-02-05": could not find Image for "kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2021-02-05"

Then I updated the image to ami-0c259a97cbf621daf, since I read that kope.io images are not available anymore. This is Ubuntu 18.04, which I found here: https://cloud-images.ubuntu.com/locator/ec2/
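Roughly, the image can be changed per instance group like this (a sketch; the group names are the ones kops validate cluster lists later in this thread, and KOPS_CLUSTER_NAME is assumed to still be exported as above):

kops get ig nodes -o yaml   # shows the current spec.image
kops edit ig nodes          # change spec.image to ami-0c259a97cbf621daf
kops edit ig ci             # repeat for the other groups / masters as needed
kops update cluster --yes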

After that the cluster was updated. I got the following at the end:

kops has set your kubectl context to stage.cluster.example.com
Cluster changes have been applied to the cloud.
Changes may require instances to restart: kops rolling-update cluster

After that, when I run kubectl get nodes, I do not see those nodes (the EC2 instances are created as part of the AutoScalingGroup).

6. What did you expect to happen? I expect to see those nodes when I run kubectl get nodes (the EC2 instances are created as part of the AutoScalingGroup).

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else we need to know?

bobanda87 avatar Jun 02 '23 15:06 bobanda87

I suggest you try and look at https://kops.sigs.k8s.io/operations/troubleshoot/ and maybe find what is the issue in instance logs.
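For example, the boot log of an instance that is not joining can be pulled without SSH (a sketch, assuming the AWS CLI is configured for this account; the instance ID is one of the machines reported later in this thread as not having joined):

aws ec2 get-console-output --region eu-west-1 --instance-id i-02794359d48b221d9 --output text | tail -n 100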

hakman avatar Jun 02 '23 16:06 hakman

@hakman Thanks for the reply. I'm not able to actually SSH from the bastion to the nodes at this point, after running the rolling-update command :/

This is what I can see when I run validate cluster

$ kops validate cluster
Validating cluster staging.k8s.example.com

unexpected error during validation: error listing nodes: Get https://api.staging.k8s.example.com/api/v1/nodes: EOF

bobanda87 avatar Jun 05 '23 11:06 bobanda87

Also, I can see this

$ curl https://api.staging.k8s.example.com/api/v1/nodes
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to api.staging.k8s.example.com:443

So, it seems like this got broken during rolling-update
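A quick way to check whether the API server is presenting a TLS certificate at all (a sketch; this just prints whatever certificate chain the endpoint returns, if any):

openssl s_client -connect api.staging.k8s.example.com:443 -servername api.staging.k8s.example.com </dev/null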

bobanda87 avatar Jun 05 '23 12:06 bobanda87

@bobanda87 You chose an Ubuntu 18.04 image that was experimental at the time of kOps 1.11. You may have more luck with Debian Stretch from https://wiki.debian.org/Cloud/AmazonEC2Image/Stretch.

hakman avatar Jun 05 '23 16:06 hakman

@hakman Thanks for the reply! My previous image was kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17 (exactly as you said, Debian Stretch), but this source is not available anymore. Can you recommend a specific image that could replace it? (eu-west-1 region, if that matters)

bobanda87 avatar Jun 05 '23 19:06 bobanda87

That image was a kope.io image based on Debian Stretch, not the official Debian one. I cannot recommend any image, just suggest you look at https://wiki.debian.org/Cloud/AmazonEC2Image/Stretch. kOps 1.11 is ancient and I doubt there are many other people still using it.
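For reference, the official Stretch AMIs available in eu-west-1 can be listed with the AWS CLI (a sketch; 379101102735 is the Debian account ID referenced in the image name suggested later in this thread):

aws ec2 describe-images --region eu-west-1 --owners 379101102735 \
  --filters "Name=name,Values=debian-stretch-hvm-x86_64-gp2-*" \
  --query 'sort_by(Images,&CreationDate)[].Name' --output text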

hakman avatar Jun 05 '23 20:06 hakman

@hakman Thanks for the suggestion! I have tried several AMIs from that list, but they didn't give me anything different from what I had at the starting point (e.g. ami-01f43da22ee0fbf95).

Can you help me figure out the next kops version that would not be complicated to upgrade to? Are there any step-by-step guides for it?

Do you know where I can find working AMIs for the kops version you suggest I upgrade to? Will that be compatible with kubectl version 1.11?

bobanda87 avatar Jun 06 '23 14:06 bobanda87

This is what I see when I validate the cluster

$ kops validate cluster
Validating cluster staging.k8s.example.com

INSTANCE GROUPS
NAME			ROLE	MACHINETYPE	MIN	MAX	SUBNETS
bastions		Bastion	t2.micro	1	1	utility-eu-west-1a,utility-eu-west-1b,utility-eu-west-1c
ci			Node	t3.large	1	6	eu-west-1a,eu-west-1b,eu-west-1c
master-eu-west-1a	Master	t3.medium	1	1	eu-west-1a
master-eu-west-1b	Master	t3.medium	1	1	eu-west-1b
master-eu-west-1c	Master	t3.medium	1	1	eu-west-1c
nodes			Node	t3.large	1	8	eu-west-1a,eu-west-1b,eu-west-1c

NODE STATUS
NAME						ROLE	READY
ip-172-20-118-39.eu-west-1.compute.internal	master	True
ip-172-20-48-246.eu-west-1.compute.internal	master	True
ip-172-20-70-32.eu-west-1.compute.internal	master	True

VALIDATION ERRORS
KIND	NAME						MESSAGE
Machine	i-02794359d48b221d9				machine "i-02794359d48b221d9" has not yet joined cluster
Machine	i-03da8ef9491cf4c68				machine "i-03da8ef9491cf4c68" has not yet joined cluster
Pod	kube-system/calico-node-88j2n			kube-system pod "calico-node-88j2n" is not healthy
Pod	kube-system/tiller-deploy-8cdf857fb-67s6h	kube-system pod "tiller-deploy-8cdf857fb-67s6h" is not healthy

Validation Failed

bobanda87 avatar Jun 06 '23 14:06 bobanda87

Try setting the image as 379101102735/debian-stretch-hvm-x86_64-gp2-2022-07-01-66430. If that doesn't work, the only helpful info will come from SSHing into the instance and checking the kops-configuration service log.
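A sketch of applying that image and rolling it out (the instance group name here is just an example from this cluster):

kops edit ig ci        # set spec.image: 379101102735/debian-stretch-hvm-x86_64-gp2-2022-07-01-66430
kops update cluster --yes
kops rolling-update cluster --yes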

There is no easy upgrade to a newer version of kOps after 5 years, sorry :).

hakman avatar Jun 06 '23 16:06 hakman

kOps 1.19 is the newest version of kOps that supports k8s 1.11 and 1.12. But you're probably better off standing up a new cluster and moving your workloads.
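If you go the new-cluster route, a rough sketch of dumping the existing workload manifests so they can be re-applied elsewhere (the namespaces and resource kinds here are only examples):

for ns in default kube-system; do
  kubectl get deploy,svc,configmap,secret -n "$ns" -o yaml > "backup-$ns.yaml"
done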

johngmyers avatar Jun 06 '23 18:06 johngmyers

So, it looks pretty much the same with the 379101102735/debian-stretch-hvm-x86_64-gp2-2022-07-01-66430 AMI, unfortunately :(

$ kops rolling-update cluster --yes
NAME			STATUS		NEEDUPDATE	READY	MIN	MAX	NODES
bastions		Ready		0		1	1	1	0
ci			NeedsUpdate	1		0	1	6	0
master-eu-west-1a	Ready		0		1	1	1	1
master-eu-west-1b	Ready		0		1	1	1	1
master-eu-west-1c	Ready		0		1	1	1	1
nodes			Ready		0		1	1	8	0
W0606 20:13:59.578007 1694641 instancegroups.go:175] Skipping drain of instance "i-02794359d48b221d9", because it is not registered in kubernetes
W0606 20:13:59.578053 1694641 instancegroups.go:183] no kubernetes Node associated with i-02794359d48b221d9, skipping node deletion
I0606 20:13:59.578075 1694641 instancegroups.go:301] Stopping instance "i-02794359d48b221d9", in group "ci.staging.k8s.example.com" (this may take a while).
I0606 20:13:59.888383 1694641 instancegroups.go:198] waiting for 4m0s after terminating instance
I0606 20:17:59.888527 1694641 instancegroups.go:209] Validating the cluster.
I0606 20:18:01.708379 1694641 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-0decd0372bdf0529e" has not yet joined cluster.
I0606 20:18:32.555451 1694641 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-0decd0372bdf0529e" has not yet joined cluster.
I0606 20:19:02.845418 1694641 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-0decd0372bdf0529e" has not yet joined cluster.
I0606 20:19:32.599475 1694641 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-0decd0372bdf0529e" has not yet joined cluster.
I0606 20:20:03.294402 1694641 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-0decd0372bdf0529e" has not yet joined cluster.
I0606 20:20:32.624184 1694641 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-0235e366bde060290" has not yet joined cluster.
I0606 20:21:02.605659 1694641 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-0235e366bde060290" has not yet joined cluster.
I0606 20:21:32.899162 1694641 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-0235e366bde060290" has not yet joined cluster.
I0606 20:22:02.576949 1694641 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-0ebdd3f99f7aee0d2" has not yet joined cluster.
I0606 20:22:32.757390 1694641 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-0235e366bde060290" has not yet joined cluster.
E0606 20:23:01.708545 1694641 instancegroups.go:214] Cluster did not validate within 5m0s

error validating cluster after removing a node: cluster did not validate within a duration of "5m0s"
$ kops validate cluster
Validating cluster staging.k8s.example.com

INSTANCE GROUPS
NAME			ROLE	MACHINETYPE	MIN	MAX	SUBNETS
bastions		Bastion	t2.micro	1	1	utility-eu-west-1a,utility-eu-west-1b,utility-eu-west-1c
ci			Node	t3.large	1	6	eu-west-1a,eu-west-1b,eu-west-1c
master-eu-west-1a	Master	t3.medium	1	1	eu-west-1a
master-eu-west-1b	Master	t3.medium	1	1	eu-west-1b
master-eu-west-1c	Master	t3.medium	1	1	eu-west-1c
nodes			Node	t3.large	1	8	eu-west-1a,eu-west-1b,eu-west-1c

NODE STATUS
NAME						ROLE	READY
ip-172-20-118-39.eu-west-1.compute.internal	master	True
ip-172-20-48-246.eu-west-1.compute.internal	master	True
ip-172-20-70-32.eu-west-1.compute.internal	master	True

VALIDATION ERRORS
KIND	NAME						MESSAGE
Machine	i-0235e366bde060290				machine "i-0235e366bde060290" has not yet joined cluster
Machine	i-0ebdd3f99f7aee0d2				machine "i-0ebdd3f99f7aee0d2" has not yet joined cluster
Pod	kube-system/tiller-deploy-8cdf857fb-67s6h	kube-system pod "tiller-deploy-8cdf857fb-67s6h" is not healthy

Validation Failed

bobanda87 avatar Jun 06 '23 20:06 bobanda87

the only helpful info will come from SSHing into the instance and checking the kops-configuration service log.

Can you be a bit more specific about where to SSH and how to check this log? (Apologies if this is a basic question, but I inherited this setup and am not fully familiar with kops :) )

bobanda87 avatar Jun 06 '23 20:06 bobanda87

You would ssh into the instance that isn't joining.

https://kops.sigs.k8s.io/operations/troubleshoot/
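A minimal sketch of reaching a node through the bastion and checking the kops-configuration service log (the hostnames, node IP, and SSH user are placeholders; the login user depends on the AMI, e.g. admin for Debian or ubuntu for Ubuntu):

ssh -J admin@bastion.staging.k8s.example.com admin@172.20.x.x
sudo journalctl -u kops-configuration --no-pager | tail -n 200
sudo journalctl -u kubelet --no-pager | tail -n 100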

johngmyers avatar Jun 06 '23 20:06 johngmyers

I can see this for the tiller-deploy pod:

Name:               tiller-deploy-8cdf857fb-68ckj
Namespace:          kube-system
Priority:           0
PriorityClassName:  <none>
Node:               ip-172-20-70-32.eu-west-1.compute.internal/172.20.70.32
Start Time:         Wed, 07 Jun 2023 13:07:36 +0300
Labels:             app=helm
                    name=tiller
                    pod-template-hash=478941396
Annotations:        <none>
Status:             Pending
IP:                 100.105.54.165
Controlled By:      ReplicaSet/tiller-deploy-8cdf857fb
Init Containers:
  init-tiller:
    Container ID:  docker://2b7a3ad25aa43a3759254cd54d755494f8967aa6873fb359821094354dcfea11
    Image:         eu.gcr.io/kyma-project/alpine-net:0.2.74
    Image ID:      docker-pullable://eu.gcr.io/kyma-project/alpine-net@sha256:febf714b1d6ff93331406ed972c038e61e393137d923ba0f1d7cdbd1a16fda24
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      until nc -zv kube-dns.kube-system.svc.cluster.local 53; do echo waiting for k8s readiness; sleep 2; done;
    State:          Running
      Started:      Wed, 07 Jun 2023 13:20:02 +0300
    Ready:          False
    Restart Count:  1
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from tiller-token-6bksf (ro)
Containers:
  tiller:
    Container ID:   
    Image:          ghcr.io/helm/tiller:v2.13.0
    Image ID:       
    Port:           44134/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Liveness:       http-get http://:44135/liveness delay=1s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:44135/readiness delay=1s timeout=1s period=10s #success=1 #failure=3
    Environment:
      TILLER_NAMESPACE:    kube-system
      TILLER_HISTORY_MAX:  20
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from tiller-token-6bksf (ro)
Conditions:
  Type              Status
  Initialized       False 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  tiller-token-6bksf:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  tiller-token-6bksf
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  nodePool=master
Tolerations:     node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason          Age   From                                                 Message
  ----    ------          ----  ----                                                 -------
  Normal  Scheduled       16m   default-scheduler                                    Successfully assigned kube-system/tiller-deploy-8cdf857fb-68ckj to ip-172-20-70-32.eu-west-1.compute.internal
  Normal  Pulling         15m   kubelet, ip-172-20-70-32.eu-west-1.compute.internal  pulling image "eu.gcr.io/kyma-project/alpine-net:0.2.74"
  Normal  Pulled          15m   kubelet, ip-172-20-70-32.eu-west-1.compute.internal  Successfully pulled image "eu.gcr.io/kyma-project/alpine-net:0.2.74"
  Normal  Created         15m   kubelet, ip-172-20-70-32.eu-west-1.compute.internal  Created container
  Normal  Started         15m   kubelet, ip-172-20-70-32.eu-west-1.compute.internal  Started container
  Normal  SandboxChanged  3m    kubelet, ip-172-20-70-32.eu-west-1.compute.internal  Pod sandbox changed, it will be killed and re-created.
  Normal  Pulled          3m    kubelet, ip-172-20-70-32.eu-west-1.compute.internal  Container image "eu.gcr.io/kyma-project/alpine-net:0.2.74" already present on machine
  Normal  Created         3m    kubelet, ip-172-20-70-32.eu-west-1.compute.internal  Created container
  Normal  Started         3m    kubelet, ip-172-20-70-32.eu-west-1.compute.internal  Started container
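The init container is simply waiting for cluster DNS (its command runs nc -zv against kube-dns), so a quick way to confirm that kube-dns itself is the problem (a sketch):

kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
kubectl -n kube-system get svc kube-dns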

bobanda87 avatar Jun 07 '23 11:06 bobanda87

It seems that pod tiller-deploy-8cdf857fb-68ckj is waiting for pod kube-dns-6b4f4b544c-7zkkl, which shows this when described:

Name:               kube-dns-6b4f4b544c-7zkkl
Namespace:          kube-system
Priority:           0
PriorityClassName:  <none>
Node:               <none>
Labels:             k8s-app=kube-dns
                    pod-template-hash=2609061007
Annotations:        prometheus.io/port=10055
                    prometheus.io/scrape=true
                    scheduler.alpha.kubernetes.io/critical-pod=
                    scheduler.alpha.kubernetes.io/tolerations=[{"key":"CriticalAddonsOnly", "operator":"Exists"}]
Status:             Pending
IP:                 
Controlled By:      ReplicaSet/kube-dns-6b4f4b544c
Containers:
  kubedns:
    Image:       k8s.gcr.io/k8s-dns-kube-dns-amd64:1.14.10
    Ports:       10053/UDP, 10053/TCP, 10055/TCP
    Host Ports:  0/UDP, 0/TCP, 0/TCP
    Args:
      --config-dir=/kube-dns-config
      --dns-port=10053
      --domain=cluster.local.
      --v=2
    Limits:
      memory:  170Mi
    Requests:
      cpu:      100m
      memory:   70Mi
    Liveness:   http-get http://:10054/healthcheck/kubedns delay=60s timeout=5s period=10s #success=1 #failure=5
    Readiness:  http-get http://:8081/readiness delay=3s timeout=5s period=10s #success=1 #failure=3
    Environment:
      PROMETHEUS_PORT:  10055
    Mounts:
      /kube-dns-config from kube-dns-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-dns-token-wh8cf (ro)
  dnsmasq:
    Image:       k8s.gcr.io/k8s-dns-dnsmasq-nanny-amd64:1.14.10
    Ports:       53/UDP, 53/TCP
    Host Ports:  0/UDP, 0/TCP
    Args:
      -v=2
      -logtostderr
      -configDir=/etc/k8s/dns/dnsmasq-nanny
      -restartDnsmasq=true
      --
      -k
      --cache-size=1000
      --dns-forward-max=150
      --no-negcache
      --log-facility=-
      --server=/cluster.local/127.0.0.1#10053
      --server=/in-addr.arpa/127.0.0.1#10053
      --server=/in6.arpa/127.0.0.1#10053
    Requests:
      cpu:        150m
      memory:     20Mi
    Liveness:     http-get http://:10054/healthcheck/dnsmasq delay=60s timeout=5s period=10s #success=1 #failure=5
    Environment:  <none>
    Mounts:
      /etc/k8s/dns/dnsmasq-nanny from kube-dns-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-dns-token-wh8cf (ro)
  sidecar:
    Image:      k8s.gcr.io/k8s-dns-sidecar-amd64:1.14.10
    Port:       10054/TCP
    Host Port:  0/TCP
    Args:
      --v=2
      --logtostderr
      --probe=kubedns,127.0.0.1:10053,kubernetes.default.svc.cluster.local,5,A
      --probe=dnsmasq,127.0.0.1:53,kubernetes.default.svc.cluster.local,5,A
    Requests:
      cpu:        10m
      memory:     20Mi
    Liveness:     http-get http://:10054/metrics delay=60s timeout=5s period=10s #success=1 #failure=5
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-dns-token-wh8cf (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  kube-dns-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kube-dns
    Optional:  true
  kube-dns-token-wh8cf:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kube-dns-token-wh8cf
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  16m (x9 over 22m)   default-scheduler  0/3 nodes are available: 3 node(s) had taints that the pod didn't tolerate.
  Warning  FailedScheduling  16m                 default-scheduler  0/3 nodes are available: 3 node(s) had taints that the pod didn't tolerate.
  Warning  FailedScheduling  16m (x4 over 16m)   default-scheduler  0/3 nodes are available: 1 node(s) were not ready, 1 node(s) were out of disk space, 2 node(s) had taints that the pod didn't tolerate.
  Warning  FailedScheduling  15m                 default-scheduler  0/3 nodes are available: 1 node(s) were not ready, 1 node(s) were out of disk space, 2 node(s) had taints that the pod didn't tolerate.
  Warning  FailedScheduling  10m (x7 over 15m)   default-scheduler  0/3 nodes are available: 3 node(s) had taints that the pod didn't tolerate.
  Warning  FailedScheduling  8m (x2 over 8m)     default-scheduler  0/3 nodes are available: 1 node(s) were not ready, 1 node(s) were out of disk space, 2 node(s) had taints that the pod didn't tolerate.
  Warning  FailedScheduling  4m (x9 over 10m)    default-scheduler  0/3 nodes are available: 3 node(s) had taints that the pod didn't tolerate.
  Warning  FailedScheduling  10s (x7 over 4m)    default-scheduler  0/3 nodes are available: 3 node(s) had taints that the pod didn't tolerate.
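Since the scheduler reports "node(s) had taints that the pod didn't tolerate", a quick way to compare the masters' taints with the kube-dns tolerations (a sketch):

kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
kubectl -n kube-system get deploy kube-dns -o jsonpath='{.spec.template.spec.tolerations}'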


  

bobanda87 avatar Jun 07 '23 12:06 bobanda87

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 22 '24 01:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Feb 21 '24 01:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Mar 22 '24 02:03 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 22 '24 02:03 k8s-ci-robot