
Unable to deploy Kubernetes cluster for v 1.26.0 or above with Cloudstack 4.17.0

nxsbi opened this issue Feb 20 '24

ISSUE TYPE
  • Bug Report
COMPONENT NAME
UI
CLOUDSTACK VERSION
4.17.0
CONFIGURATION
OS / ENVIRONMENT

N/A

SUMMARY

I have an existing CloudStack setup with version 4.17.0. I uploaded some of the Kubernetes ISOs listed at http://download.cloudstack.org/cks/

I am able to use the GUI to create a Kubernetes cluster for versions 1.23.3 and 1.25.0, log in to the Kubernetes Dashboard, etc.

However, when I try to create a cluster for 1.26.0 or higher, it never completes and eventually times out. I can see that the network is created and the control and node servers are spun up, but nothing beyond that. I don't see any usable errors in the CloudStack log either; it just shows `Failed to setup Kubernetes cluster : Cluster_Name in usable state as unable to provision API endpoint for the cluster`.

STEPS TO REPRODUCE
Register a Kubernetes ISO, then use the GUI to create a new cluster for version 1.26.0 or higher.

The cluster creation times out after 60 minutes. 

EXPECTED RESULTS
Cluster is in Running state
ACTUAL RESULTS
I see the network created and the control and node servers created. However, the cluster fails to create with the error
`Failed to setup Kubernetes cluster : Cluster_Name in usable state as unable to provision API endpoint for the cluster`

SystemVM version is 4.17.0

Any advice on how to troubleshoot this would be greatly appreciated. 
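For context, this is roughly how I am watching the management server log while the cluster deploys (a sketch; the default log location is assumed):

```bash
# Follow the management server log during cluster creation and filter for CKS entries
tail -f /var/log/cloudstack/management/management-server.log | grep -i -E 'kubernetes|cks'
```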


nxsbi commented Feb 20 '24 06:02

@nxsbi Could you please check if you have set the global setting “Endpoint url (endpoint.url)” to the management server IP address.
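For reference, a minimal way to inspect and update that setting from CloudMonkey, assuming `cmk` is already configured against this management server (the IP below is just an example):

```bash
# Show the current value of the endpoint.url global setting
cmk list configurations name=endpoint.url

# Point it at the management server API URL (example IP; adjust to your environment)
cmk update configuration name=endpoint.url value=http://10.0.0.10:8080/client/api
```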

kiranchavala commented Feb 20 '24 06:02

Hi @kiranchavala - yes, the global setting is correctly set as the URL for the internet-facing API path. It is not the IP, though:

https://portal.site.com/client/api

where "site" is the actual site name.

nxsbi commented Feb 20 '24 07:02

@nxsbi Thanks

Could you try replacing it with the IP address and cross-check?

Also, what are the management server OS and the hypervisor OS you are currently using?

Please perform these steps to get the logs

1. Log in to the management server.

From there, log in to the control node:

ssh -i <ssh-private-key> -p 2222 cloud@<public-IP-of-Virtual-Router>

Example

ssh -i /var/lib/cloudstack/management/.ssh/id_rsa -p 2222 cloud@<VR-public-IP>

2. Switch to the root user:

cloud@k1-control-18d9cc9f10d:~$ sudo su -
root@k1-control-18d9cc9f10d:~#

3. Execute the kubectl commands:

root@k1-control-18d9cc9f10d:~# cd /opt/cloud/bin

Make sure the nodes are in the Ready state:


root@k1-control-18d9cc9f10d:/opt/cloud/bin# kubectl get nodes

NAME                     STATUS   ROLES           AGE     VERSION
k1-control-18d9cc9f10d   Ready    control-plane   8m37s   v1.28.4
k1-node-18d9cca4fad      Ready    <none>          8m22s   v1.28.4

Make sure all pods are in running state

root@k1-control-18d9cc9f10d:/opt/cloud/bin# kubectl get pods --all-namespaces
NAMESPACE              NAME                                             READY   STATUS    RESTARTS       AGE
kube-system            cloud-controller-manager-574bcb86c-tz9cj         1/1     Running   0              8m36s
kube-system            coredns-5dd5756b68-245tn                         1/1     Running   0              9m12s
kube-system            coredns-5dd5756b68-jplbr                         1/1     Running   0              9m12s
kube-system            etcd-k1-control-18d9cc9f10d                      1/1     Running   0              9m19s
kube-system            kube-apiserver-k1-control-18d9cc9f10d            1/1     Running   0              9m15s
kube-system            kube-controller-manager-k1-control-18d9cc9f10d   1/1     Running   0              9m15s
kube-system            kube-proxy-4qq2h                                 1/1     Running   0              9m5s
kube-system            kube-proxy-jfq7k                                 1/1     Running   0              9m12s
kube-system            kube-scheduler-k1-control-18d9cc9f10d            1/1     Running   0              9m19s
kube-system            weave-net-77lcj                                  2/2     Running   1 (9m9s ago)   9m12s
kube-system            weave-net-k8cnk                                  2/2     Running   0              9m5s
kubernetes-dashboard   dashboard-metrics-scraper-5657497c4c-gmt6m       1/1     Running   0              9m12s
kubernetes-dashboard   kubernetes-dashboard-5b749d9495-vth8j            1/1     Running   0              9m12s
4. Please provide the following logs so we can investigate at which step it is failing:

cat /var/log/daemon.log

cat /var/log/messages


Similarly, log in to the worker node from the management server; change the port to 2223:

ssh -i <ssh-private-key> -p 2223 cloud@<public-IP-of-Virtual-Router>

ssh -i /var/lib/cloudstack/management/.ssh/id_rsa -p 2223 cloud@<VR-public-IP>

cat /var/log/daemon.log

cat /var/log/messages
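If it is easier, the same files can be copied straight to the management server with scp (a sketch using the same key and ports as above; if the cloud user cannot read /var/log/daemon.log directly, sudo-copy it into its home directory first):

```bash
# Pull daemon.log from the control node (port 2222) and the worker node (port 2223)
scp -i /var/lib/cloudstack/management/.ssh/id_rsa -P 2222 \
    cloud@<VR-public-IP>:/var/log/daemon.log ./control-daemon.log
scp -i /var/lib/cloudstack/management/.ssh/id_rsa -P 2223 \
    cloud@<VR-public-IP>:/var/log/daemon.log ./node-daemon.log
```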

kiranchavala commented Feb 20 '24 08:02

Hello @kiranchavala

Finally got back to this. I logged into the control node. Note that I am on 4.17.0; are there any SystemVM-level changes due to which this is not working?

kubectl get nodes output (the same error appears for kubectl get pods --all-namespaces):

root@K120-control-18e3f28025e:/opt/cloud/bin# kubectl get nodes
E0314 23:22:12.988514  197290 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E0314 23:22:12.989740  197290 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E0314 23:22:12.990727  197290 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E0314 23:22:12.993205  197290 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E0314 23:22:12.994831  197290 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
The connection to the server localhost:8080 was refused - did you specify the right host or port?
root@K120-control-18e3f28025e:/opt/cloud/bin#

After this, I looked into /var/log/daemon.log, which shows meaningful error messages; the relevant content is below. However, I have no clue how to fix this. I did some Google searches and found https://github.com/containerd/containerd/discussions/8033, which asks to check /etc/containerd/config.toml to see whether cri is disabled; however, it is not in the disabled list.

Mar 14 23:18:56 systemvm systemd[1]: deploy-kube-system.service: Scheduled restart job, restart counter is at 2094.
Mar 14 23:18:56 systemvm systemd[1]: Stopped deploy-kube-system.service.
Mar 14 23:18:56 systemvm systemd[1]: Started deploy-kube-system.service.
Mar 14 23:18:56 systemvm deploy-kube-system[175532]: W0314 23:18:56.560147  175532 initconfiguration.go:120] Usage of CRI endpoints without URL scheme is deprecated and can cause kubelet errors in the future. Automatically prepending scheme "unix" to the "criSocket" with value "/run/containerd/containerd.sock". Please update your configuration!
Mar 14 23:18:56 systemvm deploy-kube-system[175532]: [init] Using Kubernetes version: v1.27.8
Mar 14 23:18:56 systemvm deploy-kube-system[175532]: [preflight] Running pre-flight checks
Mar 14 23:18:56 systemvm deploy-kube-system[175532]: error execution phase preflight: [preflight] Some fatal errors occurred:
Mar 14 23:18:56 systemvm deploy-kube-system[175532]: #011[ERROR CRI]: container runtime is not running: output: time="2024-03-14T23:18:56Z" level=fatal msg="validate service connection: validate CRI v1 runtime API for endpoint \"unix:///run/containerd/containerd.sock\": rpc error: code = Unimplemented desc = unknown service runtime.v1.RuntimeService"
Mar 14 23:18:56 systemvm deploy-kube-system[175532]: , error: exit status 1
Mar 14 23:18:56 systemvm deploy-kube-system[175532]: [preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
Mar 14 23:18:56 systemvm deploy-kube-system[175532]: To see the stack trace of this error execute with --v=5 or higher
Mar 14 23:18:56 systemvm deploy-kube-system[175560]: W0314 23:18:56.727208  175560 initconfiguration.go:120] Usage of CRI endpoints without URL scheme is deprecated and can cause kubelet errors in the future. Automatically prepending scheme "unix" to the "criSocket" with value "/run/containerd/containerd.sock". Please update your configuration!
Mar 14 23:18:56 systemvm deploy-kube-system[175560]: [init] Using Kubernetes version: v1.27.8
Mar 14 23:18:56 systemvm deploy-kube-system[175560]: [preflight] Running pre-flight checks
Mar 14 23:18:56 systemvm deploy-kube-system[175560]: error execution phase preflight: [preflight] Some fatal errors occurred:
Mar 14 23:18:56 systemvm deploy-kube-system[175560]: #011[ERROR CRI]: container runtime is not running: output: time="2024-03-14T23:18:56Z" level=fatal msg="validate service connection: validate CRI v1 runtime API for endpoint \"unix:///run/containerd/containerd.sock\": rpc error: code = Unimplemented desc = unknown service runtime.v1.RuntimeService"
Mar 14 23:18:56 systemvm deploy-kube-system[175560]: , error: exit status 1
Mar 14 23:18:56 systemvm deploy-kube-system[175560]: [preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
Mar 14 23:18:56 systemvm deploy-kube-system[175560]: To see the stack trace of this error execute with --v=5 or higher
Mar 14 23:18:56 systemvm deploy-kube-system[175587]: W0314 23:18:56.893921  175587 initconfiguration.go:120] Usage of CRI endpoints without URL scheme is deprecated and can cause kubelet errors in the future. Automatically prepending scheme "unix" to the "criSocket" with value "/run/containerd/containerd.sock". Please update your configuration!
Mar 14 23:18:56 systemvm deploy-kube-system[175587]: [init] Using Kubernetes version: v1.27.8
Mar 14 23:18:56 systemvm deploy-kube-system[175587]: [preflight] Running pre-flight checks
Mar 14 23:18:57 systemvm deploy-kube-system[175587]: error execution phase preflight: [preflight] Some fatal errors occurred:
Mar 14 23:18:57 systemvm deploy-kube-system[175587]: #011[ERROR CRI]: container runtime is not running: output: time="2024-03-14T23:18:56Z" level=fatal msg="validate service connection: validate CRI v1 runtime API for endpoint \"unix:///run/containerd/containerd.sock\": rpc error: code = Unimplemented desc = unknown service runtime.v1.RuntimeService"
Mar 14 23:18:57 systemvm deploy-kube-system[175587]: , error: exit status 1
Mar 14 23:18:57 systemvm deploy-kube-system[175587]: [preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
Mar 14 23:18:57 systemvm deploy-kube-system[175587]: To see the stack trace of this error execute with --v=5 or higher
Mar 14 23:18:57 systemvm deploy-kube-system[175531]: Error: kubeadm init failed!
Mar 14 23:18:57 systemvm systemd[1]: deploy-kube-system.service: Main process exited, code=exited, status=1/FAILURE
Mar 14 23:18:57 systemvm systemd[1]: deploy-kube-system.service: Failed with result 'exit-code'.
Mar 14 23:18:57 systemvm systemd[1]: deploy-kube-system.service: Scheduled restart job, restart counter is at 2095.
Mar 14 23:18:57 systemvm systemd[1]: Stopped deploy-kube-system.service.
Mar 14 23:18:57 systemvm systemd[1]: Started deploy-kube-system.service.
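For reference, the check for a disabled CRI plugin boils down to something like this (a sketch):

```bash
# "cri" should not appear in containerd's disabled plugins list
grep disabled_plugins /etc/containerd/config.toml
# (cri is not listed here in my case)

# containerd itself is running
systemctl status containerd --no-pager
```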

nxsbi commented Mar 14 '24 23:03

@nxsbi

Could you please check your kubeconfig file on the control node, located at /root/.kube/config:

root@gh-control-18e40c81d05:~# cat /root/.kube/config

Make sure the server address points to the router's public IP address, e.g.:

server: https://10.0.57.164:6443
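A quick way to confirm the API endpoint is actually reachable on that address (a sketch; the IP is the example above, and an unauthenticated request may return a 401/403 body, which still proves the port is open):

```bash
# The cluster API server should answer on port 6443 of the router's public IP
curl -k https://10.0.57.164:6443/version
```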


kiranchavala commented Mar 15 '24 06:03

@nxsbi what's the containerd version? It might be unsupported by k8s 1.26+.

Refer to https://containerd.io/releases/#kubernetes-support
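A quick way to check this on the control/worker nodes (a sketch; the containerd socket is passed explicitly, as kubeadm uses it):

```bash
# Report the installed containerd version
containerd --version

# Ask containerd for its CRI runtime status over the same socket kubeadm uses;
# on containerd 1.5 this fails with "unknown service runtime.v1.RuntimeService"
crictl --runtime-endpoint unix:///run/containerd/containerd.sock info
```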

weizhouapache commented Mar 15 '24 07:03

@weizhouapache - You are right, the containerd version was 1.5. I manually updated containerd (using apt) on the control and node VMs and it updated to version 1.6.28.

After this, the control plane is available. However, the node still has errors; it seems the manifest never got created there. See the daemon.log from the node.

However, I should not have to do this manually. Is this related to the SystemVM version being 4.17.0? I have not yet upgraded to a newer version.

The path /etc/kubernetes/manifest is not present on the node server at all.
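For reference, the manual upgrade was roughly the following on both VMs (a sketch; it assumes the Debian `containerd` package shipped in the systemvm template):

```bash
# Upgrade containerd from the distro repositories and restart it;
# the deploy-kube-system service then retries kubeadm init on its own
sudo apt-get update
sudo apt-get install --only-upgrade -y containerd
sudo systemctl restart containerd
containerd --version   # now reports 1.6.28
```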

nxsbi commented Mar 15 '24 16:03

> @weizhouapache - You are right, the containerd version was 1.5. I manually updated containerd (using apt) on the control and node VMs and it updated to version 1.6.28.
>
> After this, the control plane is available. However, the node still has errors; it seems the manifest never got created there. See the daemon.log from the node.
>
> However, I should not have to do this manually. Is this related to the SystemVM version being 4.17.0? I have not yet upgraded to a newer version.

Yes, the systemvm template for 4.17.0 is too old.

> The path /etc/kubernetes/manifest is not present on the node server at all.

It is just a message and can be ignored.

4.17 is EOL; please upgrade to 4.19 or 4.18.

weizhouapache commented Mar 15 '24 16:03

@weizhouapache - I upgraded the SystemVM to 4.19 and am making progress. The cluster started up, and I was able to download the kubeconfig file; see the output below. However, I am not able to retrieve the token to access the dashboard. In the older cluster (v1.25) I can see the token, but here it just shows the output below, as if it is being masked. I also tried variations of the command but was unable to get the token.


 kubectl get nodes
NAME                       STATUS   ROLES           AGE   VERSION
k122-control-18e43529c69   Ready    control-plane   84m   v1.27.8
k122-node-18e4353e392      Ready    <none>          84m   v1.27.8


 ./kubectl --kubeconfig k122.conf get pods --all-namespaces
NAMESPACE              NAME                                               READY   STATUS    RESTARTS      AGE
kube-system            cloud-controller-manager-5b8fc87665-6n5xb          1/1     Running   0             78m
kube-system            coredns-5d78c9869d-z2zdh                           1/1     Running   0             79m
kube-system            coredns-5d78c9869d-zvgq9                           1/1     Running   0             79m
kube-system            etcd-k122-control-18e43529c69                      1/1     Running   0             79m
kube-system            kube-apiserver-k122-control-18e43529c69            1/1     Running   0             79m
kube-system            kube-controller-manager-k122-control-18e43529c69   1/1     Running   0             79m
kube-system            kube-proxy-vjxxj                                   1/1     Running   0             79m
kube-system            kube-proxy-vn4ql                                   1/1     Running   0             79m
kube-system            kube-scheduler-k122-control-18e43529c69            1/1     Running   0             79m
kube-system            weave-net-rvcql                                    2/2     Running   0             79m
kube-system            weave-net-znt5x                                    2/2     Running   1 (79m ago)   79m
kubernetes-dashboard   dashboard-metrics-scraper-5cb4f4bb9c-8f2g5         1/1     Running   0             79m
kubernetes-dashboard   kubernetes-dashboard-6bccb5f4cc-tphwg              1/1     Running   0             79m



./kubectl --kubeconfig k122.conf describe secret $(./kubectl --kubeconfig k122.conf get secrets -n kubernetes-dashboard | grep kubernetes-dashboard-token | awk '{print $1}') -n kubernetes-dashboard
Name:         kubernetes-dashboard-certs
Namespace:    kubernetes-dashboard
Labels:       k8s-app=kubernetes-dashboard
Annotations:  <none>

Type:  Opaque

Data
====


Name:         kubernetes-dashboard-csrf
Namespace:    kubernetes-dashboard
Labels:       k8s-app=kubernetes-dashboard
Annotations:  <none>

Type:  Opaque

Data
====
csrf:  256 bytes


Name:         kubernetes-dashboard-key-holder
Namespace:    kubernetes-dashboard
Labels:       <none>
Annotations:  <none>

Type:  Opaque

Data
====
priv:  1679 bytes
pub:   459 bytes
                                                                       

nxsbi commented Mar 15 '24 19:03

@nxsbi please refer to #7764

> @weizhouapache - I upgraded the SystemVM to 4.19 and am making progress. The cluster started up, and I was able to download the kubeconfig file; see the output below. However, I am not able to retrieve the token to access the dashboard. In the older cluster (v1.25) I can see the token, but here it just shows the output below, as if it is being masked. I also tried variations of the command but was unable to get the token.
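In short, newer Kubernetes versions no longer create a long-lived token secret for the dashboard service account automatically, so the token has to be requested explicitly. A minimal sketch (the service account name kubernetes-dashboard is assumed from the deployment above):

```bash
# Kubernetes 1.24+: request a token for the dashboard service account on demand
./kubectl --kubeconfig k122.conf -n kubernetes-dashboard create token kubernetes-dashboard
```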

weizhouapache commented Mar 15 '24 19:03

Awesome - thanks for all the help! I am able to get in!

nxsbi commented Mar 15 '24 20:03