Unable to deploy Kubernetes cluster for v1.26.0 or above with CloudStack 4.17.0
ISSUE TYPE
- Bug Report
COMPONENT NAME
UI
CLOUDSTACK VERSION
4.17.0
CONFIGURATION
OS / ENVIRONMENT
N/A
SUMMARY
I have an existing CloudStack setup running version 4.17.0. I uploaded some of the Kubernetes ISOs listed at http://download.cloudstack.org/cks/
I am able to use the GUI to create a Kubernetes cluster for versions 1.23.3 and 1.25.0, log in to the Kubernetes Dashboard, etc.
However, when I try to create a cluster for 1.26.0 or higher it never completes and eventually times out. I can see that the network is created and the control and node servers are spun up, but nothing beyond that.
I don't see any usable errors in the CloudStack log either - it just shows:
Failed to setup Kubernetes cluster : Cluster_Name in usable state as unable to provision API endpoint for the cluster
STEPS TO REPRODUCE
Register the Kubernetes ISO, then use the GUI to create a new cluster for version 1.26.0 or higher.
The cluster creation times out after 60 minutes.
EXPECTED RESULTS
Cluster is in Running state
ACTUAL RESULTS
I see the network created and the control and node servers created. However, the cluster fails to create with the error:
`Failed to setup Kubernetes cluster : Cluster_Name in usable state as unable to provision API endpoint for the cluster`
SystemVM version is 4.17.0
Any advice on how to troubleshoot this would be greatly appreciated.
@nxsbi Could you please check if you have set the global setting "Endpoint url (endpoint.url)" to the management server IP address?
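If it helps, one way to check and change this from the CLI (a sketch, assuming CloudMonkey (`cmk`) is configured against this management server; the IP and port below are placeholders):

```
# show the current value of the global setting
cmk list configurations name=endpoint.url

# point it at the management server IP instead of a hostname (placeholder shown)
cmk update configuration name=endpoint.url value=http://<management-server-ip>:8080/client/api
```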
Hi @kiranchavala - yes, the global setting is correctly set as the URL for the internet-facing API path. It is not the IP, though:
https://portal.site.com/client/api
(where "site" is the actual site)
@nxsbi Thanks.
Could you try replacing it with the IP address and cross-check?
Also, what are the management server OS and the hypervisor OS you are currently using?
Please perform these steps to get the logs:
- Log in to the management server
- Log in to the control node from the management server:
ssh -i <ssh-private.key> -p 2222 cloud@<Public IP address of Virtual Router>
Example
ssh -i /var/lib/cloudstack/management/.ssh/id_rsa -p 2222 [email protected]
- Switch to root user
cloud@k1-control-18d9cc9f10d:~$ sudo su -
root@k1-control-18d9cc9f10d:~#
- Execute the kubectl commands
root@k1-control-18d9cc9f10d:~# cd /opt/cloud/bin
Make sure the nodes are in the Running state:
root@k1-control-18d9cc9f10d:/opt/cloud/bin# kubectl get nodes
NAME STATUS ROLES AGE VERSION
k1-control-18d9cc9f10d Ready control-plane 8m37s v1.28.4
k1-node-18d9cca4fad Ready <none> 8m22s v1.28.4
Make sure all pods are in the Running state:
root@k1-control-18d9cc9f10d:/opt/cloud/bin# kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system cloud-controller-manager-574bcb86c-tz9cj 1/1 Running 0 8m36s
kube-system coredns-5dd5756b68-245tn 1/1 Running 0 9m12s
kube-system coredns-5dd5756b68-jplbr 1/1 Running 0 9m12s
kube-system etcd-k1-control-18d9cc9f10d 1/1 Running 0 9m19s
kube-system kube-apiserver-k1-control-18d9cc9f10d 1/1 Running 0 9m15s
kube-system kube-controller-manager-k1-control-18d9cc9f10d 1/1 Running 0 9m15s
kube-system kube-proxy-4qq2h 1/1 Running 0 9m5s
kube-system kube-proxy-jfq7k 1/1 Running 0 9m12s
kube-system kube-scheduler-k1-control-18d9cc9f10d 1/1 Running 0 9m19s
kube-system weave-net-77lcj 2/2 Running 1 (9m9s ago) 9m12s
kube-system weave-net-k8cnk 2/2 Running 0 9m5s
kubernetes-dashboard dashboard-metrics-scraper-5657497c4c-gmt6m 1/1 Running 0 9m12s
kubernetes-dashboard kubernetes-dashboard-5b749d9495-vth8j 1/1 Running 0 9m12s
- Please provide the following logs so we can investigate at which step it is failing:
cat /var/log/daemon.log
cat /var/log/messages
Similarly, log into the worker node from the management server, changing the port to 2223:
ssh -i <ssh-private.key> -p 2223 cloud@<Public ip address of Virtual Router>
ssh -i /var/lib/cloudstack/management/.ssh/id_rsa -p 2223 [email protected]
cat /var/log/daemon.log
cat /var/log/messages
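It can also help to look directly at the CKS bootstrap service on the control node; it runs as the `deploy-kube-system` systemd unit (the same unit that shows up in daemon.log). A sketch:

```
# status and recent journal output of the CKS setup service on the control node
systemctl status deploy-kube-system --no-pager
journalctl -u deploy-kube-system --no-pager | tail -n 100
```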
Hello @kiranchavala
Finally got back to this... I logged into the control node. NOTE: I am on 4.17.0 - are there any SystemVM-level changes due to which this is not working?
kubectl get nodes (same message for kubectl get pods --all-namespaces):
root@K120-control-18e3f28025e:/opt/cloud/bin# kubectl get nodes
E0314 23:22:12.988514 197290 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E0314 23:22:12.989740 197290 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E0314 23:22:12.990727 197290 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E0314 23:22:12.993205 197290 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E0314 23:22:12.994831 197290 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
The connection to the server localhost:8080 was refused - did you specify the right host or port?
root@K120-control-18e3f28025e:/opt/cloud/bin#
After this, I looked into /var/log/daemon.log, which shows meaningful error messages; the relevant content from /var/log/daemon.log is below. However, I have no clue how to fix this. From Google searches I found https://github.com/containerd/containerd/discussions/8033, which asks to check whether cri is disabled in /etc/containerd/config.toml - however, it is not in the disabled list.
Mar 14 23:18:56 systemvm systemd[1]: deploy-kube-system.service: Scheduled restart job, restart counter is at 2094.
Mar 14 23:18:56 systemvm systemd[1]: Stopped deploy-kube-system.service.
Mar 14 23:18:56 systemvm systemd[1]: Started deploy-kube-system.service.
Mar 14 23:18:56 systemvm deploy-kube-system[175532]: W0314 23:18:56.560147 175532 initconfiguration.go:120] Usage of CRI endpoints without URL scheme is deprecated and can cause kubelet errors in the future. Automatically prepending scheme "unix" to the "criSocket" with value "/run/containerd/containerd.sock". Please update your configuration!
Mar 14 23:18:56 systemvm deploy-kube-system[175532]: [init] Using Kubernetes version: v1.27.8
Mar 14 23:18:56 systemvm deploy-kube-system[175532]: [preflight] Running pre-flight checks
Mar 14 23:18:56 systemvm deploy-kube-system[175532]: error execution phase preflight: [preflight] Some fatal errors occurred:
Mar 14 23:18:56 systemvm deploy-kube-system[175532]: #011[ERROR CRI]: container runtime is not running: output: time="2024-03-14T23:18:56Z" level=fatal msg="validate service connection: validate CRI v1 runtime API for endpoint \"unix:///run/containerd/containerd.sock\": rpc error: code = Unimplemented desc = unknown service runtime.v1.RuntimeService"
Mar 14 23:18:56 systemvm deploy-kube-system[175532]: , error: exit status 1
Mar 14 23:18:56 systemvm deploy-kube-system[175532]: [preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
Mar 14 23:18:56 systemvm deploy-kube-system[175532]: To see the stack trace of this error execute with --v=5 or higher
Mar 14 23:18:56 systemvm deploy-kube-system[175560]: W0314 23:18:56.727208 175560 initconfiguration.go:120] Usage of CRI endpoints without URL scheme is deprecated and can cause kubelet errors in the future. Automatically prepending scheme "unix" to the "criSocket" with value "/run/containerd/containerd.sock". Please update your configuration!
Mar 14 23:18:56 systemvm deploy-kube-system[175560]: [init] Using Kubernetes version: v1.27.8
Mar 14 23:18:56 systemvm deploy-kube-system[175560]: [preflight] Running pre-flight checks
Mar 14 23:18:56 systemvm deploy-kube-system[175560]: error execution phase preflight: [preflight] Some fatal errors occurred:
Mar 14 23:18:56 systemvm deploy-kube-system[175560]: #011[ERROR CRI]: container runtime is not running: output: time="2024-03-14T23:18:56Z" level=fatal msg="validate service connection: validate CRI v1 runtime API for endpoint \"unix:///run/containerd/containerd.sock\": rpc error: code = Unimplemented desc = unknown service runtime.v1.RuntimeService"
Mar 14 23:18:56 systemvm deploy-kube-system[175560]: , error: exit status 1
Mar 14 23:18:56 systemvm deploy-kube-system[175560]: [preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
Mar 14 23:18:56 systemvm deploy-kube-system[175560]: To see the stack trace of this error execute with --v=5 or higher
Mar 14 23:18:56 systemvm deploy-kube-system[175587]: W0314 23:18:56.893921 175587 initconfiguration.go:120] Usage of CRI endpoints without URL scheme is deprecated and can cause kubelet errors in the future. Automatically prepending scheme "unix" to the "criSocket" with value "/run/containerd/containerd.sock". Please update your configuration!
Mar 14 23:18:56 systemvm deploy-kube-system[175587]: [init] Using Kubernetes version: v1.27.8
Mar 14 23:18:56 systemvm deploy-kube-system[175587]: [preflight] Running pre-flight checks
Mar 14 23:18:57 systemvm deploy-kube-system[175587]: error execution phase preflight: [preflight] Some fatal errors occurred:
Mar 14 23:18:57 systemvm deploy-kube-system[175587]: #011[ERROR CRI]: container runtime is not running: output: time="2024-03-14T23:18:56Z" level=fatal msg="validate service connection: validate CRI v1 runtime API for endpoint \"unix:///run/containerd/containerd.sock\": rpc error: code = Unimplemented desc = unknown service runtime.v1.RuntimeService"
Mar 14 23:18:57 systemvm deploy-kube-system[175587]: , error: exit status 1
Mar 14 23:18:57 systemvm deploy-kube-system[175587]: [preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
Mar 14 23:18:57 systemvm deploy-kube-system[175587]: To see the stack trace of this error execute with --v=5 or higher
Mar 14 23:18:57 systemvm deploy-kube-system[175531]: Error: kubeadm init failed!
Mar 14 23:18:57 systemvm systemd[1]: deploy-kube-system.service: Main process exited, code=exited, status=1/FAILURE
Mar 14 23:18:57 systemvm systemd[1]: deploy-kube-system.service: Failed with result 'exit-code'.
Mar 14 23:18:57 systemvm systemd[1]: deploy-kube-system.service: Scheduled restart job, restart counter is at 2095.
Mar 14 23:18:57 systemvm systemd[1]: Stopped deploy-kube-system.service.
Mar 14 23:18:57 systemvm systemd[1]: Started deploy-kube-system.service.
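For reference, a sketch of the checks mentioned above (tool availability may vary on the CKS node image):

```
# "cri" should NOT appear in the disabled plugins list
grep -n disabled_plugins /etc/containerd/config.toml

# containerd client/server versions
ctr version

# if crictl is installed, validate the CRI endpoint kubeadm is complaining about
crictl --runtime-endpoint unix:///run/containerd/containerd.sock version
```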
@nxsbi
Could you please check the kubeconfig file present at this location on the control node:
root@gh-control-18e40c81d05:~/root/.kube
cat config
Make sure the server IP address points to the router's public IP address, e.g.:
server: https://10.0.57.164:6443
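A quick way to confirm which API server address the kubeconfig points at (a sketch; the path assumes root's default kubeconfig location):

```
kubectl config view --kubeconfig /root/.kube/config -o jsonpath='{.clusters[0].cluster.server}'
```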
@nxsbi what's the containerd version? It might be unsupported by k8s 1.26+.
Refer to https://containerd.io/releases/#kubernetes-support
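A quick way to check it on the control and worker nodes (a sketch):

```
containerd --version
ctr version
```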
@weizhouapache - You are right - the containerd version was 1.5. I manually updated containerd (using apt) on the control and node servers, and it updated to version 1.6.28.
After this, the control plane is available. However, the node still has errors - it seems the manifest never got created there. See the daemon.log from the node.
However, I should not have to do this manually - is this related to the SystemVM version being 4.17.0? I have not yet upgraded to a newer version.
The path /etc/kubernetes/manifest is not present on the node server at all.
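For reference, the manual update was roughly along these lines (a sketch; the exact package name may differ on the CKS image):

```
apt-get update
apt-get install --only-upgrade -y containerd
systemctl restart containerd
containerd --version   # should now report 1.6.x
```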
@nxsbi Yes, the systemvm template for 4.17.0 is too old.
The missing /etc/kubernetes/manifest path is just a message and can be ignored.
4.17 is EOL, please upgrade to 4.19 or 4.18
@weizhouapache - I upgraded the SystemVM to 4.19 -- making progress. The cluster now starts up. I was able to download the kubeconfig file and see the output below. However, I am not able to retrieve the token to access the dashboard. In the older cluster (v1.25) I can see the token, but here it just shows the output below - it seems it's masking it. I also tried variations of the command but was unable to get the token.
kubectl get nodes
NAME STATUS ROLES AGE VERSION
k122-control-18e43529c69 Ready control-plane 84m v1.27.8
k122-node-18e4353e392 Ready <none> 84m v1.27.8
./kubectl --kubeconfig k122.conf get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system cloud-controller-manager-5b8fc87665-6n5xb 1/1 Running 0 78m
kube-system coredns-5d78c9869d-z2zdh 1/1 Running 0 79m
kube-system coredns-5d78c9869d-zvgq9 1/1 Running 0 79m
kube-system etcd-k122-control-18e43529c69 1/1 Running 0 79m
kube-system kube-apiserver-k122-control-18e43529c69 1/1 Running 0 79m
kube-system kube-controller-manager-k122-control-18e43529c69 1/1 Running 0 79m
kube-system kube-proxy-vjxxj 1/1 Running 0 79m
kube-system kube-proxy-vn4ql 1/1 Running 0 79m
kube-system kube-scheduler-k122-control-18e43529c69 1/1 Running 0 79m
kube-system weave-net-rvcql 2/2 Running 0 79m
kube-system weave-net-znt5x 2/2 Running 1 (79m ago) 79m
kubernetes-dashboard dashboard-metrics-scraper-5cb4f4bb9c-8f2g5 1/1 Running 0 79m
kubernetes-dashboard kubernetes-dashboard-6bccb5f4cc-tphwg 1/1 Running 0 79m
./kubectl --kubeconfig k122.conf describe secret $(./kubectl --kubeconfig k122.conf get secrets -n kubernetes-dashboard | grep kubernetes-dashboard-token | awk '{print $1}') -n kubernetes-dashboard
Name: kubernetes-dashboard-certs
Namespace: kubernetes-dashboard
Labels: k8s-app=kubernetes-dashboard
Annotations: <none>
Type: Opaque
Data
====
Name: kubernetes-dashboard-csrf
Namespace: kubernetes-dashboard
Labels: k8s-app=kubernetes-dashboard
Annotations: <none>
Type: Opaque
Data
====
csrf: 256 bytes
Name: kubernetes-dashboard-key-holder
Namespace: kubernetes-dashboard
Labels: <none>
Annotations: <none>
Type: Opaque
Data
====
priv: 1679 bytes
pub: 459 bytes
@nxsbi please refer to #7764
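In case it is useful: on Kubernetes 1.24+ the service account token secrets are no longer auto-created, so the token usually has to be requested explicitly (a sketch, assuming the default dashboard service account name):

```
# generate a short-lived token for the dashboard service account (name assumed)
./kubectl --kubeconfig k122.conf -n kubernetes-dashboard create token kubernetes-dashboard
```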
Awesome - thanks for all the help! I am able to get in!