cloudstack-kubernetes-provider
Unable to auto-scale Kubernetes cluster
Hi!
I am unable to auto-scale Kubernetes clusters. As I understand it, enabling auto-scaling creates a "cluster-autoscaler" deployment that decides whether or not to scale the cluster. However, it does not seem to work: the pod logs multiple errors and warnings, even though this is a completely clean cluster.
Normal scaling seems to work just fine.
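For reference, the deployment mentioned above can be checked directly; CKS places it in kube-system (the deployment name is assumed from the pod name shown later in this thread):
kubectl -n kube-system get deployment cluster-autoscaler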
Setup
A "default" CloudStack setup 4.18 running KVMs.
Settings (relevant)
- cloud.kubernetes.service.enabled = true
- cloud.kubernetes.cluster.experimental.features.enabled = true
- cloud.kubernetes.cluster.max.size = 50
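These values can be double-checked from the API as well; a quick sketch, assuming the CloudMonkey CLI (cmk) is configured against the management server:
# list the relevant CKS global settings and their current values
cmk list configurations name=cloud.kubernetes.service.enabled
cmk list configurations name=cloud.kubernetes.cluster.experimental.features.enabled
cmk list configurations name=cloud.kubernetes.cluster.max.size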
The nodes use the following service offering:
- 2 CPUs x 2.05 GHz
- 2048 MB memory
- 8 GB root disk
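For completeness, an equivalent compute offering can be created from the CLI too (a sketch; the offering name and display text are placeholders, and root disk size is left out here):
# create a 2 vCPU / 2.05 GHz / 2048 MB compute offering
cmk create serviceoffering name=cks-worker displaytext="2x2.05GHz, 2GB RAM" cpunumber=2 cpuspeed=2050 memory=2048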
Replicate
- Create a new cluster using the Kubernetes 1.24 ISO found here: http://download.cloudstack.org/cks/
- Enable forced auto-scaling. Since the cluster starts with only one worker node, auto-scaling with 3-5 nodes should trigger an upscale (I assume).
- Check the logs for cluster-autoscaler in the Kubernetes cluster. Some notable entries:
E0807 14:41:30.317148 1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.CSIDriver: failed to list *v1.CSIDriver: csidrivers.storage.k8s.io is forbidden: User "system:serviceaccount:kube-system:cluster-autoscaler" cannot list resource "csidrivers" in API group "storage.k8s.io" at the cluster scope
E0807 14:41:32.388828 1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1beta1.CSIStorageCapacity: failed to list *v1beta1.CSIStorageCapacity: csistoragecapacities.storage.k8s.io is forbidden: User "system:serviceaccount:kube-system:cluster-autoscaler" cannot list resource "csistoragecapacities" in API group "storage.k8s.io" at the cluster scope
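These two errors look like RBAC gaps: the cluster-autoscaler service account is not allowed to list CSIDriver/CSIStorageCapacity objects. A quick way to confirm, assuming the default kube-system/cluster-autoscaler service account and deployment name:
# full log, plus an explicit permission check for the two forbidden resources
kubectl -n kube-system logs deploy/cluster-autoscaler
kubectl auth can-i list csidrivers.storage.k8s.io --as=system:serviceaccount:kube-system:cluster-autoscaler
kubectl auth can-i list csistoragecapacities.storage.k8s.io --as=system:serviceaccount:kube-system:cluster-autoscaler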
Even though I have not edited anything myself (just a clean CKS cluster), I get these weird logs:
W0807 14:41:43.251280 1 clusterstate.go:590] Failed to get nodegroup for 6a4c91a3-9694-4596-9ddd-dc86e60136ff: Unable to find node 6a4c91a3-9694-4596-9ddd-dc86e60136ff in cluster
W0807 14:41:43.251361 1 clusterstate.go:590] Failed to get nodegroup for bd0b855f-6dc6-4678-9bea-b52329333024: Unable to find node bd0b855f-6dc6-4678-9bea-b52329333024 in cluster
I0807 14:57:06.667061 1 static_autoscaler.go:341] 2 unregistered nodes present
The IDs are correct in CloudStack
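The "unregistered nodes" warnings mean the autoscaler sees VM IDs from CloudStack that it cannot map back to Kubernetes node objects. One way to compare the two sides, assuming the provider records the VM UUID in each node's providerID:
# Kubernetes side: node names and providerIDs
kubectl get nodes -o custom-columns=NAME:.metadata.name,PROVIDER_ID:.spec.providerID
# CloudStack side: the VMs the CKS cluster believes it owns (cluster UUID elided)
cmk list kubernetesclusters id=<cluster-uuid>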
The entire log: logs-from-cluster-autoscaler-in-cluster-autoscaler-5bf887ddd8-hxg2g.log
Please tell me if you need more logs to look at, or if I should try some other configuration.
Thanks!
cc @Pearl1594 @weizhouapache @DaanHoogland please help triage when you have time
@saffronjam this looks like an issue with k8s autoscaler (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/cloudstack/README.md) or with CKS (upstream https://github.com/apache/cloudstack)
Hi @saffronjam
The autoscaling feature works fine on a k8s cluster deployed by CKS.
Please find the steps that I have followed:
After you enable autoscaling on the cluster, make sure the autoscaler pod is deployed in the cluster:
kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system cluster-autoscaler-8d8894d6c-q8r4h 1/1 Running 0 19m
Before scaling
➜ ~ k get nodes -A
NAME STATUS ROLES AGE VERSION
gh-control-18debd77e18 Ready control-plane 10h v1.28.4
gh-node-18debd8440c Ready <none> 10h v1.28.4
Deploy an application:
kubectl create deployment hello-node --image=registry.k8s.io/e2e-test-images/agnhost:2.39 -- /agnhost netexec --http-port=80
➜ ~ k get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default hello-node-7c6c5fb9d8-bgd69 1/1 Running 0 10h
Scale the application
kubectl scale --replicas=150 deployment/hello-node
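If the scale-up ever fails to trigger with this image, giving the pods explicit resource requests makes the resource pressure unambiguous to the autoscaler (a sketch; the request values are illustrative):
kubectl set resources deployment hello-node --requests=cpu=100m,memory=64Mi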
Logs from the autoscaler pod:
I0228 04:51:46.798087 1 reflector.go:536] /home/djumani/lab/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:356: Watch close - *v1.StatefulSet total 9 items received
I0228 04:51:51.244004 1 static_autoscaler.go:235] Starting main loop
I0228 04:51:51.244382 1 client.go:169] NewAPIRequest API request URL:http://10.0.34.2:8080/client/api?apiKey=***&command=listKubernetesClusters&id=14b42c5d-e7e6-4c41-b638-5facb98b0a93&response=json&signature=***
I0228 04:51:51.279721 1 client.go:175] NewAPIRequest response status code:200
I0228 04:51:51.280798 1 cloudstack_manager.go:88] Got cluster : &{14b42c5d-e7e6-4c41-b638-5facb98b0a93 gh 2 3 1 1 [0xc0013bfad0 0xc0013bfb00] map[gh-control-18debd77e18:0xc0013bfad0 gh-node-18debd8440c:0xc0013bfb00]}
W0228 04:51:51.292009 1 clusterstate.go:590] Failed to get nodegroup for dc95f481-15a3-4629-bb78-055fbe4a7139: Unable to find node dc95f481-15a3-4629-bb78-055fbe4a7139 in cluster
W0228 04:51:51.292052 1 clusterstate.go:590] Failed to get nodegroup for facdd040-53fe-4984-8654-c186a7cdde9b: Unable to find node facdd040-53fe-4984-8654-c186a7cdde9b in cluster
I0228 04:51:51.292095 1 static_autoscaler.go:341] 2 unregistered nodes present
I0228 04:51:51.292105 1 static_autoscaler.go:624] Removing unregistered node dc95f481-15a3-4629-bb78-055fbe4a7139
W0228 04:51:51.292126 1 static_autoscaler.go:627] Failed to get node group for dc95f481-15a3-4629-bb78-055fbe4a7139: Unable to find node dc95f481-15a3-4629-bb78-055fbe4a7139 in cluster
W0228 04:51:51.292137 1 static_autoscaler.go:346] Failed to remove unregistered nodes: Unable to find node dc95f481-15a3-4629-bb78-055fbe4a7139 in cluster
I0228 04:51:51.292569 1 filter_out_schedulable.go:65] Filtering out schedulables
I0228 04:51:51.292590 1 filter_out_schedulable.go:137] Filtered out 0 pods using hints
I0228 04:51:51.624523 1 filter_out_schedulable.go:175] 44 pods were kept as unschedulable based on caching
I0228 04:51:51.624568 1 filter_out_schedulable.go:176] 0 pods marked as unschedulable can be scheduled.
I0228 04:51:51.624667 1 filter_out_schedulable.go:87] No schedulable pods
I0228 04:51:51.870314 1 static_autoscaler.go:480] Calculating unneeded nodes
I0228 04:51:51.870353 1 pre_filtering_processor.go:66] Skipping gh-control-18debd77e18 - node group min size reached
I0228 04:51:51.870361 1 pre_filtering_processor.go:66] Skipping gh-node-18debd8440c - node group min size reached
I0228 04:51:51.870413 1 static_autoscaler.go:534] Scale down status: unneededOnly=false lastScaleUpTime=2024-02-27 17:39:33.032517071 +0000 UTC m=-3594.093061754 lastScaleDownDeleteTime=2024-02-27 17:39:33.032517071 +0000 UTC m=-3594.093061754 lastScaleDownFailTime=2024-02-27 17:39:33.032517071 +0000 UTC m=-3594.093061754 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false
I0228 04:52:22.258545 1 scale_up.go:468] Best option to resize: 14b42c5d-e7e6-4c41-b638-5facb98b0a93
I0228 04:52:22.258602 1 scale_up.go:472] Estimated 1 nodes needed in 14b42c5d-e7e6-4c41-b638-5facb98b0a93
I0228 04:52:22.266675 1 scale_up.go:595] Final scale-up plan: [{14b42c5d-e7e6-4c41-b638-5facb98b0a93 1->2 (max: 3)}]
I0228 04:52:22.266915 1 scale_up.go:691] Scale-up: setting group 14b42c5d-e7e6-4c41-b638-5facb98b0a93 size to 2
I0228 04:52:22.267040 1 cloudstack_node_group.go:57] Increase Cluster : 14b42c5d-e7e6-4c41-b638-5facb98b0a93 by 1
I0228 04:52:22.267238 1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"3e317689-2939-4a66-b764-b2bb938c433c", APIVersion:"v1", ResourceVersion:"75712", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: setting group 14b42c5d-e7e6-4c41-b638-5facb98b0a93 size to 2 instead of 1 (max: 3)
I0228 04:52:22.267350 1 client.go:169] NewAPIRequest API request URL:http://10.0.34.2:8080/client/api?apiKey=***&command=scaleKubernetesCluster&id=14b42c5d-e7e6-4c41-b638-5facb98b0a93&response=json&size=2&signature=***
I0228 04:52:22.297307 1 client.go:175] NewAPIRequest response status code:200
I0228 04:52:28.385682 1 reflector.go:536] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Node total 10 items received
I0228 04:52:32.324971 1 client.go:169] NewAPIRequest API request URL:http://10.0.34.2:8080/client/api?apiKey=***&command=queryAsyncJobResult&jobid=4e62a5a3-825c-435e-a6df-c22e756ee5e4&response=json&signature=***
I0228 04:52:32.346120 1 client.go:175] NewAPIRequest response status code:200
I0228 04:52:32.360171 1 client.go:110] Still waiting for job 4e62a5a3-825c-435e-a6df-c22e756ee5e4 to complete
I0228 04:52:33.993372 1 reflector.go:536] /home/djumani/lab/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:188: Watch close - *v1.Pod total 306 items received
I0228 04:52:42.328416 1 client.go:169] NewAPIRequest API request URL:http://10.0.34.2:8080/client/api?apiKey=***&command=queryAsyncJobResult&jobid=4e62a5a3-825c-435e-a6df-c22e756ee5e4&response=json&signature=***
I0228 04:52:42.357795 1 client.go:175] NewAPIRequest response status code:200
I0228 04:52:52.356394 1 client.go:110] Still waiting for job 4e62a5a3-825c-435e-a6df-c22e756ee5e4 to complete
After scaling
➜ ~ k get nodes -A
NAME STATUS ROLES AGE VERSION
gh-control-18debd77e18 Ready control-plane 10h v1.28.4
gh-node-18debd8440c Ready <none> 10h v1.28.4
gh-node-18dee0e78ba Ready <none> 42s v1.28.4
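To exercise scale-down as well, shrink the deployment back and wait; the autoscaler should eventually mark the extra node as unneeded and remove it (subject to its scale-down delay):
kubectl scale --replicas=1 deployment/hello-node
kubectl -n kube-system logs deploy/cluster-autoscaler --tail=50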
I notice @saffronjam used a regular user account to deploy the CKS cluster. Could that be related to the issue?