Cannot deploy cluster-autoscaler with Rancher RKE2
Which component are you using?: cluster-autoscaler / cluster-autoscaler-chart
What version of the component are you using?:
Component version: cluster-autoscaler 1.23.1 / cluster-autoscaler-chart-9.20.0
What k8s version are you using (kubectl version)?:
kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.8", GitCommit:"4a3b558c52eb6995b3c5c1db5e54111bd0645a64", GitTreeState:"clean", BuildDate:"2021-12-15T14:52:11Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.6+rke2r2", GitCommit:"ad3338546da947756e8a88aa6822e9c11e7eac22", GitTreeState:"clean", BuildDate:"2022-04-28T19:13:01Z", GoVersion:"go1.17.9b7", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.21) and server (1.23) exceeds the supported minor version skew of +/-1
What environment is this in?: Dev
What did you expect to happen?: Cluster Autoscaler to deploy with Helm chart to my Rancher RKE2 cluster after the changes from PR 4975.
What happened instead?: This error in the cluster autoscaler pod logs:
F0829 00:36:25.149073 1 main.go:430] Failed to get nodes from apiserver: nodes is forbidden: User "system:serviceaccount:kube-system:default" cannot list resource "nodes" in API group "" at the cluster scope
How to reproduce it (as minimally and precisely as possible):
- Create a Rancher API key (in my case, with no restrictions)
- Create an Opaque Secret with your Rancher cloud-config:
apiVersion: v1
kind: Secret
metadata:
  name: cluster-autoscaler-cloud-config
  namespace: kube-system
type: Opaque
stringData:
  cloud-config: |
    # rancher server credentials
    url: https://rancher.domain.com
    token: [Redacted: token-*:*]
    # name and namespace of the clusters.provisioning.cattle.io resource on the
    # rancher server
    clusterName: my-cluster
    clusterNamespace: fleet-default
    # optional, will be auto-discovered if not specified
    #clusterAPIVersion: v1alpha4
- Deploy the cluster-autoscaler Helm chart with the following values:
autoDiscovery:
  clusterName: my-cluster
  labels: []
  roles:
    - worker
  tags:
    - k8s.io/cluster-autoscaler/enabled
    - k8s.io/cluster-autoscaler/{{ .Values.autoDiscovery.clusterName }}
cloudProvider: rancher
extraVolumeSecrets:
  cluster-autoscaler-cloud-config:
    mountPath: /config
    name: cluster-autoscaler-cloud-config
extraArgs:
  logtostderr: true
  stderrthreshold: info
  v: 4
  cloud-config: /config/cloud-config
  cluster-name: my-cluster
image:
  pullPolicy: IfNotPresent
  pullSecrets: []
  repository: k8s.gcr.io/autoscaling/cluster-autoscaler
  tag: v1.23.1
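For completeness, the secret and chart can then be applied along these lines (the repo alias, release name and file names below are illustrative, not part of the original report):

# apply the cloud-config secret first (filename is illustrative)
kubectl apply -f cluster-autoscaler-cloud-config.yaml

# add the chart repo and install with the values above (release name is illustrative)
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update
helm upgrade --install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --values values.yaml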
Anything else we need to know?: On Rancher 2.6.7 with RKE2 1.23.x. I tried deploying to both the downstream and the management cluster (both on RKE2 1.23.x).
I am wondering if something is going wrong with reading my deployed cloud-config?
@ctrox Do you have any ideas from the above?
From the message in your logs, the autoscaler does not have permission to list nodes, so I'm assuming it is missing some permissions (on the downstream cluster). If the autoscaler is running on the downstream cluster, you need to make sure the service account you set in the deployment has these permissions, but that should happen automatically with the Helm chart default values.
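For reference (not something from this thread), the error shows the pod running as the kube-system:default service account rather than a dedicated one. A minimal sketch of the kind of RBAC needed just for listing nodes would look like the following; the names are illustrative, and the chart's own ClusterRole is broader and is normally created automatically:

# minimal illustrative RBAC; the chart's ClusterRole covers more than this
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-autoscaler-nodes-read   # illustrative name
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-autoscaler-nodes-read   # illustrative name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-autoscaler-nodes-read
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler             # the SA referenced by the deployment
    namespace: kube-system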
Also please note that the cluster-autoscaler 1.23.1 image you linked does not contain the rancher provider; I'm assuming it will be in the next minor release (1.25.0).
Thanks @ctrox. I was wondering about that 1.23.1. I will double-check the service account.
Hi @ctrox , I'm deploying the cluster-autoscaler chart similar to what's described in this issue, although my image.tag is v1.25.0 because the v1.23.1 image does not support the rancher cloud provider. From the autoscaler pod logs, I am seeing the following errors:
pre_filtering_processor.go:57] Node node_name should not be processed by cluster autoscaler (no node group config)
and
clusterstate.go:376] Failed to find readiness information for worker
clusterstate.go:438] Failed to find readiness information for worker
The API calls seem to be successful, since the autoscaler is able to read the node names in the cluster and discovers the node group, based on this log message:
rancher_provider.go:228] scalable node group found: worker (2:6)
I'm on Kubernetes version v1.24.4+rke2r1 for reference, so I think it's possible there's a version mismatch with autoscaler version 1.25.0, but I'm hoping someone can confirm whether that's the reason for the autoscaler failure or if there's something else going on.
The same issue here. cluster-autoscaler version 1.25.0, installed via Helm. Rancher version 2.6.9, RKE2 version 1.24.4+rke2r1. Cluster type: Amazon EC2. Autoscaler cmd:
./cluster-autoscaler --cloud-provider=rancher --namespace=kube-system --nodes=3:5:cpu-worker --cloud-config=/config/cloud-config --logtostderr=true --stderrthreshold=info --v=4
cluster-autoscaler is terminating with the error "Failed to find readiness information for cpu-worker" (exit code 137). "cpu-worker" is the name of a pool in the Rancher cluster.
Please see cluster-autoscaler log in attached file.
Can you try without the --nodes flag? The node groups are discovered dynamically (using the annotations on the machinePool) so this flag should not be needed.
Also can you tell me the ProviderID of one of the nodes in the pool cpu-worker?
$ kubectl describe node <node> | grep ProviderID
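Regarding the dynamic discovery mentioned above: as far as I recall from the rancher provider README, node groups are enabled by annotating the machine pools in the clusters.provisioning.cattle.io resource. A rough fragment is sketched below; the field and annotation names are from memory and should be verified against that README.

# fragment of a clusters.provisioning.cattle.io spec -- field and annotation
# names from memory, verify against the cluster-autoscaler rancher provider README
spec:
  rkeConfig:
    machinePools:
      - name: cpu-worker
        quantity: 3
        machineDeploymentAnnotations:
          cluster.provisioning.cattle.io/autoscaler-min-size: "3"
          cluster.provisioning.cattle.io/autoscaler-max-size: "5"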
I just verified here that cluster-autoscaler v1.25.0 runs fine with an RKE2 cluster, even with a much older version, v1.21.14+rke2r1. I'm also on Rancher 2.6.9.
Hello @ctrox ,
Without the --nodes flag the result is the same. Here is a snippet from the cluster-autoscaler log:
I1107 17:39:19.455773 37 klogx.go:86] Pod ci-test/node-example-main-657d4bb7f4-fqwvn is unschedulable
I1107 17:39:19.455810 37 scale_up.go:375] Upcoming 3 nodes
W1107 17:39:19.455830 37 clusterstate.go:376] Failed to find readiness information for cpu-worker
W1107 17:39:19.455844 37 clusterstate.go:376] Failed to find readiness information for cpu-worker
W1107 17:39:19.455870 37 scale_up.go:395] Node group cpu-worker is not ready for scaleup - unhealthy
I1107 17:39:19.455889 37 scale_up.go:462] No expansion option
And the output of kubectl describe node i-0d33022be1ed6ac78.eu-central-1.compute.internal | grep ProviderID:
ProviderID: aws:///eu-central-1a/i-0d33022be1ed6ac78
Aha, now it makes sense why it does not work with your EC2-backed cluster. This is a bit weird; it looks like I (wrongly) assumed Rancher would always set the ProviderID in a consistent way, no matter which backend node driver is used.
Just to be sure, you created your cluster with EC2 using RKE2 like so?

Would you mind sharing a full node object with kubectl get node <node> -o yaml? For a potential fix, I need to see if there's a way to figure out the node pool name from the node object.
Q: The Rancher cloud provider is not yet supported in the Helm chart, right?
Hello @ctrox ,
Yes, the cluster type is definitely RKE2 on EC2. Please see the screenshot.

Also please see the node manifest in the attached file. The file has a txt extension because attaching yaml is not supported; simply change the extension from txt to yml if you want: node.txt
Q: The Rancher cloud provider is not yet supported in the Helm chart, right?
I have not tested it but I think it should work with the helm chart. You just need to set a few values like cloudProvider: rancher and cloudConfigPath.
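For illustration, a minimal values fragment along those lines; whether cloudConfigPath alone is enough to mount the secret may depend on the chart version, and the extraVolumeSecrets approach shown at the top of this issue is what was actually used here:

# illustrative values fragment; see the full example at the top of this issue
cloudProvider: rancher
cloudConfigPath: /config/cloud-config
extraVolumeSecrets:
  cluster-autoscaler-cloud-config:
    mountPath: /config
    name: cluster-autoscaler-cloud-config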
Thanks @nugzarg, I can think of a possible fix but I'm not yet sure when I will have time for that. I will look into it more on Friday.
Thanks @ctrox .
Maybe this information is also relevant: if I view the node through the Rancher API (the older 2.5-style Rancher UI, which is hidden but still present), the nodePoolId key is empty and there is no nodePoolname key.

Hey @ctrox, thanks for taking a look at this! What infrastructure provider are you using in your test environment, if any? Our clusters are being created on vSphere and running kubectl describe node node-name | grep ProviderID gets us ProviderID: vsphere://some-long-random-string that doesn't seem to follow any specific convention. Sounds like this issue will manifest if we use any cloud infrastructure provider?
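Side note (not from the thread): a quick way to compare ProviderID formats across all nodes in a cluster is a jsonpath query like this:

# print each node name next to its ProviderID
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.providerID}{"\n"}{end}'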
I'm using a custom node driver which is not built-in. My guess is that only the ones that don't have a cloud provider in Rancher get a ProviderID of the form rke2://. Anyway, it's clear to me that the autoscaler should not rely on the ProviderID anymore. I'm close to finishing up a PR; it will probably be done sometime next week.