
Cannot deploy cluster-autoscaler with Rancher RKE2

Open · bennysp opened this issue 3 years ago

Which component are you using?: cluster-autoscaler / cluster-autoscaler-chart

What version of the component are you using?:

Component version: cluster-autoscaler 1.23.1 / cluster-autoscaler-chart 9.20.0

What k8s version are you using (kubectl version)?:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.8", GitCommit:"4a3b558c52eb6995b3c5c1db5e54111bd0645a64", GitTreeState:"clean", BuildDate:"2021-12-15T14:52:11Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.6+rke2r2", GitCommit:"ad3338546da947756e8a88aa6822e9c11e7eac22", GitTreeState:"clean", BuildDate:"2022-04-28T19:13:01Z", GoVersion:"go1.17.9b7", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.21) and server (1.23) exceeds the supported minor version skew of +/-1

What environment is this in?: Dev

What did you expect to happen?: Cluster Autoscaler to deploy with the Helm chart to my Rancher RKE2 cluster after the changes from PR 4975.

What happened instead?: This error in the cluster autoscaler pod logs:

F0829 00:36:25.149073 1 main.go:430] Failed to get nodes from apiserver: nodes is forbidden: User "system:serviceaccount:kube-system:default" cannot list resource "nodes" in API group "" at the cluster scope
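(A quick sanity check on the RBAC side, using the service account named in the error; this should print "yes" once permissions are in place:)

kubectl auth can-i list nodes --as=system:serviceaccount:kube-system:default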

How to reproduce it (as minimally and precisely as possible):

  1. Create a Rancher API key (in my case, with no restrictions)
  2. Create an Opaque Secret with your Rancher cloud-config:
apiVersion: v1
kind: Secret
metadata:
  name: cluster-autoscaler-cloud-config
  namespace: kube-system
type: Opaque
stringData:
  cloud-config: |
    # rancher server credentials
    url: https://rancher.domain.com
    token: [Redacted: token-*:*]
    # name and namespace of the clusters.provisioning.cattle.io resource on the
    # rancher server
    clusterName: my-cluster
    clusterNamespace: fleet-default
    # optional, will be auto-discovered if not specified
    #clusterAPIVersion: v1alpha4
  3. Use the helm values below (a sketch of the install command follows the values):
autoDiscovery:
  clusterName: my-cluster
  labels: []
  roles:
    - worker
  tags:
    - k8s.io/cluster-autoscaler/enabled
    - k8s.io/cluster-autoscaler/{{ .Values.autoDiscovery.clusterName }}
cloudProvider: rancher
extraVolumeSecrets:
  cluster-autoscaler-cloud-config:
    mountPath: /config
    name: cluster-autoscaler-cloud-config
extraArgs:
  logtostderr: true
  stderrthreshold: info
  v: 4
  cloud-config: /config/cloud-config
  cluster-name: my-cluster
image:
  pullPolicy: IfNotPresent
  pullSecrets: []
  repository: k8s.gcr.io/autoscaling/cluster-autoscaler
  tag: v1.23.1
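For completeness, a sketch of how these values get installed with the upstream chart (the release name and values file name are placeholders; the chart repo is the kubernetes/autoscaler one):

helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update
# values.yaml contains the values shown above
helm install cluster-autoscaler autoscaler/cluster-autoscaler --namespace kube-system --values values.yaml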

Anything else we need to know?: On Rancher 2.6.7 with RKE2 1.23.x. Tried to deploy to downstream and management clusters (both on RKE2 1.23.x)

I am wondering if something is wrong with how my deployed cloud-config is being read?

bennysp avatar Aug 29 '22 00:08 bennysp

@ctrox Do you have any ideas from the above?

bennysp avatar Aug 30 '22 02:08 bennysp

From the message in your logs, the autoscaler does not have permission to list nodes, so I'm assuming it is missing some RBAC permissions (on the downstream cluster). If the autoscaler is running on the downstream cluster, you need to make sure the service account you set in the deployment has these permissions, but that should happen automatically with the helm chart default values.
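If you've overridden the chart defaults, the relevant values look roughly like this (a sketch; value names as in the chart, defaults shown are assumptions):

rbac:
  create: true
  serviceAccount:
    create: true
    name: ""  # empty lets the chart generate a name; your error shows the pod fell back to "default"

You can also check which service account the deployment actually ended up with (the deployment name depends on your release):

kubectl -n kube-system get deploy <cluster-autoscaler-deployment> -o jsonpath='{.spec.template.spec.serviceAccountName}'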

Also please note that cluster-autoscaler 1.23.1 that you linked does not contain the rancher provider; I'm assuming it will be in the next minor release (1.25.0).

ctrox avatar Aug 30 '22 06:08 ctrox

Thanks @ctrox . I was wondering about that 1.23.1. I will double check the service account.

bennysp avatar Aug 31 '22 19:08 bennysp

Hi @ctrox, I'm deploying the cluster-autoscaler chart similar to what's described in this issue, although my image.tag is v1.25.0 because the v1.23.1 image does not support the rancher cloud provider. From the autoscaler pod logs, I am seeing the following errors:

pre_filtering_processor.go:57] Node node_name should not be processed by cluster autoscaler (no node group config)
clusterstate.go:376] Failed to find readiness information for worker
clusterstate.go:438] Failed to find readiness information for worker

The API calls seem to be successful, since the autoscaler is able to read the node names in the cluster and discover the node group, based on the log message: rancher_provider.go:228] scalable node group found: worker (2:6)

I'm on kubernetes version v1.24.4+rke2r1 for reference, so I think it's possible that there's a version mismatch with the autoscaler version 1.25.0, but I'm hoping someone can confirm whether that's the reason for the autoscaler failure or if there's something else going on.

jameswu2 avatar Oct 10 '22 20:10 jameswu2

The same issue here. cluster-autoscaler version 1.25.0, installed via helm. rancher version 2.6.9, rke2 version 1.24.4+rke2r1. Cluster type Amazon EC2. Autoscaler cmd:

./cluster-autoscaler --cloud-provider=rancher --namespace=kube-system --nodes=3:5:cpu-worker --cloud-config=/config/cloud-config --logtostderr=true --stderrthreshold=info --v=4

Cluster-autoscaler is terminating with the error "Failed to find readiness information for cpu-worker" (exit code 137). "cpu-worker" is the name of the pool in the rancher cluster.
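(Side note: exit code 137 just means the container got SIGKILL, e.g. from a failed liveness probe or an OOM kill; the recorded reason can be checked like this, with the pod name as a placeholder:)

kubectl -n kube-system describe pod <cluster-autoscaler-pod> | grep -A 5 'Last State'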

Please see the cluster-autoscaler log in the attached file.

cluster-autoscaler-rancher.log

nugzarg avatar Nov 07 '22 07:11 nugzarg

Can you try without the --nodes flag? The node groups are discovered dynamically (using the annotations on the machinePool) so this flag should not be needed.
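For reference, that discovery reads min/max size annotations from the machine pools on the provisioning cluster object, roughly like this (a sketch from memory of the rancher provider README; the annotation keys are the important part, the sizes are just examples):

apiVersion: provisioning.cattle.io/v1
kind: Cluster
metadata:
  name: my-cluster
  namespace: fleet-default
spec:
  rkeConfig:
    machinePools:
      - name: cpu-worker
        machineDeploymentAnnotations:
          cluster.provisioning.cattle.io/autoscaler-min-size: "3"
          cluster.provisioning.cattle.io/autoscaler-max-size: "5"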

Also can you tell me the ProviderID of one of the nodes in the pool cpu-worker?

$ kubectl describe node <node> | grep ProviderID

I just verified here that cluster-autoscaler v1.25.0 runs fine with an RKE2 cluster, even with a way older version, v1.21.14+rke2r1. I'm also on Rancher 2.6.9.

ctrox avatar Nov 07 '22 14:11 ctrox

Hello @ctrox ,

Without the --nodes flag the result is the same. Here is a snippet from the cluster-autoscaler log:

I1107 17:39:19.455773 37 klogx.go:86] Pod ci-test/node-example-main-657d4bb7f4-fqwvn is unschedulable
I1107 17:39:19.455810 37 scale_up.go:375] Upcoming 3 nodes
W1107 17:39:19.455830 37 clusterstate.go:376] Failed to find readiness information for cpu-worker
W1107 17:39:19.455844 37 clusterstate.go:376] Failed to find readiness information for cpu-worker
W1107 17:39:19.455870 37 scale_up.go:395] Node group cpu-worker is not ready for scaleup - unhealthy
I1107 17:39:19.455889 37 scale_up.go:462] No expansion option

And the output of the command kubectl describe node i-0d33022be1ed6ac78.eu-central-1.compute.internal | grep ProviderID:

ProviderID: aws:///eu-central-1a/i-0d33022be1ed6ac78

nugzarg avatar Nov 07 '22 17:11 nugzarg

ProviderID: aws:///eu-central-1a/i-0d33022be1ed6ac78

Aha, it makes sense now why it does not work with your EC2-backed cluster. This is a bit weird; it looks like I (wrongly) assumed rancher would always set the ProviderID in a consistent way, no matter which backend node driver is used.

Just to be sure, you created your cluster with EC2 using RKE2 like so?

(screenshot: creating an RKE2 cluster with Amazon EC2)

Would you mind sharing a full node object with kubectl get node <node> -o yaml? For a potential fix, I need to see if there's a way to figure out the node pool name from the node object.
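(If you want a quick look yourself first, grepping the node object for pool- or machine-related metadata may already hint at it, though the full yaml is what I need:)

kubectl get node <node> -o yaml | grep -iE 'pool|machine'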

ctrox avatar Nov 08 '22 16:11 ctrox

Q: Rancher cloud-provider is not yet supported in the Helm chart, right?

eliaskoromilas avatar Nov 08 '22 16:11 eliaskoromilas

Hello @ctrox ,

Yes, the cluster type is definitely RKE2 on EC2. Please see the screenshot. (screenshot: cluster details showing RKE2 on Amazon EC2)

Also please see the node manifest in the attached file. The file has a txt extension because attaching yaml is not supported; simply change the txt extension to yml if you want: node.txt

nugzarg avatar Nov 09 '22 05:11 nugzarg

Q: Rancher cloud-provider is not yet supported in the Helm chart, right?

I have not tested it but I think it should work with the helm chart. You just need to set a few values like cloudProvider: rancher and cloudConfigPath.
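Untested, but something like this (value names per the chart; the cloud-config mount mirrors the secret setup from earlier in this thread):

cloudProvider: rancher
cloudConfigPath: /config/cloud-config
extraVolumeSecrets:
  cluster-autoscaler-cloud-config:
    mountPath: /config
    name: cluster-autoscaler-cloud-config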

Thanks @nugzarg, I can think of a possible fix but I'm not yet sure when I will have time for that. I will look into it more on Friday.

ctrox avatar Nov 09 '22 08:11 ctrox

Thanks @ctrox. Maybe this information is also relevant: if I let rancher show the node in the API (in the older 2.5 version of the rancher UI, which is hidden but still present), the nodePoolId key is empty and there is no nodePoolname key. (screenshot: node object in the rancher API view)

nugzarg avatar Nov 09 '22 13:11 nugzarg

Hey @ctrox, thanks for taking a look at this! What infrastructure provider are you using in your test environment, if any? Our clusters are being created on vSphere, and running kubectl describe node node-name | grep ProviderID gets us ProviderID: vsphere://some-long-random-string, which doesn't seem to follow any specific convention. It sounds like this issue will manifest if we use any cloud infrastructure provider?

jameswu2 avatar Nov 09 '22 14:11 jameswu2

I'm using a custom node driver which is not built-in. My guess is that only the ones that don't have a Cloud Provider in Rancher get a ProviderID of the form rke2://. Anyway, it's clear to me that the autoscaler should not rely on the ProviderID anymore. I'm close to finishing up a PR; it will probably be done sometime next week.

ctrox avatar Nov 11 '22 16:11 ctrox