
On-premise scaling to AWS

Open · nemcikjan opened this issue on Mar 14, 2023 · 13 comments

Which component are you using?:

cluster autoscaler

Describe the solution you'd like.:

I need to scale out from an on-premise k8s cluster into AWS and I'm not able to get it working. I created an ASG and provided sufficient AWS credentials. When I deploy a pod with a node selector matching the ASG labels, no instance is spun up. Moreover, the cluster autoscaler pod is automatically restarted roughly every 15 minutes. Any ideas/suggestions on how to get this working? Many thanks
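
For context: as far as I understand, while an ASG is still at zero instances the autoscaler can only infer node labels from ASG tags prefixed with k8s.io/cluster-autoscaler/node-template/label/, and a pending pod can only trigger a scale-up of that ASG if its nodeSelector is satisfied by those inferred labels. A minimal Go sketch of that matching rule (the tag key and selector values below are made up):

```go
package main

import (
	"fmt"
	"strings"
)

// Illustrative sketch only, not cluster-autoscaler source: derive template
// node labels from ASG tags and check a pod's nodeSelector against them.
const nodeTemplateLabelPrefix = "k8s.io/cluster-autoscaler/node-template/label/"

// labelsFromASGTags extracts node labels from ASG tags carrying the prefix.
func labelsFromASGTags(tags map[string]string) map[string]string {
	labels := map[string]string{}
	for k, v := range tags {
		if strings.HasPrefix(k, nodeTemplateLabelPrefix) {
			labels[strings.TrimPrefix(k, nodeTemplateLabelPrefix)] = v
		}
	}
	return labels
}

// selectorMatches reports whether every nodeSelector entry is present in labels.
func selectorMatches(nodeSelector, labels map[string]string) bool {
	for k, v := range nodeSelector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical ASG tag and pod nodeSelector, just to show the matching rule.
	asgTags := map[string]string{
		nodeTemplateLabelPrefix + "node-role.example.com/burst": "aws",
	}
	podSelector := map[string]string{"node-role.example.com/burst": "aws"}

	fmt.Println(selectorMatches(podSelector, labelsFromASGTags(asgTags))) // true
}
```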

Additional context.: This is the log output that keeps repeating until the pod restarts.

I0314 18:51:56.626978       1 auto_scaling_groups.go:386] Regenerating instance to ASG map for ASGs: [<asg_name>]
I0314 18:51:56.742505       1 auto_scaling_groups.go:154] Registering ASG <asg_name>
I0314 18:51:56.742559       1 aws_wrapper.go:281] 0 launch configurations to query
I0314 18:51:56.742572       1 aws_wrapper.go:282] 1 launch templates to query
I0314 18:51:56.742591       1 aws_wrapper.go:298] Successfully queried 0 launch configurations
I0314 18:51:56.778379       1 aws_wrapper.go:309] Successfully queried 1 launch templates
I0314 18:51:56.778438       1 auto_scaling_groups.go:435] Extracted autoscaling options from "<asg_name>" ASG tags: map[]
I0314 18:51:56.778464       1 aws_manager.go:266] Refreshed ASG list, next refresh after 2023-03-14 18:52:56.778457442 +0000 UTC m=+61.554755784
I0314 18:51:56.778928       1 main.go:305] Registered cleanup signal handler
I0314 18:51:56.779120       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0314 18:51:56.779164       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 18.439µs
I0314 18:52:06.780107       1 static_autoscaler.go:235] Starting main loop
E0314 18:52:06.782121       1 static_autoscaler.go:290] Failed to get node infos for groups: wrong id: expected format aws:///<zone>/<name>, got rke-2://<node_name>

nemcikjan commented Mar 14, 2023

What happens:

  1. Your cluster autoscaler is configured to work with AWS.
  2. It processes the existing nodes in the cluster and tries to extract specific information from them (in order to build template nodes for its simulations), based on the node.Spec.ProviderID field of the Node object.
  3. Since these nodes are on-prem (using Rancher K8s, if I get it right), the node.Spec.ProviderID does not match the AWS-valid format aws:///<zone>/<name>, so the autoscaler fails (see the sketch below).
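
A rough illustration of that format check (a minimal sketch, not the actual autoscaler code; the instance ID, zone, and node name are made up):

```go
package main

import (
	"fmt"
	"regexp"
)

// Minimal sketch of the check implied by the error above. The real AWS
// provider has its own ProviderID parsing; this only illustrates why an
// rke2-style ID is rejected.
var awsProviderIDRe = regexp.MustCompile(`^aws:///[^/]+/[^/]+$`)

func checkProviderID(providerID string) error {
	if !awsProviderIDRe.MatchString(providerID) {
		return fmt.Errorf("wrong id: expected format aws:///<zone>/<name>, got %s", providerID)
	}
	return nil
}

func main() {
	// An EC2-backed node carries an AWS-formatted ProviderID and passes:
	fmt.Println(checkProviderID("aws:///eu-central-1a/i-0123456789abcdef0")) // <nil>
	// An on-prem rke2 node does not, and fails with the message from the log:
	fmt.Println(checkProviderID("rke2://my-onprem-node"))
}
```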

I think the cluster autoscaler currently cannot support multi-platform clusters; I will let someone provide an authoritative confirmation on this and say whether there are any plans to offer support in future versions.

gregth commented Mar 15, 2023

@gregth Thank you for your response. I looked into the code and came to the same conclusion: multi-platform clusters, in this case on-prem rke2 and EC2 nodes, are not supported.

nemcikjan commented Mar 15, 2023

Which version of cluster-autoscaler are you running? This should work now with rke2 nodes on AWS since https://github.com/kubernetes/autoscaler/pull/5361, which should have made it into cluster-autoscaler-1.26.0.

ctrox commented Mar 28, 2023

@ctrox I was running v1.24. I just tried v1.26 but it's still the same. But I think you misunderstood our intentions. We are running rke2 nodes in an on-premise cluster (not using Rancher, just plain rke2) and want to scale out to run rke2 nodes in AWS, so we are not running rke2 nodes in AWS yet. The question is whether it would help to have an arbiter node running in AWS on EC2.

nemcikjan commented Apr 3, 2023

@JanNemcik Ah right, I have misunderstood your setup then. Not sure if what you are trying to do is supported yet.

ctrox commented Apr 3, 2023

@ctrox do you think this use case makes sense, and is it possible that it will be implemented in the future?

nemcikjan commented Apr 4, 2023

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented Jul 3, 2023

Hey! Did you manage to make this work?

voliveira-tmx commented Oct 25, 2023

@voliveira-tmx nope

nemcikjan commented Oct 25, 2023

This is a very interesting use case that I wanted to implement. I haven't tried it myself, though; I was trying to gather some knowledge first, but I haven't been able to find any practical examples of how to set this up. Any updates on this, @ctrox?

voliveira-tmx commented Oct 25, 2023

This is not something I'm working on, as it affects the AWS provider of cluster-autoscaler and might even need changes in the core cluster-autoscaler to make this work. I just maintain the rancher provider, which is not involved here; I only thought it was at the beginning.

ctrox commented Oct 26, 2023

This shouldn't even be AWS-specific. As @ctrox mentioned, it'd definitely require changes in the core codebase so it could support other vendors as well. I think it's not worth doing this for each provider separately, because it'd require a lot of additional work. Maybe I should close this issue and create a new one with a more generic description of the problem.

nemcikjan commented Oct 26, 2023

/remove-lifecycle stale

Shubham82 commented Nov 28, 2023

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented Feb 26, 2024

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented Apr 20, 2024

/remove-lifecycle rotten

Shubham82 commented May 7, 2024

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented Aug 5, 2024

This shouldn't even be AWS-specific. As @ctrox mentioned, it'd definitely require changes in the core codebase so it could support other vendors as well. I think it's not worth doing this for each provider separately, because it'd require a lot of additional work. Maybe I should close this issue and create a new one with a more generic description of the problem.

@nemcikjan, as mentioned in the above comment, do you plan to open a new issue with a more generic description?

Shubham82 commented Aug 5, 2024