
On-premise scaling to AWS

Open · nemcikjan opened this issue on Mar 14, 2023 · 13 comments

Which component are you using?:

cluster autoscaler

Describe the solution you'd like.:

I need to scale out from an on-premise k8s cluster into AWS and I'm not able to get it working. I created an ASG and provided sufficient AWS credentials. When I deploy a pod with a node selector matching the ASG labels, no instance is spun up. Moreover, the cluster autoscaler pod is automatically restarted roughly every 15 minutes. Any ideas/suggestions on how to get this working? Many thanks
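
For context: as far as I understand, while an ASG is still at zero instances the autoscaler can only infer node labels from ASG tags prefixed with k8s.io/cluster-autoscaler/node-template/label/, and a pending pod can only trigger a scale-up of that ASG if its nodeSelector is satisfied by those inferred labels. A minimal Go sketch of that matching rule (the tag key and selector values below are made up):

```go
package main

import (
	"fmt"
	"strings"
)

// Illustrative sketch only, not cluster-autoscaler source: derive template
// node labels from ASG tags and check a pod's nodeSelector against them.
const nodeTemplateLabelPrefix = "k8s.io/cluster-autoscaler/node-template/label/"

// labelsFromASGTags extracts node labels from ASG tags carrying the prefix.
func labelsFromASGTags(tags map[string]string) map[string]string {
	labels := map[string]string{}
	for k, v := range tags {
		if strings.HasPrefix(k, nodeTemplateLabelPrefix) {
			labels[strings.TrimPrefix(k, nodeTemplateLabelPrefix)] = v
		}
	}
	return labels
}

// selectorMatches reports whether every nodeSelector entry is present in labels.
func selectorMatches(nodeSelector, labels map[string]string) bool {
	for k, v := range nodeSelector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical ASG tag and pod nodeSelector, just to show the matching rule.
	asgTags := map[string]string{
		nodeTemplateLabelPrefix + "node-role.example.com/burst": "aws",
	}
	podSelector := map[string]string{"node-role.example.com/burst": "aws"}

	fmt.Println(selectorMatches(podSelector, labelsFromASGTags(asgTags))) // true
}
```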

Additional context.: This is the log output that keeps repeating until the pod restarts.

I0314 18:51:56.626978       1 auto_scaling_groups.go:386] Regenerating instance to ASG map for ASGs: [<asg_name>]
I0314 18:51:56.742505       1 auto_scaling_groups.go:154] Registering ASG <asg_name>
I0314 18:51:56.742559       1 aws_wrapper.go:281] 0 launch configurations to query
I0314 18:51:56.742572       1 aws_wrapper.go:282] 1 launch templates to query
I0314 18:51:56.742591       1 aws_wrapper.go:298] Successfully queried 0 launch configurations
I0314 18:51:56.778379       1 aws_wrapper.go:309] Successfully queried 1 launch templates
I0314 18:51:56.778438       1 auto_scaling_groups.go:435] Extracted autoscaling options from "<asg_name>" ASG tags: map[]
I0314 18:51:56.778464       1 aws_manager.go:266] Refreshed ASG list, next refresh after 2023-03-14 18:52:56.778457442 +0000 UTC m=+61.554755784
I0314 18:51:56.778928       1 main.go:305] Registered cleanup signal handler
I0314 18:51:56.779120       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0314 18:51:56.779164       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 18.439µs
I0314 18:52:06.780107       1 static_autoscaler.go:235] Starting main loop
E0314 18:52:06.782121       1 static_autoscaler.go:290] Failed to get node infos for groups: wrong id: expected format aws:///<zone>/<name>, got rke-2://<node_name>

nemcikjan commented Mar 14, 2023

What happens:

  1. Your cluster autoscaler is configured to work with AWS.
  2. It processes the existing nodes in the cluster and tries to extract specific information from them (in order to build template nodes for its simulations), based on the node.Spec.ProviderID field of the Node object.
  3. Since these nodes are on-prem (using Rancher K8s, if I get it right), the node.Spec.ProviderID does not match the AWS-valid format aws:///<zone>/<name>, so the autoscaler fails (see the sketch below).
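
A rough illustration of that format check (a minimal sketch, not the actual autoscaler code; the instance ID, zone, and node name are made up):

```go
package main

import (
	"fmt"
	"regexp"
)

// Minimal sketch of the check implied by the error above. The real AWS
// provider has its own ProviderID parsing; this only illustrates why an
// rke2-style ID is rejected.
var awsProviderIDRe = regexp.MustCompile(`^aws:///[^/]+/[^/]+$`)

func checkProviderID(providerID string) error {
	if !awsProviderIDRe.MatchString(providerID) {
		return fmt.Errorf("wrong id: expected format aws:///<zone>/<name>, got %s", providerID)
	}
	return nil
}

func main() {
	// An EC2-backed node carries an AWS-formatted ProviderID and passes:
	fmt.Println(checkProviderID("aws:///eu-central-1a/i-0123456789abcdef0")) // <nil>
	// An on-prem rke2 node does not, and fails with the message from the log:
	fmt.Println(checkProviderID("rke2://my-onprem-node"))
}
```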

I think the cluster autoscaler currently cannot support multi-platform clusters; I will let someone provide an authoritative confirmation on this and say whether there are any plans to offer support in future versions.

gregth commented Mar 15, 2023

@gregth Thank you for your response. I looked into the code and came to the same conclusion: multi-platform clusters, in this case on-prem rke2 and EC2 nodes, are not supported.

nemcikjan commented Mar 15, 2023

Which version of cluster-autoscaler are you running? This should work now with rke2 nodes on AWS since https://github.com/kubernetes/autoscaler/pull/5361, which should have made it into cluster-autoscaler-1.26.0.

ctrox commented Mar 28, 2023

@ctrox I was running v1.24. I just tried v1.26 but it's still the same. But I think you misunderstood our intentions. We are running rke2 nodes in an on-premise cluster (not using Rancher, just plain rke2) and want to scale out to run rke2 nodes in AWS, so we are not running rke2 nodes in AWS yet. The question is whether it would help to have an arbiter node running in AWS on EC2.

nemcikjan commented Apr 3, 2023

@JanNemcik Ah right, I have misunderstood your setup then. Not sure if what you are trying to do is supported yet.

ctrox commented Apr 3, 2023

@ctrox do you think this use case makes sense, and is it possible that it will be implemented in the future?

nemcikjan commented Apr 4, 2023

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented Jul 3, 2023

Hey! Did you manage to make this work?

voliveira-tmx commented Oct 25, 2023

@voliveira-tmx nope

nemcikjan commented Oct 25, 2023

This is a very interesting use case that I wanted to implement. I haven't tried it myself, though; I was trying to gather some knowledge first, but I haven't been able to find any practical examples of how to set this up. Any updates on this, @ctrox?

voliveira-tmx commented Oct 25, 2023

This is not something I'm working on, as it affects the AWS provider of cluster-autoscaler and might even need changes in the core cluster-autoscaler to make this work. I just maintain the rancher provider, which is not involved here; I only thought it was at the beginning.

ctrox commented Oct 26, 2023

This shouldn't even be AWS-specific. As @ctrox mentioned, it'd definitely require changes in the core codebase so it could support other vendors as well. I think it's not worth doing this for each provider separately, because it'd require a lot of additional work. Maybe I should close this issue and create a new one with a more generic description of the problem.

nemcikjan commented Oct 26, 2023

/remove-lifecycle stale

Shubham82 commented Nov 28, 2023

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented Feb 26, 2024

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented Apr 20, 2024

/remove-lifecycle rotten

Shubham82 commented May 7, 2024

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented Aug 5, 2024

This shouldn't even be AWS-specific. As @ctrox mentioned, it'd definitely require changes in the core codebase so it could support other vendors as well. I think it's not worth doing this for each provider separately, because it'd require a lot of additional work. Maybe I should close this issue and create a new one with a more generic description of the problem.

@nemcikjan, as mentioned in the above comment, do you plan to open a new issue with a more generic description?

Shubham82 commented Aug 5, 2024