autoscaler
On-premise scaling to AWS
Which component are you using?:
cluster autoscaler
Describe the solution you'd like.:
I need to scale from an on-premise k8s cluster into AWS, and I'm not able to get it working. I created an ASG and provided sufficient AWS credentials. When I try to deploy a pod with a node selector matching the ASG labels, no instance is spun up. Moreover, the cluster-autoscaler pod is automatically restarted roughly every 15 minutes. Any ideas/suggestions on how to get this working? Many thanks
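For context, the setup is roughly the sketch below; the ASG name, region, label, and sizes are placeholders rather than my real values, and it assumes the standard AWS provider flags and node-template ASG tags:

# Rough sketch of the setup described above (all names/values are placeholders).
# cluster-autoscaler runs in the on-prem cluster and points at a single AWS ASG:
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.26.2
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --nodes=0:5:onprem-burst-asg        # placeholder ASG name
    env:
      - name: AWS_REGION
        value: eu-central-1                 # placeholder region
---
# The ASG is tagged so the autoscaler knows which labels a new node would carry,
# e.g. k8s.io/cluster-autoscaler/node-template/label/node-location = aws,
# and the test pod selects that label:
apiVersion: v1
kind: Pod
metadata:
  name: burst-test
spec:
  nodeSelector:
    node-location: aws                      # placeholder label from the ASG tag
  containers:
    - name: app
      image: nginx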
Additional context.: This is basically the log output that appears repeatedly until the pod restarts.
I0314 18:51:56.626978 1 auto_scaling_groups.go:386] Regenerating instance to ASG map for ASGs: [<asg_name>]
I0314 18:51:56.742505 1 auto_scaling_groups.go:154] Registering ASG <asg_name>
I0314 18:51:56.742559 1 aws_wrapper.go:281] 0 launch configurations to query
I0314 18:51:56.742572 1 aws_wrapper.go:282] 1 launch templates to query
I0314 18:51:56.742591 1 aws_wrapper.go:298] Successfully queried 0 launch configurations
I0314 18:51:56.778379 1 aws_wrapper.go:309] Successfully queried 1 launch templates
I0314 18:51:56.778438 1 auto_scaling_groups.go:435] Extracted autoscaling options from "<asg_name>" ASG tags: map[]
I0314 18:51:56.778464 1 aws_manager.go:266] Refreshed ASG list, next refresh after 2023-03-14 18:52:56.778457442 +0000 UTC m=+61.554755784
I0314 18:51:56.778928 1 main.go:305] Registered cleanup signal handler
I0314 18:51:56.779120 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0314 18:51:56.779164 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 18.439µs
I0314 18:52:06.780107 1 static_autoscaler.go:235] Starting main loop
E0314 18:52:06.782121 1 static_autoscaler.go:290] Failed to get node infos for groups: wrong id: expected format aws:///<zone>/<name>, got rke-2://<node_name>
What happens:
- Your cluster autoscaler is configured to work with AWS.
- It processes the existing nodes in the cluster and tries to extract specific information from them (in order to build template nodes for its simulations), based on the node.Spec.ProviderID field of the Node object.
- Since these nodes are on-prem (using Rancher K8s, if I get it right), the node.Spec.ProviderID does not match the AWS-valid format aws:///<zone>/<name>, thus the autoscaler fails.
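To make the mismatch concrete, here is roughly what spec.providerID looks like on the two kinds of nodes; the node names and instance ID below are illustrative placeholders, not values from your cluster:

# Placeholder on-prem rke2 node: providerID is not in the format the AWS provider expects.
apiVersion: v1
kind: Node
metadata:
  name: onprem-worker-1
spec:
  providerID: rke-2://onprem-worker-1
---
# Placeholder EC2 node registered by the AWS cloud provider: aws:///<zone>/<instance-id>.
apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-1-23.eu-central-1.compute.internal
spec:
  providerID: aws:///eu-central-1a/i-0123456789abcdef0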
I think that currently the cluster autoscaler cannot support multi-platform clusters; I will let someone else provide an authoritative confirmation of this and of whether there are any plans to offer support in future versions.
@gregth Thank you for your response. I tried looking into the code and came to the same conclusion: multi-platform clusters, in this case on-prem rke2 plus EC2 nodes, are not supported.
Which version of cluster-autoscaler are you running? This should work now with rke2 nodes on AWS since https://github.com/kubernetes/autoscaler/pull/5361, which should have made it into cluster-autoscaler-1.26.0.
@ctrox I was running v1.24. I just tried v1.26, but it's still the same. I think you misunderstood our intentions, though. We are running rke2 nodes in an on-premise cluster, not using Rancher, just plain rke2, and we want to scale out to run rke2 nodes in AWS; we are not running rke2 nodes in AWS yet. The question is whether it would help to have an arbiter node running in AWS on EC2.
@JanNemcik Ah right, I have misunderstood your setup then. Not sure if what you are trying to do is supported yet.
@ctrox do you think this use case makes sense, and is it possible that it will be implemented in the future?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Hey! Did you manage to make this work?
@voliveira-tmx nope
This is a very interesting use case that I wanted to implement. I haven't tried it myself, though; I was trying to gather some knowledge first, but I haven't been able to find any practical examples of how to set this up. Any updates on this @ctrox?
This is not something I'm working on, as it affects the AWS provider of cluster-autoscaler and might even need changes in the core cluster-autoscaler to make this work. I just maintain the rancher provider, which is not involved here; I just thought it was in the beginning.
This shouldn't even be AWS-specific. As @ctrox mentioned, it'd definitely require changes in the core codebase so it could support other vendors as well. I think it's not worth doing it for each provider separately, because it'd require a lot of additional work. Maybe I should close this issue and create a new one with a more generic description of the problem.
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
This shouldn't even be AWS-specific. As @ctrox mentioned, it'd definitely require changes in the core codebase so it could support other vendors as well. I think it's not worth doing it for each provider separately, because it'd require a lot of additional work. Maybe I should close this issue and create a new one with a more generic description of the problem.
@nemcikjan, as mentioned in the comment quoted above, do you plan to open a new issue with a more generic description of the problem?