Feature Request: Specify placement group with Hetzner Autoscaler
Which component are you using?: cluster-autoscaler
Is your feature request designed to solve a problem? If so, describe the problem this feature should solve.: Hetzner Placement Groups allow you to spread your VMs across different physical hardware, which decreases the probability that multiple instances fail together.
Describe the solution you'd like.:
It would be nice to be able to specify a given placement group for nodes managed by the autoscaler, just like we can currently specify things like the network or SSH key. Maybe a new env variable like HCLOUD_PLACEMENT_GROUP could be introduced.
Describe any alternative solutions you've considered.: Of course, the status quo is fine. This would just make the cluster a bit more robust.
Nice suggestion. Placement Groups have a limit of 10 instances, so using them for all your nodes might become a problem.
This can be introduced more nicely in the new JSON format for node group configuration: HCLOUD_CLUSTER_CONFIG
/area provider/hetzner
Thanks for the reply! I wasn't aware of the new env variable. Yes, that does seem like a good place to configure it. It's not entirely clear to me how it will interact with the existing variables it's replacing. If both HCLOUD_CLOUD_INIT and HCLOUD_CLUSTER_CONFIG are set, which one will take precedence?
So, if we were to add support for placement groups (acknowledging the 10-node limit), the JSON might look like this?
{
  "nodeConfigs": {
    "placementGroup": "name or ID here"
  }
}
Could a warning be shown in the logs and the setting disabled if a placement group is configured for a node pool that could have more than 10 nodes?
HCLOUD_CLUSTER_CONFIG will override HCLOUD_CLOUD_INIT: https://github.com/kubernetes/autoscaler/blob/81eed9622b009fc3f3f71429f124688c71a4dfc1/cluster-autoscaler/cloudprovider/hetzner/hetzner_node_group.go#L416-L420
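As a rough illustration of that precedence (the struct and field names below are simplified assumptions, not the code behind the link): the per-pool cloud-init from HCLOUD_CLUSTER_CONFIG wins when it is set, and the legacy HCLOUD_CLOUD_INIT value is only used as a fallback.

```go
package main

import "fmt"

// Simplified stand-ins for the provider's config types; the real structs in
// hetzner_manager.go carry more fields, and these names are assumptions.
type nodeConfig struct{ CloudInit string }
type clusterConfig struct{ NodeConfigs map[string]nodeConfig }

// cloudInitFor sketches the gist of the precedence: a per-pool cloud-init from
// HCLOUD_CLUSTER_CONFIG wins over the legacy HCLOUD_CLOUD_INIT value.
func cloudInitFor(cfg *clusterConfig, pool, legacyCloudInit string) string {
	if cfg != nil {
		if nc, ok := cfg.NodeConfigs[pool]; ok && nc.CloudInit != "" {
			return nc.CloudInit
		}
	}
	return legacyCloudInit
}

func main() {
	cfg := &clusterConfig{NodeConfigs: map[string]nodeConfig{
		"pool-1": {CloudInit: "#cloud-config from HCLOUD_CLUSTER_CONFIG"},
	}}
	fmt.Println(cloudInitFor(cfg, "pool-1", "value from HCLOUD_CLOUD_INIT"))
}
```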
The JSON should look like this:
{
  "nodeConfigs": {
    "pool-1": {
      "placementGroup": "name or ID here"
    }
  }
}
Could a warning be shown in the logs and the setting disabled if a placement group is configured for a node pool that could have more than 10 nodes?
I would prefer to fail more loudly by simply refusing to start. This is how the config validation is done right now; see klog.Fatalf: https://github.com/kubernetes/autoscaler/blob/81eed9622b009fc3f3f71429f124688c71a4dfc1/cluster-autoscaler/cloudprovider/hetzner/hetzner_cloud_provider.go#L203-L215
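For illustration, a minimal sketch of that fail-loud style (the function and parameter names here are made up for the example, not the provider's actual code): a pool that combines a placement group with a maxSize above the limit aborts startup instead of silently dropping the setting.

```go
package main

import "k8s.io/klog/v2"

// hcloudPlacementGroupMaxSize mirrors the 10-instance limit discussed above.
const hcloudPlacementGroupMaxSize = 10

// validatePlacementGroup aborts startup when a pool that uses a placement group
// could scale beyond the limit, instead of silently disabling the setting.
func validatePlacementGroup(pool string, maxSize int, placementGroup string) {
	if placementGroup != "" && maxSize > hcloudPlacementGroupMaxSize {
		klog.Fatalf("node pool %s uses placement group %q but allows up to %d nodes (limit is %d)",
			pool, placementGroup, maxSize, hcloudPlacementGroupMaxSize)
	}
}

func main() {
	validatePlacementGroup("pool-1", 12, "my-placement-group")
}
```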
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Not stale for me.
@dominic-p are you interested in implementing this? I can help out if you have any questions :)
/remove-lifecycle stale
Thanks for following up. I can take a stab at it. Go's not my language, unfortunately, but the change may be trivial enough for me to muddle through. Can you point me in the right direction (e.g. which file(s) I should start looking at)?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Still not stale for me. Still struggling to find the time to work on it.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
Just checking in on this again.
If someone can point me in the right direction, I can take a stab at a PR for this. As I mentioned before, Go isn't a language I know, but it should be pretty straightforward.
Some steps:
- Add a new field PlacementGroup string to the struct NodeConfig in hetzner_manager.go
- Add a new field PlacementGroup *hcloud.PlacementGroup to the struct hetznerNodeGroup in hetzner_node_group.go
- In hetzner_cloud_provider.go, BuildHetzner, line 217: if a PlacementGroup is specified in NodeConfigs[spec.name], check with the API that the placement group exists, and add it to the hetznerNodeGroup created in line 219
- After the loop over all nodeGroups (line 230): check that for each unique placement group, the sum of maxSize over all nodeGroups that specify this placement group is at most 10 (a rough sketch follows below)
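To make the last step a bit more concrete, here is a minimal, self-contained sketch. The struct definitions are simplified stand-ins for NodeConfig and hetznerNodeGroup (the real ones carry many more fields), and the placement group is kept as a plain name instead of *hcloud.PlacementGroup so the example runs on its own.

```go
package main

import "fmt"

// Simplified stand-ins for the structs named in the steps above; the real
// NodeConfig and hetznerNodeGroup live in hetzner_manager.go and
// hetzner_node_group.go.
type NodeConfig struct {
	PlacementGroup string // new field from step 1: name or ID of the placement group
}

type hetznerNodeGroup struct {
	id             string
	maxSize        int
	placementGroup string // stand-in for the new *hcloud.PlacementGroup field from step 2
}

const placementGroupLimit = 10

// checkPlacementGroupCapacity sketches the last step: for every placement
// group, the summed maxSize of all node groups that reference it must not
// exceed the 10-instance limit. In the provider this would run after the
// node-group loop in BuildHetzner and fail loudly (e.g. via klog.Fatalf)
// rather than return an error.
func checkPlacementGroupCapacity(groups []hetznerNodeGroup) error {
	totals := map[string]int{}
	for _, g := range groups {
		if g.placementGroup != "" {
			totals[g.placementGroup] += g.maxSize
		}
	}
	for pg, total := range totals {
		if total > placementGroupLimit {
			return fmt.Errorf("placement group %q could hold up to %d nodes, limit is %d",
				pg, total, placementGroupLimit)
		}
	}
	return nil
}

func main() {
	groups := []hetznerNodeGroup{
		{id: "pool-1", maxSize: 6, placementGroup: "pg-workers"},
		{id: "pool-2", maxSize: 6, placementGroup: "pg-workers"},
	}
	if err := checkPlacementGroupCapacity(groups); err != nil {
		fmt.Println("config error:", err)
	}
}
```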
Thank you so much for taking the time to lay that out for me. I took a shot at a PR. Let me know what you think.
As I said, Go is not my language, but I think I was able to muddle through the steps you gave me.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to this:
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Reopening the issue, as PR https://github.com/kubernetes/autoscaler/pull/6999 is already open and under review.
/reopen
@Shubham82: Reopened this issue.
In response to this:
Reopening the issue, as PR https://github.com/kubernetes/autoscaler/pull/6999 is already open and under review.
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.