
Feature Request: Specify placement group with Hetzner Autoscaler

Open dominic-p opened this issue 2 years ago • 17 comments

Which component are you using?: cluster-autoscaler

Is your feature request designed to solve a problem? If so, describe the problem this feature should solve.: Hetzner Placement Groups allow you to spread your VMs across different physical hardware, which decreases the probability that multiple instances fail together.

Describe the solution you'd like.: It would be nice to be able to specify a given placement group for nodes managed by the autoscaler, just like we can currently specify things like the network or SSH key. Maybe a new env variable like HCLOUD_PLACEMENT_GROUP could be introduced.

Describe any alternative solutions you've considered.: Of course, the status quo is fine. This would just make the cluster a bit more robust.

dominic-p avatar Jul 01 '23 08:07 dominic-p

Nice suggestion. Placement Groups have a limit of 10 instances, so using them for all your nodes might become a problem.

This can be introduced more nicely in the new JSON format for node group configuration: HCLOUD_CLUSTER_CONFIG

/area provider/hetzner

apricote avatar Oct 20 '23 05:10 apricote

Thanks for the reply! I wasn't aware of the new env variable. Yes, that does seem like a good place to configure it. It's not entirely clear to me how it will interact with the existing variables it's replacing. If both HCLOUD_CLOUD_INIT and HCLOUD_CLUSTER_CONFIG are set, which one will take precedence?

So, if we were to add support for placement groups (acknowledging the 10-node limit), the JSON might look like this?

{
    "nodeConfigs": {
        "placementGroup": "name or ID here"
    }
}

Could a warning be shown in the logs and the setting disabled if a placement group is configured for a node pool that could have more than 10 nodes?

dominic-p avatar Oct 23 '23 18:10 dominic-p

HCLOUD_CLUSTER_CONFIG will override HCLOUD_CLOUD_INIT: https://github.com/kubernetes/autoscaler/blob/81eed9622b009fc3f3f71429f124688c71a4dfc1/cluster-autoscaler/cloudprovider/hetzner/hetzner_node_group.go#L416-L420
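
In other words, when HCLOUD_CLUSTER_CONFIG is in use, its per-pool cloud-init wins and the legacy HCLOUD_CLOUD_INIT value is not used. A simplified, self-contained illustration of that precedence (the function and parameter names here are illustrative, not the provider's actual identifiers; the real branching is in the linked lines):

package main

import "fmt"

// resolveCloudInit mirrors the precedence described above: when the new
// HCLOUD_CLUSTER_CONFIG format is in use, its per-pool cloud-init is used
// and the legacy HCLOUD_CLOUD_INIT value only serves as a fallback.
func resolveCloudInit(usingClusterConfig bool, perPoolCloudInit, legacyCloudInit string) string {
    if usingClusterConfig {
        return perPoolCloudInit
    }
    return legacyCloudInit
}

func main() {
    fmt.Println(resolveCloudInit(true, "#cloud-config (from HCLOUD_CLUSTER_CONFIG)", "#cloud-config (from HCLOUD_CLOUD_INIT)"))
}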

The JSON should look like this:

{
  "nodeConfigs": {
    "pool-1": {
      "placementGroup": "name or ID here"
    }
  }
}

Could a warning be shown in the logs and the setting disabled if a placement group is configured for a node pool that could have more than 10 nodes?

I would prefer that we fail more loudly by simply refusing to start. This is how the config validation is done right now; see klog.Fatalf: https://github.com/kubernetes/autoscaler/blob/81eed9622b009fc3f3f71429f124688c71a4dfc1/cluster-autoscaler/cloudprovider/hetzner/hetzner_cloud_provider.go#L203-L215
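
To make that concrete, here is a hedged sketch of such a fail-loudly check (the function, its arguments, and the message are hypothetical; only klog.Fatalf itself is the real API):

package main

import "k8s.io/klog/v2"

// Hypothetical startup validation in the spirit described above: if a node
// pool references a placement group that cannot be resolved, refuse to
// start instead of silently disabling the setting.
func mustResolvePlacementGroup(poolName, placementGroup string, found bool) {
    if placementGroup != "" && !found {
        klog.Fatalf("node pool %s references placement group %q, which does not exist", poolName, placementGroup)
    }
}

func main() {
    // Terminates the process with a fatal log entry.
    mustResolvePlacementGroup("pool-1", "my-placement-group", false)
}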

apricote avatar Nov 20 '23 07:11 apricote

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 18 '24 08:02 k8s-triage-robot

Not stale for me.

dominic-p avatar Feb 19 '24 20:02 dominic-p

@dominic-p are you interested in implementing this? I can help out if you have any questions :)

/remove-lifecycle stale

apricote avatar Feb 20 '24 06:02 apricote

Thanks for following up. I can take a stab at it. Go's not my language, unfortunately, but the change may be trivial enough for me to muddle through. Can you point me in the right direction (e.g., which file(s) I should start looking at)?

dominic-p avatar Feb 20 '24 17:02 dominic-p

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar May 20 '24 18:05 k8s-triage-robot

Still not stale for me. Still struggling to find the time to work on it.

dominic-p avatar May 20 '24 19:05 dominic-p

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jun 19 '24 20:06 k8s-triage-robot

Just checking in on this again.

If someone can point me in the right direction, I can take a stab at a PR for this. As I mentioned before, Go isn't a language I know, but it should be pretty straightforward.

dominic-p avatar Jun 19 '24 21:06 dominic-p

Some steps:

  1. Add a new field PlacementGroup string to the NodeConfig struct in hetzner_manager.go
  2. Add a new field PlacementGroup *hcloud.PlacementGroup to the hetznerNodeGroup struct in hetzner_node_group.go
  3. In BuildHetzner in hetzner_cloud_provider.go (line 217): if a PlacementGroup is specified in NodeConfigs[spec.name], check with the API that the placement group exists, then add it to the hetznerNodeGroup created on line 219
  4. After the loop over all node groups (line 230): check that, for each unique placement group, the sum of maxSize across all node groups referencing it does not exceed 10 (see the sketch after this list)
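
A minimal, self-contained sketch of steps 1, 2 and 4 under those assumptions (type and field names are stand-ins for the real structs in hetzner_manager.go and hetzner_node_group.go; the API lookup from step 3 is omitted):

package main

import "fmt"

// Step 1 (stand-in): NodeConfig gains a placement group name or ID.
type NodeConfig struct {
    PlacementGroup string
}

// Step 2 (stand-in): each node group records its resolved placement group
// and its maximum size.
type hetznerNodeGroup struct {
    maxSize            int
    placementGroupName string // empty if the pool has no placement group
}

const placementGroupMaxInstances = 10 // Hetzner's per-group limit

// Step 4: the combined maxSize of all node groups sharing a placement group
// must not exceed the 10-instance limit.
func validatePlacementGroupSizes(groups map[string]*hetznerNodeGroup) error {
    totals := make(map[string]int)
    for _, ng := range groups {
        if ng.placementGroupName == "" {
            continue
        }
        totals[ng.placementGroupName] += ng.maxSize
    }
    for name, total := range totals {
        if total > placementGroupMaxInstances {
            return fmt.Errorf("placement group %q could receive up to %d nodes, exceeding the limit of %d", name, total, placementGroupMaxInstances)
        }
    }
    return nil
}

func main() {
    groups := map[string]*hetznerNodeGroup{
        "pool-1": {maxSize: 6, placementGroupName: "spread-1"},
        "pool-2": {maxSize: 6, placementGroupName: "spread-1"},
    }
    if err := validatePlacementGroupSizes(groups); err != nil {
        fmt.Println(err) // 6 + 6 = 12 > 10, so the misconfiguration is reported
    }
}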

apricote avatar Jun 28 '24 07:06 apricote

Thank you so much for taking the time to lay that out for me. I took a shot at a PR. Let me know what you think.

As I said, Go is not my language, but I think I was able to muddle through the steps you gave me.

dominic-p avatar Jul 03 '24 06:07 dominic-p

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Aug 02 '24 07:08 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Aug 02 '24 07:08 k8s-ci-robot

Reopening the issue, as PR https://github.com/kubernetes/autoscaler/pull/6999 has already been opened and is under review.

/reopen

Shubham82 avatar Oct 22 '24 13:10 Shubham82

@Shubham82: Reopened this issue.

In response to this:

Reopening the issue, as PR https://github.com/kubernetes/autoscaler/pull/6999 has already been opened and is under review.

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Oct 22 '24 13:10 k8s-ci-robot