
Feature Request: Specify placement group with Hetzner Autoscaler

Open dominic-p opened this issue 2 years ago • 17 comments

Which component are you using?: cluster-autoscaler

Is your feature request designed to solve a problem? If so, describe the problem this feature should solve.: Hetzner Placement Groups allow you to spread your VMs across different physical hardware, which decreases the probability that multiple instances fail together.

Describe the solution you'd like.: It would be nice to be able to specify a given placement group for nodes managed by the autoscaler, just like we can currently specify things like the network or SSH key. Maybe a new env variable like HCLOUD_PLACEMENT_GROUP could be introduced.

Describe any alternative solutions you've considered.: Of course, the status quo is fine. This would just make the cluster a bit more robust.

dominic-p avatar Jul 01 '23 08:07 dominic-p

Nice suggestion. Placement Groups have a limit of 10 instances, so using them for all your nodes might become a problem.

This can be introduced more nicely in the new JSON format for node group configuration: HCLOUD_CLUSTER_CONFIG

/area provider/hetzner

apricote avatar Oct 20 '23 05:10 apricote

Thanks for the reply! I wasn't aware of the new env variable. Yes, that does seem like a good place to configure it. It's not entirely clear to me how it will interact with the existing variables it's replacing. If both HCLOUD_CLOUD_INIT and HCLOUD_CLUSTER_CONFIG are set, which one will take precedence?

So, if we were to add support for placement groups (acknowledging the 10-node limit), the JSON might look like this?

{
    "nodeConfigs": {
        "placementGroup": "name or ID here"
    }
}

Could a warning be shown in the logs and the setting disabled if a placement group is configured for a node pool that could have more than 10 nodes?

dominic-p avatar Oct 23 '23 18:10 dominic-p

HCLOUD_CLUSTER_CONFIG will override HCLOUD_CLOUD_INIT: https://github.com/kubernetes/autoscaler/blob/81eed9622b009fc3f3f71429f124688c71a4dfc1/cluster-autoscaler/cloudprovider/hetzner/hetzner_node_group.go#L416-L420
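
In other words, when HCLOUD_CLUSTER_CONFIG is in use, its per-pool cloud-init wins and the legacy HCLOUD_CLOUD_INIT value is not used. A simplified, self-contained illustration of that precedence (the function and parameter names here are illustrative, not the provider's actual identifiers; the real branching is in the linked lines):

package main

import "fmt"

// resolveCloudInit mirrors the precedence described above: when the new
// HCLOUD_CLUSTER_CONFIG format is in use, its per-pool cloud-init is used
// and the legacy HCLOUD_CLOUD_INIT value only serves as a fallback.
func resolveCloudInit(usingClusterConfig bool, perPoolCloudInit, legacyCloudInit string) string {
    if usingClusterConfig {
        return perPoolCloudInit
    }
    return legacyCloudInit
}

func main() {
    fmt.Println(resolveCloudInit(true, "#cloud-config (from HCLOUD_CLUSTER_CONFIG)", "#cloud-config (from HCLOUD_CLOUD_INIT)"))
}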

The JSON should look like this:

{
  "nodeConfigs": {
    "pool-1": {
      "placementGroup": "name or ID here"
    }
  }
}

Could a warning be shown in the logs and the setting disabled if a placement group is configured for a node pool that could have more than 10 nodes?

I would prefer that we fail more loudly by simply refusing to start. This is how the config validation is done right now; see klog.Fatalf: https://github.com/kubernetes/autoscaler/blob/81eed9622b009fc3f3f71429f124688c71a4dfc1/cluster-autoscaler/cloudprovider/hetzner/hetzner_cloud_provider.go#L203-L215
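
To make that concrete, here is a hedged sketch of such a fail-loudly check (the function, its arguments, and the message are hypothetical; only klog.Fatalf itself is the real API):

package main

import "k8s.io/klog/v2"

// Hypothetical startup validation in the spirit described above: if a node
// pool references a placement group that cannot be resolved, refuse to
// start instead of silently disabling the setting.
func mustResolvePlacementGroup(poolName, placementGroup string, found bool) {
    if placementGroup != "" && !found {
        klog.Fatalf("node pool %s references placement group %q, which does not exist", poolName, placementGroup)
    }
}

func main() {
    // Terminates the process with a fatal log entry.
    mustResolvePlacementGroup("pool-1", "my-placement-group", false)
}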

apricote avatar Nov 20 '23 07:11 apricote

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 18 '24 08:02 k8s-triage-robot

Not stale for me.

dominic-p avatar Feb 19 '24 20:02 dominic-p

@dominic-p are you interested in implementing this? I can help out if you have any questions :)

/remove-lifecycle stale

apricote avatar Feb 20 '24 06:02 apricote

Thanks for following up. I can take a stab at it. Go's not my language, unfortunately, but the change may be trivial enough for me to muddle through. Can you point me in the right direction (e.g., which file(s) I should start looking at)?

dominic-p avatar Feb 20 '24 17:02 dominic-p

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar May 20 '24 18:05 k8s-triage-robot

Still not stale for me. Still struggling to find the time to work on it.

dominic-p avatar May 20 '24 19:05 dominic-p

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jun 19 '24 20:06 k8s-triage-robot

Just checking in on this again.

If someone can point me in the right direction, I can take a stab at a PR for this. As I mentioned before, Go isn't a language I know, but it should be pretty straightforward.

dominic-p avatar Jun 19 '24 21:06 dominic-p

Some steps:

  1. Add a new field PlacementGroup string to the NodeConfig struct in hetzner_manager.go
  2. Add a new field PlacementGroup *hcloud.PlacementGroup to the hetznerNodeGroup struct in hetzner_node_group.go
  3. In BuildHetzner in hetzner_cloud_provider.go (line 217): if a PlacementGroup is specified in NodeConfigs[spec.name], check with the API that the placement group exists, then add it to the hetznerNodeGroup created on line 219
  4. After the loop over all node groups (line 230): check that, for each unique placement group, the sum of maxSize across all node groups referencing it does not exceed 10 (see the sketch after this list)
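
A minimal, self-contained sketch of steps 1, 2 and 4 under those assumptions (type and field names are stand-ins for the real structs in hetzner_manager.go and hetzner_node_group.go; the API lookup from step 3 is omitted):

package main

import "fmt"

// Step 1 (stand-in): NodeConfig gains a placement group name or ID.
type NodeConfig struct {
    PlacementGroup string
}

// Step 2 (stand-in): each node group records its resolved placement group
// and its maximum size.
type hetznerNodeGroup struct {
    maxSize            int
    placementGroupName string // empty if the pool has no placement group
}

const placementGroupMaxInstances = 10 // Hetzner's per-group limit

// Step 4: the combined maxSize of all node groups sharing a placement group
// must not exceed the 10-instance limit.
func validatePlacementGroupSizes(groups map[string]*hetznerNodeGroup) error {
    totals := make(map[string]int)
    for _, ng := range groups {
        if ng.placementGroupName == "" {
            continue
        }
        totals[ng.placementGroupName] += ng.maxSize
    }
    for name, total := range totals {
        if total > placementGroupMaxInstances {
            return fmt.Errorf("placement group %q could receive up to %d nodes, exceeding the limit of %d", name, total, placementGroupMaxInstances)
        }
    }
    return nil
}

func main() {
    groups := map[string]*hetznerNodeGroup{
        "pool-1": {maxSize: 6, placementGroupName: "spread-1"},
        "pool-2": {maxSize: 6, placementGroupName: "spread-1"},
    }
    if err := validatePlacementGroupSizes(groups); err != nil {
        fmt.Println(err) // 6 + 6 = 12 > 10, so the misconfiguration is reported
    }
}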

apricote avatar Jun 28 '24 07:06 apricote

Thank you so much for taking the time to lay that out for me. I took a shot at a PR. Let me know what you think.

As I said, Go is not my language, but I think I was able to muddle through the steps you gave me.

dominic-p avatar Jul 03 '24 06:07 dominic-p

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Aug 02 '24 07:08 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Aug 02 '24 07:08 k8s-ci-robot

Reopening the issue, as PR https://github.com/kubernetes/autoscaler/pull/6999 has already been opened and is under review.

/reopen

Shubham82 avatar Oct 22 '24 13:10 Shubham82

@Shubham82: Reopened this issue.

In response to this:

Reopening the issue, as PR https://github.com/kubernetes/autoscaler/pull/6999 has already been opened and is under review.

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Oct 22 '24 13:10 k8s-ci-robot