terraform-hcloud-kube-hetzner
Use with cluster autoscaler
How would this be used with cluster autoscaler? Does this module support cluster autoscaler? If so, how would I configure it?
Hey @BlakeB415, we do not support cluster autoscaler yet! PRs are always welcome!
Chipping in on the matter: for that, I've been testing Kubermatic's KubeOne, which will get you a cluster on Hetzner with a Machine Controller able to scale the number of physical nodes the way you scale a Deployment.
That said, I very much prefer this project's simplicity and all the goodies it has, so seeing that functionality here would be amazing.
Maybe it is possible to install their machine controller on a cluster provisioned with this project; I haven't tried yet.
Yes, an autoscaling cluster would be the only missing link, and then it's perfect in my opinion... I don't know how it works, but could the master node contain/create a limited-usage Hetzner API token that triggers a scale-up on >80% load or something?
Probably not really secure, right?
@BlakeB415 cluster-autoscaler already seems to include a Hetzner cloud provider: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider
So it should, in theory, not be too much work? I'm open to taking a look if it's not too much.
@JustinGuese Exactly. I imagine that the Kubernetes-native autoscaler would be ideal to "sense" the need to autoscale, but this project runs on Terraform, so the actual autoscaling needs to be done by changing the node count at the nodepool level, in the kube.tf file. Recently I stumbled upon this project that maybe could help: https://github.com/runatlantis/atlantis.
Maybe combining the two would get us something working?! It would of course require us to run Atlantis somewhere. We could also couple that with a simple NextJS web app (or Python Streamlit app) that would let us manage all of that and bake in more "intelligence".
The latter option would require a side project; we could call it KubeHetzner UI, or something similar. But all these are just suppositions; maybe there is a shorter path, through a Rancher server for instance?!
Disclaimer: The following is just my view on things and may not be the best idea
The number of non-control-plane nodes is not something Terraform is good at managing; that should be the job of the autoscaler/machine controller. Ideally, we define the nodepools in Terraform like we are doing now, but then Kubernetes-native systems manage how many nodes per pool to provision.
The magic of how to provision nodes without a user armed with Terraform on their computer/CD system is coded in Kubermatic's Machine Controller, but how it actually works escapes me. From my limited testing of KubeOne, only the control plane is provisioned with Terraform, and the worker pools are all managed through the Machine Controller.
There's also https://github.com/syself/cluster-api-provider-hetzner, but it goes a step beyond and gets rid of Terraform entirely, which I also need to try.
@p4block It's very interesting, and we could do that by saving a worker node snapshot "image" during the deployment, and later letting the autoscaler use it to deploy new worker nodes.
Exactly how to do that needs to be researched; possibly it can all be done in Terraform, but at least I know it can be done via the hcloud CLI.
Really worth researching more! PRs welcome :)
@mysticaltech Yeah, with an external service for sure, but I guess we're leaving open source then...
Something I could imagine with the native solution would be to somehow grab the requirements from Terraform and insert them at the end. I guess the hardest part would be, like @p4block said, the image of a worker node...
The native scaler requires:

- HCLOUD_CLOUD_INIT: Base64-encoded cloud-init YAML with commands to join the cluster. Sample: [examples/cloud-init.txt (for Kubernetes 1.20.1)](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/hetzner/examples/cloud-init.txt)
- HCLOUD_IMAGE: Defaults to ubuntu-20.04, see https://docs.hetzner.cloud/#images. You can also use an image ID here (e.g. 15512617), or a label selector associated with a custom snapshot (e.g. customized_ubuntu=true); the most recent snapshot will be used in the latter case.
- HCLOUD_NETWORK: Default empty. The name of the network that is used in the cluster, see https://docs.hetzner.cloud/#networks
- HCLOUD_FIREWALL: Default empty. The name of the firewall that is used in the cluster, see https://docs.hetzner.cloud/#firewalls
- HCLOUD_SSH_KEY: Default empty. This SSH key will have access to the freshly created server, see https://docs.hetzner.cloud/#ssh-keys
Do you have an idea of whether we can grab these values? I mean, except for the image, it should be doable?
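For illustration, here is a rough sketch of where each of those values could come from inside this module. Every name on the right-hand side (local.agent_cloud_init, hcloud_network.k3s, and so on) is an assumption about internal names, not the module's actual code:

```hcl
# Hypothetical mapping of the autoscaler's environment variables to values
# the module already knows; all referenced names are assumptions.
locals {
  autoscaler_env = {
    HCLOUD_CLOUD_INIT = base64encode(local.agent_cloud_init) # join-cluster cloud-init already rendered for agents
    HCLOUD_IMAGE      = "kube-hetzner-autoscaler=true"       # label selector on the snapshot we would create
    HCLOUD_NETWORK    = hcloud_network.k3s.name              # the cluster's private network
    HCLOUD_FIREWALL   = hcloud_firewall.k3s.name             # the cluster firewall
    HCLOUD_SSH_KEY    = hcloud_ssh_key.k3s.name              # the key already uploaded to Hetzner
  }
}
```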
@JustinGuese Thanks for extracting and sharing those details. We have all these values!! The only thing missing is creating a snapshot at the end of the install and passing its ID to HCLOUD_IMAGE.
It's completely doable, but I just do not have the bandwidth to work on this right away; I will help the best I can, though.
That will help get us there: https://registry.terraform.io/providers/hetznercloud/hcloud/latest/docs/resources/snapshot.
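A minimal sketch of that resource, assuming the first agent node is addressable as hcloud_server.agents[0] (the real address in this module may differ):

```hcl
# Snapshot the first agent node and label it so the autoscaler's
# HCLOUD_IMAGE label selector can find it (names are assumptions).
resource "hcloud_snapshot" "autoscaler_image" {
  server_id   = hcloud_server.agents[0].id
  description = "kube-hetzner autoscaler base image"
  labels = {
    "kube-hetzner-autoscaler" = "true"
  }
}
```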
This problem has been simplified a lot by the above findings. Just one thing: ideally, we would create different snapshots for each kind of agent nodepool. That way, choosing the kind of server we want to autoscale with would be easy.
Folks, does no one want to give this a shot? It's not that hard, especially if we just choose to take a snapshot for the first nodepool definition. I will help, but I don't have the bandwidth to do it all by myself right away.
:D yeah same, I might have some time in September
After thinking about the networking aspect of this, we would need one dedicated nodepool for autoscaling (easy peasy) that we would not extend manually. We could, for instance, add an attribute autoscaling = true and select based on its value; if true, we select the right server network (which is, in fact, a subnet) dedicated to that autoscaling nodepool. So, of course, no more manual deploys on that nodepool; we would ignore the count attribute anyway.
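To make the idea concrete, a kube.tf nodepool list could look like the following sketch. The autoscaling attribute is only the proposal above, not something the module supports today, and the pool/server names are invented:

```hcl
agent_nodepools = [
  {
    name        = "agent-small"
    server_type = "cpx21"
    location    = "fsn1"
    count       = 2
  },
  {
    name        = "autoscaled"
    server_type = "cpx31"
    location    = "fsn1"
    count       = 1     # ignored for this pool; the autoscaler takes over
    autoscaling = true  # proposed flag selecting the dedicated subnet
  },
]
```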
So basically: deploy at least one node in that nodepool, snapshot it, and then prepare all the details the Hetzner autoscaler needs, as cited above by @JustinGuese. That will be used to feed the autoscaler.
That said, we would need, from the get-go, at least a max_number_nodes_autoscaler variable, so as to properly reserve the server_network IPs for up to that number of nodes in the subnet created just for the autoscaler nodepool.
(To understand the above, just see how the current nodepools are created and networked together.)
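A sketch of that reservation, assuming the module's network resource is called hcloud_network.k3s and that the base network covers the 10.255.0.0/16 range used for agents below; all names are assumptions for illustration:

```hcl
# Proposed variable plus a dedicated subnet holding the reserved IPs.
variable "max_number_nodes_autoscaler" {
  type        = number
  default     = 10
  description = "Upper bound of nodes the autoscaler may create; IPs are reserved up front."
}

resource "hcloud_network_subnet" "autoscaler" {
  network_id   = hcloud_network.k3s.id
  type         = "cloud"
  network_zone = "eu-central"
  ip_range     = "10.255.0.0/16" # room for far more than any sensible max
}
```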
This logic would need to live in an autoscaling.tf file, similar to agent.tf but not entirely the same.
Low-hanging fruit, folks! Pick it up, and I will help you!!
Is somebody already working on this autoscaling feature? Any progress? Stuck? Feedback?
Hello @codeagencybe, I have laid out above how I think it should flow, but I have not yet had time to work on that feature. It probably needs just a few hours, as the "path" seems obstacle-free.
I can't say when I will take this on, but in the meantime, if any of you folks want to give this a shot, please do so - I will be very responsive on the PR! 🙏
Hello @mysticaltech, if I had the knowledge to create it, I would have done it for you, but unfortunately I'm still learning this language, so I can't help you with the development. But I'm happy to help anywhere else I can: testing, documenting, translating, ... If there is anything like this that you need, let me know.
I'm quite interested in looking at this, but I can't really seem to find where the variables (snapshot ID, etc.) would be fed in. Here's the documentation I found: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/hetzner/README.md, which has this YAML: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/hetzner/examples/cluster-autoscaler-run-on-master.yaml. So, would we somehow create + apply the YAML from within Terraform, or what would your thoughts be, @mysticaltech?
@Nigma1337 Very glad to hear it. You are indeed correct: it does not accept the snapshot ID, but instead a label set on the snapshot.
About the YAML, the easiest solution would be to add it to the templates folder as hetzner_autoscaler_config.yaml.tpl, then load it and replace the needed values as is done for the other template files in agents.tf.
Maybe the logic is simple enough to just live in agents.tf, with no need to create another file as suggested above, but do what feels best at the moment of implementation.
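A sketch of that loading step, using templatefile() the way other templates are rendered in the module; the template name is the one suggested above, and the variables passed in are guesses:

```hcl
# Render the autoscaler manifest from a template, substituting the values
# the Hetzner autoscaler needs (all variable names are assumptions).
locals {
  autoscaler_manifest = templatefile(
    "${path.module}/templates/hetzner_autoscaler_config.yaml.tpl",
    {
      cloud_init   = base64encode(local.agent_cloud_init)
      image_label  = "kube-hetzner-autoscaler=true"
      network_name = hcloud_network.k3s.name
    }
  )
}
```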
@Nigma1337 I have just downloaded the GitHub mobile app, so I will be more responsive on this issue if you decide to go through with this. Let's get it done!!! 🚀 🤞
Made an initial commit on my fork: https://github.com/Nigma1337/terraform-hcloud-kube-hetzner/tree/autoscaling
I can't really seem to figure out how I'd do the logic of only installing the autoscaler if max nodes > 0; I've never worked on a Terraform project of this size.
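One common Terraform pattern for that is to drive count off the setting, sketched here on a null_resource stand-in (the real resource would be whatever applies the rendered manifest):

```hcl
# Only create the autoscaler resources when autoscaling is actually wanted:
# count = 0 skips the resource entirely.
resource "null_resource" "autoscaler" {
  count = var.max_autoscaling_count > 0 ? 1 : 0

  # provisioner/manifest application for the autoscaler would go here
}
```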
Wonderful, @Nigma1337; it's a good start. Please do open a draft PR, so that I can contribute to it too.
Something that should be changed is the snapshot.id; actually, that is not needed or wanted, since the selection is based on the unique label you give the snapshot (see above).
For the subnet logic, it's a bit delicate, as the 10.0.0.0/16 CIDR is compartmentalized in a certain way for control plane and agent nodepools. Basically, control planes are 10.0.0..., 10.1.0..., 10.2.0..., and agents 10.255.0..., 10.254.0..., 10.253.0... So in our case, we can just take 10.255.0... for the initial support of one and only one autoscaling nodepool. Later on, we can expand to up to 10 autoscaling nodepools, but let's start simple first, haha. I propose contributing that logic, but please do not hesitate to try. You can find examples of how those CIDRs are calculated in control-planes.tf and agents.tf.
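For what it's worth, Terraform's cidrsubnet() can express that numbering directly; this sketch assumes a 10.0.0.0/8 base network, which is what the second-octet pattern above implies:

```hcl
locals {
  # Agents count down from 255, so the single autoscaling nodepool can take
  # the top slot: cidrsubnet("10.0.0.0/8", 8, 255) => "10.255.0.0/16".
  autoscaler_subnet_cidr = cidrsubnet("10.0.0.0/8", 8, 255)
}
```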
Also, the way I see the logic coming together, it's best to separate it into its own autoscaling.tf; there is no need for it to be in either agents.tf or main.tf.
And to simplify matters even more, let's just NOT give any definition of what that autoscaling nodepool might be. Let's just copy the definition of the last agent nodepool.
That way, we would need only 3 variables: min_autoscaling_count, max_autoscaling_count, and enable_autoscaling (boolean). Please correct me if I am wrong!
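Those three knobs, written out as plain variable blocks (the names are the ones proposed above; the defaults are my guesses):

```hcl
variable "enable_autoscaling" {
  type        = bool
  default     = false
  description = "Deploy the Hetzner cluster autoscaler for the copied nodepool."
}

variable "min_autoscaling_count" {
  type    = number
  default = 0
}

variable "max_autoscaling_count" {
  type    = number
  default = 3
}
```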
This is looking good, my friend! This baby is coming soon 🚀
Wouldn't it be nicer to create a snapshot with Packer and use that snapshot then? Or use a snapshot from the first control-plane node (which is always needed)?
As far as I know, Packer isn't supported on Hetzner. @ifeulner have you ever done it?
As @otavio said, Packer is not supported by the Kubernetes autoscaler (at least I do not see it working in that context). And why would you want to scale your control planes? We could do that later on, but for now, scaling a particular agent nodepool (copying its definition at least, like the first or last one) is the priority, as it would provide something that works.
@ifeulner What you seem to want to do (deploy more nodes after the initial launch), you can already do right now, just by adding nodepools or increasing the count of already-present nodepools.
@otavio Packer works on Hetzner in general to build snapshots; I just updated it for Ubuntu, see this repo.
In hcloud_server you can use a snapshot, so what needs to be done is to have a snapshot for MicroOS.
This snapshot could then also be used in the CRDs for the autoscaler, with a corresponding cloud-init.
Or am I missing something?
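In resource form, that idea would look roughly like the sketch below; the selector label and server attributes are made up for illustration:

```hcl
# Pick the most recent snapshot carrying the label, then boot a server
# straight from it, skipping the MicroOS install procedure.
data "hcloud_image" "microos_snapshot" {
  with_selector = "microos-base=true"
  most_recent   = true
}

resource "hcloud_server" "agent_from_snapshot" {
  name        = "agent-from-snapshot"
  server_type = "cpx21"
  image       = data.hcloud_image.microos_snapshot.id
  location    = "fsn1"
}
```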
@mysticaltech It's not about scaling the control planes, just about creating a proper snapshot out of the first node, to be used later.
@ifeulner I understand, but it's not needed; we can just ask Hetzner to create a snapshot of the first node in the last agent nodepool. But thanks for sharing those alternative ideas 🙏
That's right, but doing it on the first node would avoid the whole download-and-install procedure for the MicroOS image on the additional nodes.
@ifeulner I get this, but we must take a snapshot of a live image, as the conditions and configurations will be very similar. But noted; now we know of that other option we could leverage!
@BlakeB415 @JustinGuese @codeagencybe @ifeulner Thanks to the initial work of @Nigma1337, we now have PR #352, which at least deploys the autoscaler successfully.
Heavy testing is needed, and please do not hesitate to open more PRs pointing to the autoscaling branch; I will review and merge them ASAP.
I count on your cooperation, let's make this "dream" feature come true! ✨