terraform-hcloud-kube-hetzner
Use with cluster autoscaler
How would this be used with cluster autoscaler? Does this module support cluster autoscaler? If so, how would I configure it?
Hey @BlakeB415, we do not support cluster autoscaler yet! PRs are always welcome!
Chipping in on the matter: for that, I've been testing Kubermatic's KubeOne, which will get you a cluster on Hetzner with a Machine Controller able to scale the number of physical nodes the way you scale a Deployment.
That said, I very much prefer this project's simplicity and all the goodies it has, so seeing that functionality here would be amazing.
Maybe it is possible to install their machine controller on a cluster provisioned with this project; I haven't tried yet.
Yes, an autoscaling cluster would be the only missing link, and then it's perfect in my opinion... I don't know how it works, but could the master node contain/create a limited-usage Hetzner API token that triggers a scale-up on >80% load or something?
Probably not really secure, right?
@BlakeB415 cluster-autoscaler already seems to include a Hetzner cloud provider: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider
So it should, in theory, not be too much work? I'm open to taking a look if it's not too much.
@JustinGuese Exactly. I imagine that the Kubernetes-native autoscaler would be ideal to "sense" the need to autoscale, but this project runs on Terraform, so the actual autoscaling needs to be done by changing the node count at the nodepool level, in the kube.tf file. Recently I stumbled upon this project that maybe could help: https://github.com/runatlantis/atlantis.
Maybe combining the two would get us something working?! It would of course require us to run Atlantis somewhere. We could also couple that with a simple NextJS web app (or Python Streamlit app) that would let us manage all of that and bake in more "intelligence".
The latter option would require a side project; we could call it KubeHetzner UI, or something similar. But all these are just suppositions; maybe there is a shorter path, through a Rancher server for instance?!
Disclaimer: The following is just my view on things and may not be the best idea
The number of non-control-plane nodes is not something Terraform is good at managing; that should be the job of the autoscaler/machine controller. Ideally, we define the nodepools in Terraform like we are doing now, but then Kubernetes-native systems manage how many nodes per pool to provision.
The magic of how to provision nodes without a user armed with Terraform on their computer/CD system is coded in Kubermatic's Machine Controller, but how it actually works escapes me. From my limited testing of KubeOne, only the control plane is provisioned with Terraform, and the worker pools are all managed through the Machine Controller.
There's also https://github.com/syself/cluster-api-provider-hetzner, but it goes a step beyond and gets rid of Terraform entirely, which I also need to try.
@p4block It's very interesting, and we could do that by saving a worker node snapshot "image" during the deployment, and later letting the autoscaler use it to deploy new worker nodes.
Exactly how to do that needs to be researched; possibly it can all be done in Terraform, but at least I know it can be done via the hcloud CLI.
Really worth researching more! PRs welcome :)
@mysticaltech Yeah, with an external service for sure, but I guess we're leaving open source then...
Something I could imagine with the native solution would be to somehow grab the requirements from Terraform and insert them at the end. I guess the hardest part would be, like @p4block said, the image of a worker node...
The native scaler requires:

- HCLOUD_CLOUD_INIT: Base64-encoded cloud-init YAML with commands to join the cluster. Sample: [examples/cloud-init.txt (for Kubernetes 1.20.1)](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/hetzner/examples/cloud-init.txt)
- HCLOUD_IMAGE: Defaults to ubuntu-20.04, see https://docs.hetzner.cloud/#images. You can also use an image ID here (e.g. 15512617), or a label selector associated with a custom snapshot (e.g. customized_ubuntu=true); the most recent snapshot will be used in the latter case.
- HCLOUD_NETWORK: Default empty. The name of the network that is used in the cluster, see https://docs.hetzner.cloud/#networks
- HCLOUD_FIREWALL: Default empty. The name of the firewall that is used in the cluster, see https://docs.hetzner.cloud/#firewalls
- HCLOUD_SSH_KEY: Default empty. This SSH key will have access to the freshly created server, see https://docs.hetzner.cloud/#ssh-keys
Do you have an idea of whether we can grab these values? I mean, except for the image, it should be doable?
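For illustration, here is a rough sketch of where each of those values could come from inside this module. Every name on the right-hand side (local.agent_cloud_init, hcloud_network.k3s, and so on) is an assumption about internal names, not the module's actual code:

```hcl
# Hypothetical mapping of the autoscaler's environment variables to values
# the module already knows; all referenced names are assumptions.
locals {
  autoscaler_env = {
    HCLOUD_CLOUD_INIT = base64encode(local.agent_cloud_init) # join-cluster cloud-init already rendered for agents
    HCLOUD_IMAGE      = "kube-hetzner-autoscaler=true"       # label selector on the snapshot we would create
    HCLOUD_NETWORK    = hcloud_network.k3s.name              # the cluster's private network
    HCLOUD_FIREWALL   = hcloud_firewall.k3s.name             # the cluster firewall
    HCLOUD_SSH_KEY    = hcloud_ssh_key.k3s.name              # the key already uploaded to Hetzner
  }
}
```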
@JustinGuese Thanks for extracting and sharing those details. We have all these values!! The only thing missing is creating a snapshot at the end of the install and passing its ID to HCLOUD_IMAGE.
It's completely doable, but I just do not have the bandwidth to work on this right away; I will help the best I can, though.
That will help get us there: https://registry.terraform.io/providers/hetznercloud/hcloud/latest/docs/resources/snapshot.
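A minimal sketch of that resource, assuming the first agent node is addressable as hcloud_server.agents[0] (the real address in this module may differ):

```hcl
# Snapshot the first agent node and label it so the autoscaler's
# HCLOUD_IMAGE label selector can find it (names are assumptions).
resource "hcloud_snapshot" "autoscaler_image" {
  server_id   = hcloud_server.agents[0].id
  description = "kube-hetzner autoscaler base image"
  labels = {
    "kube-hetzner-autoscaler" = "true"
  }
}
```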
This problem has been simplified a lot by the above findings. Just one thing: ideally, we would create different snapshots for each kind of agent nodepool. That way, choosing the kind of server we want to autoscale with would be easy.
Folks, does no one want to give this a shot? It's not that hard, especially if we just choose to take a snapshot for the first nodepool definition. I will help, but I don't have the bandwidth to do it all by myself right away.
:D yeah same, I might have some time in September
After thinking about the networking aspect of this, we would need one dedicated nodepool for autoscaling (easy peasy) that we would not extend manually. We could, for instance, add an attribute autoscaling = true and select based on its value; if true, we select the right server network (which is, in fact, a subnet) dedicated to that autoscaling nodepool. So, of course, no more manual deploys on that nodepool; we would ignore the count attribute anyway.
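To make the idea concrete, a kube.tf nodepool list could look like the following sketch. The autoscaling attribute is only the proposal above, not something the module supports today, and the pool/server names are invented:

```hcl
agent_nodepools = [
  {
    name        = "agent-small"
    server_type = "cpx21"
    location    = "fsn1"
    count       = 2
  },
  {
    name        = "autoscaled"
    server_type = "cpx31"
    location    = "fsn1"
    count       = 1     # ignored for this pool; the autoscaler takes over
    autoscaling = true  # proposed flag selecting the dedicated subnet
  },
]
```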
So basically: deploy at least one node in that nodepool, snapshot it, and then prepare all the details the Hetzner autoscaler needs, as cited above by @JustinGuese. That will be used to feed the autoscaler.
That said, we would need, from the get-go, at least a max_number_nodes_autoscaler variable, so as to properly reserve the server_network IPs for up to that number of nodes in the subnet created just for the autoscaler nodepool.
(To understand the above, just see how the current nodepools are created and networked together.)
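A sketch of that reservation, assuming the module's network resource is called hcloud_network.k3s and that the base network covers the 10.255.0.0/16 range used for agents below; all names are assumptions for illustration:

```hcl
# Proposed variable plus a dedicated subnet holding the reserved IPs.
variable "max_number_nodes_autoscaler" {
  type        = number
  default     = 10
  description = "Upper bound of nodes the autoscaler may create; IPs are reserved up front."
}

resource "hcloud_network_subnet" "autoscaler" {
  network_id   = hcloud_network.k3s.id
  type         = "cloud"
  network_zone = "eu-central"
  ip_range     = "10.255.0.0/16" # room for far more than any sensible max
}
```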
This logic would need to live in an autoscaling.tf file, similar to agent.tf but not entirely the same.
Low-hanging fruit, folks! Pick it up, and I will help you!!
Is somebody already working on this autoscaling feature? Any progress? Stuck? Feedback?
Hello @codeagencybe, I have laid out above how I think it should flow, but I have not yet had time to work on that feature. It probably needs just a few hours, as the "path" seems obstacle-free.
I can't say when I will take this on, but in the meantime, if any of you folks want to give this a shot, please do so - I will be very responsive on the PR! 🙏
Hello @mysticaltech, if I had the knowledge to create it, I would have done it for you, but unfortunately I'm still learning this language, so I can't help you with the development. But I'm happy to help anywhere else I can: testing, documenting, translating, ... If there is anything like this that you need, let me know.
I'm quite interested in looking at this, but I can't really seem to find where the variables (snapshot ID, etc.) would be fed in. Here's the documentation I found: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/hetzner/README.md, which has this YAML: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/hetzner/examples/cluster-autoscaler-run-on-master.yaml. So, would we somehow create + apply the YAML from within Terraform, or what would your thoughts be, @mysticaltech?
@Nigma1337 Very glad to hear it. You are indeed correct: it does not accept the snapshot ID, but instead a label set on the snapshot.
About the YAML, the easiest solution would be to add it to the templates folder as hetzner_autoscaler_config.yaml.tpl, then load it and replace the needed values as is done for the other template files in agents.tf.
Maybe the logic is simple enough to just live in agents.tf, with no need to create another file as suggested above, but do what feels best at the moment of implementation.
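A sketch of that loading step, using templatefile() the way other templates are rendered in the module; the template name is the one suggested above, and the variables passed in are guesses:

```hcl
# Render the autoscaler manifest from a template, substituting the values
# the Hetzner autoscaler needs (all variable names are assumptions).
locals {
  autoscaler_manifest = templatefile(
    "${path.module}/templates/hetzner_autoscaler_config.yaml.tpl",
    {
      cloud_init   = base64encode(local.agent_cloud_init)
      image_label  = "kube-hetzner-autoscaler=true"
      network_name = hcloud_network.k3s.name
    }
  )
}
```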
@Nigma1337 I have just downloaded the GitHub mobile app, so I will be more responsive on this issue if you decide to go through with this. Let's get it done!!! 🚀 🤞
Made an initial commit on my fork: https://github.com/Nigma1337/terraform-hcloud-kube-hetzner/tree/autoscaling
I can't really seem to figure out how I'd do the logic of only installing the autoscaler if max nodes > 0; I've never worked on a Terraform project of this size.
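One common Terraform pattern for that is to drive count off the setting, sketched here on a null_resource stand-in (the real resource would be whatever applies the rendered manifest):

```hcl
# Only create the autoscaler resources when autoscaling is actually wanted:
# count = 0 skips the resource entirely.
resource "null_resource" "autoscaler" {
  count = var.max_autoscaling_count > 0 ? 1 : 0

  # provisioner/manifest application for the autoscaler would go here
}
```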
Wonderful, @Nigma1337; it's a good start. Please do open a draft PR, so that I can contribute to it too.
Something that should be changed is the snapshot.id; actually, that is not needed or wanted, since the selection is based on the unique label you give the snapshot (see above).
For the subnet logic, it's a bit delicate, as the 10.0.0.0/16 CIDR is compartmentalized in a certain way for control plane and agent nodepools. Basically, control planes are 10.0.0..., 10.1.0..., 10.2.0..., and agents 10.255.0..., 10.254.0..., 10.253.0... So in our case, we can just take 10.255.0... for the initial support of one and only one autoscaling nodepool. Later on, we can expand to up to 10 autoscaling nodepools, but let's start simple first, haha. I propose contributing that logic, but please do not hesitate to try. You can find examples of how those CIDRs are calculated in control-planes.tf and agents.tf.
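For what it's worth, Terraform's cidrsubnet() can express that numbering directly; this sketch assumes a 10.0.0.0/8 base network, which is what the second-octet pattern above implies:

```hcl
locals {
  # Agents count down from 255, so the single autoscaling nodepool can take
  # the top slot: cidrsubnet("10.0.0.0/8", 8, 255) => "10.255.0.0/16".
  autoscaler_subnet_cidr = cidrsubnet("10.0.0.0/8", 8, 255)
}
```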
Also, the way I see the logic coming together, it's best to separate it into its own autoscaling.tf; there is no need for it to be in either agents.tf or main.tf.
And to simplify matters even more, let's just NOT give any definition of what that autoscaling nodepool might be. Let's just copy the definition of the last agent nodepool.
That way, we would need only 3 variables: min_autoscaling_count, max_autoscaling_count, and enable_autoscaling (boolean). Please correct me if I am wrong!
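Those three knobs, written out as plain variable blocks (the names are the ones proposed above; the defaults are my guesses):

```hcl
variable "enable_autoscaling" {
  type        = bool
  default     = false
  description = "Deploy the Hetzner cluster autoscaler for the copied nodepool."
}

variable "min_autoscaling_count" {
  type    = number
  default = 0
}

variable "max_autoscaling_count" {
  type    = number
  default = 3
}
```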
This is looking good, my friend! This baby is coming soon 🚀
Wouldn't it be nicer to create a snapshot with Packer and use that snapshot then? Or use a snapshot from the first control-plane node (which is always needed)?
As far as I know, Packer isn't supported on Hetzner. @ifeulner have you ever done it?
As @otavio said, Packer is not supported by the Kubernetes autoscaler (at least I do not see it working in that context). And why would you want to scale your control planes? We could do that later on, but for now, scaling a particular agent nodepool (copying its definition at least, like the first or last one) is the priority, as it would provide something that works.
@ifeulner What you seem to want to do (deploy more nodes after the initial launch), you can already do right now, just by adding nodepools or increasing the count of already-present nodepools.
@otavio Packer works on Hetzner in general to build snapshots; I just updated it for Ubuntu, see this repo.
In hcloud_server you can use a snapshot, so what needs to be done is to have a snapshot for MicroOS.
This snapshot could then also be used in the CRDs for the autoscaler, with a corresponding cloud-init.
Or am I missing something?
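In resource form, that idea would look roughly like the sketch below; the selector label and server attributes are made up for illustration:

```hcl
# Pick the most recent snapshot carrying the label, then boot a server
# straight from it, skipping the MicroOS install procedure.
data "hcloud_image" "microos_snapshot" {
  with_selector = "microos-base=true"
  most_recent   = true
}

resource "hcloud_server" "agent_from_snapshot" {
  name        = "agent-from-snapshot"
  server_type = "cpx21"
  image       = data.hcloud_image.microos_snapshot.id
  location    = "fsn1"
}
```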
@mysticaltech It's not about scaling the control planes, just about creating a proper snapshot out of the first node, to be used later.
@ifeulner I understand, but it's not needed; we can just ask Hetzner to create a snapshot of the first node in the last agent nodepool. But thanks for sharing those alternative ideas 🙏
That's right, but doing it on the first node would avoid the whole download-and-install procedure for the MicroOS image on the additional nodes.
@ifeulner I get this, but we must take a snapshot of a live image, as the conditions and configurations will be very similar. But noted; now we know of that other option we could leverage!
@BlakeB415 @JustinGuese @codeagencybe @ifeulner Thanks to the initial work of @Nigma1337, we now have PR #352, which at least deploys the autoscaler successfully.
Heavy testing is needed, and please do not hesitate to open more PRs pointing to the autoscaling branch; I will review and merge them ASAP.
I count on your cooperation, let's make this "dream" feature come true! ✨