terraform-hcloud-kube-hetzner
The "allow_scheduling_on_control_plane = true" flag causes Timeout during creation
Hi, I am having a timeout issue when creating a cluster. This is the tail of the output:
module.kube-hetzner.null_resource.kustomization (remote-exec): Waiting for the system-upgrade-controller deployment to become available...
module.kube-hetzner.null_resource.kustomization (remote-exec): + kubectl -n system-upgrade wait --for=condition=available --timeout=180s deployment/system-upgrade-controller
module.kube-hetzner.null_resource.kustomization: Still creating... [30s elapsed]
...
module.kube-hetzner.null_resource.kustomization: Still creating... [3m21s elapsed]
module.kube-hetzner.null_resource.kustomization (remote-exec): error: timed out waiting for the condition on deployments/system-upgrade-controller
╷
│ Error: remote-exec provisioner error
│
│   with module.kube-hetzner.null_resource.kustomization,
│   on .terraform/modules/kube-hetzner/init.tf line 231, in resource "null_resource" "kustomization":
│  231: provisioner "remote-exec" {
│
│ error executing "/tmp/terraform_39240884.sh": Process exited with status 1
╵
I have retried a few times. Also destroyed and retried again.
Here is my kube.tf:
locals {
  hcloud_token = "<token>"
}

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  hcloud_token = local.hcloud_token
  source       = "kube-hetzner/kube-hetzner/hcloud"

  ssh_public_key  = file("/home/username/.ssh/id_ed25519.pub")
  ssh_private_key = file("/home/username/.ssh/id_ed25519")
  network_region  = "us-east"

  control_plane_nodepools = [
    {
      name        = "control-plane",
      server_type = "cpx11",
      location    = "ash",
      labels      = [],
      taints      = [],
      count       = 1
    }
  ]

  agent_nodepools = [
    {
      name        = "agent-small",
      server_type = "cpx11",
      location    = "ash",
      labels      = [],
      taints      = [],
      count       = 0
    }
  ]

  load_balancer_type     = "lb11"
  load_balancer_location = "ash"

  allow_scheduling_on_control_plane = true
  enable_cert_manager               = true
  cluster_name                      = "k3s"

  extra_firewall_rules = [
    # For Postgres
    {
      direction       = "in"
      protocol        = "tcp"
      port            = "5432"
      source_ips      = ["<my_ip>/32", "::/0"]
      destination_ips = [] # Won't be used for this rule
    },
    # To allow ArgoCD access to resources via SSH
    {
      direction       = "out"
      protocol        = "tcp"
      port            = "22"
      source_ips      = [] # Won't be used for this rule
      destination_ips = ["0.0.0.0/0", "::/0"]
    }
  ]
}

provider "hcloud" {
  token = local.hcloud_token
}

terraform {
  required_version = ">= 1.2.0"
  required_providers {
    hcloud = {
      source  = "hetznercloud/hcloud"
      version = ">= 1.35.1"
    }
  }
}
@cc-nogueira Thanks for uncovering that issue. It appears to be caused by the "LegacyNodeRoleBehavior=false" CCM feature gate, which is applied when allow_scheduling_on_control_plane is set to true and is not compatible with the Kubernetes libraries used in the latest Hetzner CCM.
So, the only quick fix for now is to set that flag to false! I will investigate whether there are other ways to allow scheduling on the control plane nodes so that the LB picks them up as targets.
@cc-nogueira Alternatively, you can keep that flag set to true and pin an earlier version of the CCM with hetzner_ccm_version = "v1.12.1". This could very well work, please let me know!
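For illustration, inside the module block that combination would look roughly like this (a sketch of the workaround above, not a complete configuration):

  # Sketch: keep scheduling on the control plane enabled while pinning
  # the CCM to the older release mentioned above.
  allow_scheduling_on_control_plane = true
  hetzner_ccm_version               = "v1.12.1"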
It's confirmed that with the above, everything works fine! I have updated the docs to include the ccm version when setting allow_scheduling_on_control_plane to true.

I will keep monitoring for other solutions that would allow us to use the latest CCM in that instance.
Beautiful! All right here with:
hetzner_ccm_version="v1.12.1"
Thanks Karim
The CCM version fix should now be removed; everything works out of the box starting in v1.5.10.
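Put differently, once on a recent enough release you can pin the module itself instead, roughly like this (a minimal sketch; the version constraint is only an assumption based on the v1.5.10 mentioned above):

module "kube-hetzner" {
  source  = "kube-hetzner/kube-hetzner/hcloud"
  version = ">= 1.5.10" # assumed constraint, based on the release mentioned above
  # ... rest of the configuration as before, with no hetzner_ccm_version pin ...
}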
Sorry for touching this closed issue again, but I still get this error:
module.kube-hetzner.null_resource.kustomization (remote-exec): + kubectl -n system-upgrade wait --for=condition=available --timeout=180s deployment/system-upgrade-controller
...
module.kube-hetzner.null_resource.kustomization (remote-exec): error: timed out waiting for the condition on deployments/system-upgrade-controller
Extract from my kube.tf file:
load_balancer_type = "lb11"
load_balancer_location = "nbg1"
enable_klipper_metal_lb = "true"
allow_scheduling_on_control_plane = true
automatically_upgrade_k3s = true
automatically_upgrade_os = false
enable_cert_manager = true
If I understand the fix correctly, setting allow_scheduling_on_control_plane to true should trigger the correct behavior.
What am I doing wrong here?
Thanks a lot, Thomas
@thomasletsch Yes, please share your full kube.tf (without the sensitive values, and stripped of comments please), because if you are using Rancher you need nodes of a certain size; it does not work with anything smaller than cx21.
And make sure to remove the pinned CCM version; it needs to pick up the latest and greatest on its own. Last but not least, please make sure to use the latest release of the module!
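In practice that just means deleting (or commenting out) the earlier workaround in the module block, something like this (sketch only):

  # Remove the pin so the module installs its default, up-to-date CCM:
  # hetzner_ccm_version = "v1.12.1"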
Thanks for looking into this! Here it comes:
locals {
  hcloud_token = ""
}

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  hcloud_token = local.hcloud_token
  source       = "kube-hetzner/kube-hetzner/hcloud"

  ssh_public_key    = file("/home/thomas/.ssh/hetzner-key.pub")
  ssh_private_key   = file("/home/thomas/.ssh/hetzner-key")
  hcloud_ssh_key_id = "Hetzner Key"
  network_region    = "eu-central" # change to `us-east` if location is ash
  cluster_name      = "homemeter"

  control_plane_nodepools = [
    {
      name        = "master",
      server_type = "cx21",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 1
    }
  ]

  agent_nodepools = [
    {
      name        = "agent",
      server_type = "cx21",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 0
    }
  ]

  load_balancer_type     = "lb11"
  load_balancer_location = "nbg1"

  enable_klipper_metal_lb           = "true"
  allow_scheduling_on_control_plane = true
  automatically_upgrade_k3s         = true
  automatically_upgrade_os          = false
  enable_cert_manager               = true
}

provider "hcloud" {
  token = local.hcloud_token
}

terraform {
  required_version = ">= 1.2.0"
  required_providers {
    hcloud = {
      source  = "hetznercloud/hcloud"
      version = ">= 1.35.1"
    }
  }
}

output "kubeconfig" {
  value     = module.kube-hetzner.kubeconfig_file
  sensitive = true
}
Good news, it's working! I tried several things; in the end I tried again with the CCM version set to 1.2.1 and it failed at some later point. Without cleaning everything up, I then tried it without the fixed version, and this time it went through.
Now, even after cleaning up all Terraform objects, it works without any problem. I have no clue what the problem really was, but since it now works reproducibly (and for everyone else as well, if I see it correctly), I am very happy with everything.
Anyway thanks for the great tool and the support!