terraform-hcloud-kube-hetzner
The "allow_scheduling_on_control_plane = true" flag causes Timeout during creation
Hi, I am having a timeout issue when creating a cluster. This is the tail of the output:
module.kube-hetzner.null_resource.kustomization (remote-exec): Waiting for the system-upgrade-controller deployment to become available...
module.kube-hetzner.null_resource.kustomization (remote-exec): + kubectl -n system-upgrade wait --for=condition=available --timeout=180s deployment/system-upgrade-controller
module.kube-hetzner.null_resource.kustomization: Still creating... [30s elapsed]
...
module.kube-hetzner.null_resource.kustomization: Still creating... [3m21s elapsed]
module.kube-hetzner.null_resource.kustomization (remote-exec): error: timed out waiting for the condition on deployments/system-upgrade-controller
╷
│ Error: remote-exec provisioner error
│
│   with module.kube-hetzner.null_resource.kustomization,
│   on .terraform/modules/kube-hetzner/init.tf line 231, in resource "null_resource" "kustomization":
│  231: provisioner "remote-exec" {
│
│ error executing "/tmp/terraform_39240884.sh": Process exited with status 1
╵
I have retried a few times. Also destroyed and retried again.
Here is my kube.tf:
locals {
  hcloud_token = "<token>"
}

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  hcloud_token = local.hcloud_token
  source       = "kube-hetzner/kube-hetzner/hcloud"

  ssh_public_key  = file("/home/username/.ssh/id_ed25519.pub")
  ssh_private_key = file("/home/username/.ssh/id_ed25519")
  network_region  = "us-east"

  control_plane_nodepools = [
    {
      name        = "control-plane",
      server_type = "cpx11",
      location    = "ash",
      labels      = [],
      taints      = [],
      count       = 1
    }
  ]

  agent_nodepools = [
    {
      name        = "agent-small",
      server_type = "cpx11",
      location    = "ash",
      labels      = [],
      taints      = [],
      count       = 0
    }
  ]

  load_balancer_type     = "lb11"
  load_balancer_location = "ash"

  allow_scheduling_on_control_plane = true
  enable_cert_manager               = true
  cluster_name                      = "k3s"

  extra_firewall_rules = [
    # For Postgres
    {
      direction       = "in"
      protocol        = "tcp"
      port            = "5432"
      source_ips      = ["<my_ip>/32", "::/0"]
      destination_ips = [] # Won't be used for this rule
    },
    # To allow ArgoCD access to resources via SSH
    {
      direction       = "out"
      protocol        = "tcp"
      port            = "22"
      source_ips      = [] # Won't be used for this rule
      destination_ips = ["0.0.0.0/0", "::/0"]
    }
  ]
}

provider "hcloud" {
  token = local.hcloud_token
}

terraform {
  required_version = ">= 1.2.0"
  required_providers {
    hcloud = {
      source  = "hetznercloud/hcloud"
      version = ">= 1.35.1"
    }
  }
}
@cc-nogueira Thanks for uncovering that issue. It appears to be caused by the "LegacyNodeRoleBehavior=false" CCM feature gate, which is applied when allow_scheduling_on_control_plane is set to true and is not compatible with the Kubernetes libraries used in the latest Hetzner CCM.
So, the only quick fix for now is to set that flag to false! I will investigate whether there are other ways to allow scheduling on the control plane nodes so that the LB picks them up as targets.
@cc-nogueira Alternatively, you can keep that flag set to true and pin an earlier version of the CCM with hetzner_ccm_version = "v1.12.1". This could very well work, please let me know!
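For illustration, inside the module block that combination would look roughly like this (a sketch of the workaround above, not a complete configuration):

  # Sketch: keep scheduling on the control plane enabled while pinning
  # the CCM to the older release mentioned above.
  allow_scheduling_on_control_plane = true
  hetzner_ccm_version               = "v1.12.1"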
It's confirmed that with the above, everything works fine! I have updated the docs to include the ccm version when setting allow_scheduling_on_control_plane to true.

I will keep monitoring for other solutions that would allow us to use the latest CCM in that instance.
Beautiful! All right here with:
hetzner_ccm_version="v1.12.1"
Thanks Karim
The CCM version fix should now be removed; everything works out of the box starting in v1.5.10.
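Put differently, once on a recent enough release you can pin the module itself instead, roughly like this (a minimal sketch; the version constraint is only an assumption based on the v1.5.10 mentioned above):

module "kube-hetzner" {
  source  = "kube-hetzner/kube-hetzner/hcloud"
  version = ">= 1.5.10" # assumed constraint, based on the release mentioned above
  # ... rest of the configuration as before, with no hetzner_ccm_version pin ...
}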
Sorry for touching this closed issue again, but I still get this error:
module.kube-hetzner.null_resource.kustomization (remote-exec): + kubectl -n system-upgrade wait --for=condition=available --timeout=180s deployment/system-upgrade-controller
...
module.kube-hetzner.null_resource.kustomization (remote-exec): error: timed out waiting for the condition on deployments/system-upgrade-controller
Extract from my kube.tf file:
load_balancer_type = "lb11"
load_balancer_location = "nbg1"
enable_klipper_metal_lb = "true"
allow_scheduling_on_control_plane = true
automatically_upgrade_k3s = true
automatically_upgrade_os = false
enable_cert_manager = true
If I understand the fix correctly, setting allow_scheduling_on_control_plane to true should trigger the correct behavior.
What am I doing wrong here?
Thanks a lot, Thomas
@thomasletsch Yes, please share your full kube.tf (without the sensitive values, and stripped of comments please), because if you are using Rancher you need nodes of a certain size; it does not work with anything smaller than cx21.
And make sure to remove the pinned CCM version; it needs to pick up the latest and greatest on its own. Last but not least, please make sure to use the latest release of the module!
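In practice that just means deleting (or commenting out) the earlier workaround in the module block, something like this (sketch only):

  # Remove the pin so the module installs its default, up-to-date CCM:
  # hetzner_ccm_version = "v1.12.1"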
Thanks for looking into this! Here it comes:
locals {
  hcloud_token = ""
}

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  hcloud_token = local.hcloud_token
  source       = "kube-hetzner/kube-hetzner/hcloud"

  ssh_public_key    = file("/home/thomas/.ssh/hetzner-key.pub")
  ssh_private_key   = file("/home/thomas/.ssh/hetzner-key")
  hcloud_ssh_key_id = "Hetzner Key"
  network_region    = "eu-central" # change to `us-east` if location is ash
  cluster_name      = "homemeter"

  control_plane_nodepools = [
    {
      name        = "master",
      server_type = "cx21",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 1
    }
  ]

  agent_nodepools = [
    {
      name        = "agent",
      server_type = "cx21",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 0
    }
  ]

  load_balancer_type     = "lb11"
  load_balancer_location = "nbg1"

  enable_klipper_metal_lb           = "true"
  allow_scheduling_on_control_plane = true
  automatically_upgrade_k3s         = true
  automatically_upgrade_os          = false
  enable_cert_manager               = true
}

provider "hcloud" {
  token = local.hcloud_token
}

terraform {
  required_version = ">= 1.2.0"
  required_providers {
    hcloud = {
      source  = "hetznercloud/hcloud"
      version = ">= 1.35.1"
    }
  }
}

output "kubeconfig" {
  value     = module.kube-hetzner.kubeconfig_file
  sensitive = true
}
Good news, it's working! I tried several things; in the end I tried again with the CCM version set to 1.2.1 and it failed at some later point. Without cleaning everything up, I then tried it without the fixed version, and this time it went through.
Now, even after cleaning up all Terraform objects, it works without any problem. I have no clue what the problem really was, but since it now works reproducibly (and for everyone else as well, if I see it correctly), I am very happy with everything.
Anyway thanks for the great tool and the support!