
System-upgrade-controller fails to run

Open jniebuhr opened this issue 1 year ago • 4 comments

Description

Since a recent upgrade of the module, my system-upgrade-controller no longer manages to perform any upgrades. This is the error message:

level=error msg="error syncing 'system-upgrade/k3s-agent': handler system-upgrade-controller: Get \"https://update.k3s.io/v1-release/channels/v1.29\": tls: failed to verify certificate: x509: failed to load system roots and no roots provided; open /etc/ssl/certs/ca-certificates.crt: permission denied, requeuing"

The file does not exist on the host; instead there is only a ca-bundle.pem. I have recreated the system-upgrade-controller by tainting the kustomization resource, but that did not fix it. I wonder whether my system images are simply too old and I need to recreate my control-plane nodes?
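A quick way to check which of the candidate bundle locations exist on a node and what their permissions are (the first path is taken from the error message, the second is the openSUSE default; adjust for your image):

```shell
# Check the CA bundle locations involved: the path from the error
# message and the openSUSE default location.
for f in /etc/ssl/certs/ca-certificates.crt /etc/ssl/ca-bundle.pem; do
  if [ -e "$f" ]; then
    ls -lL "$f"   # -L follows symlinks so the target's permissions show
  else
    echo "missing: $f"
  fi
done
```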

Kube.tf file

module "cluster" {
  providers = {
    hcloud = hcloud
  }

  source  = "kube-hetzner/kube-hetzner/hcloud"
  version = "2.13.5"

  hcloud_token    = var.hcloud_token
  ssh_public_key  = file(var.public_key)
  ssh_private_key = file(var.private_key)
  network_region  = "eu-central"

  extra_firewall_rules = [
    **
  ]

  k3s_global_kubelet_args = [
    "kube-reserved=cpu=100m,memory=200Mi,ephemeral-storage=1Gi", "system-reserved=cpu=100m,memory=200Mi", "image-gc-high-threshold=50",
    "image-gc-low-threshold=40", "allowed-unsafe-sysctls=net.ipv4.ip_forward,net.ipv6.conf.all.disable_ipv6"
  ]

  control_plane_nodepools = [
    {
      name        = "control-plane-fsn1",
      server_type = "cpx11",
      location    = "fsn1",
      labels      = [],
      taints      = [],
      count       = 1
    },
    {
      name        = "control-plane-nbg1",
      server_type = "cpx11",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 1
    },
    {
      name        = "control-plane-hel1",
      server_type = "cpx11",
      location    = "hel1",
      labels      = [],
      taints      = [],
      count       = 1
    }
  ]

  agent_nodepools = [
    {
      name        = "ag0",
      server_type = "cpx31",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      count       = 2
    },
    {
      name        = "ag1",
      server_type = "cpx41",
      location    = "fsn1",
      labels      = ["*/designation=*"],
      taints      = ["node.kubernetes.io/role=*:NoSchedule"],
      count       = 1
    },
  ]

  load_balancer_type     = "lb11"
  load_balancer_location = "nbg1"

  automatically_upgrade_k3s = true
  enable_cert_manager       = true
  ingress_controller        = "none"

  cluster_name = "***"
}

Screenshots

No response

Platform

Mac

jniebuhr avatar Apr 06 '24 11:04 jniebuhr

@jniebuhr Make sure to upgrade all providers and modules with terraform init -upgrade, then apply and try again.

If that fails, please remove both extra_firewall_rules and k3s_global_kubelet_args and try again too.

mysticaltech avatar Apr 09 '24 02:04 mysticaltech

If the above does not work either, you could also try upgrading the node itself; see our readme.

mysticaltech avatar Apr 09 '24 02:04 mysticaltech

Quick fix: run the container privileged (note that containers must be a list in the Deployment spec):

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: system-upgrade-controller
          image: rancher/system-upgrade-controller:v0.13.4
          securityContext:
            privileged: true

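If running the controller privileged is a concern, a narrower alternative would be to mount the host's bundle into the pod at the path the controller expects. This is an untested sketch; it assumes the host keeps its trust store at /etc/ssl/ca-bundle.pem, as reported above:

```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: system-upgrade-controller
          image: rancher/system-upgrade-controller:v0.13.4
          volumeMounts:
            # Expose the host bundle at the path the error message shows
            # the controller trying to open
            - name: host-ca-bundle
              mountPath: /etc/ssl/certs/ca-certificates.crt
              readOnly: true
      volumes:
        - name: host-ca-bundle
          hostPath:
            path: /etc/ssl/ca-bundle.pem  # assumed host location
            type: File
```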
kossmac avatar Apr 26 '24 07:04 kossmac

@kossmac Super interesting tip. Did you have that issue too? So you can confirm that this solution works?

@M4t7e Do you think safety wise that's ok?

mysticaltech avatar May 01 '24 08:05 mysticaltech

This issue seems rare enough, and the fix above is interesting. Closing for now; will reopen if the issue gets more activity.

mysticaltech avatar May 23 '24 18:05 mysticaltech