
CAPH Controller panics during HetznerCluster reconciliation

Open · Dual-0 opened this issue 6 months ago · 2 comments

Description:

When attempting to provision a Kubernetes cluster on Hetzner Cloud using Cluster API (CAPI) with Cluster API Provider Hetzner (CAPH), the CAPH controller (caph-controller-manager) enters a CrashLoopBackOff state shortly after the HetznerCluster resource is applied.

The controller logs show a persistent panic (runtime error: invalid memory address or nil pointer dereference) that occurs specifically in the load-balancer code, in github.com/syself/cluster-api-provider-hetzner/pkg/services/hcloud/loadbalancer.createOptsFromSpec.

This panic prevents the controller from reconciling the HetznerCluster object and subsequently provisioning any infrastructure resources on Hetzner Cloud. The CAPH webhook service also becomes unavailable as a result, leading to webhook errors when applying/updating resources.
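
For reference, the crash loop and the panic can be observed with plain kubectl (a minimal sketch; the caph-system namespace comes from the report above, and the Deployment is assumed to be named caph-controller-manager like the pod):

# Watch the controller pod restart and enter CrashLoopBackOff
kubectl -n caph-system get pods -w

# Print the panic trace from the previously crashed container
kubectl -n caph-system logs deploy/caph-controller-manager --previous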

Steps to reproduce:

  1. Set up a local Kubernetes cluster (using Kind) to serve as the Cluster API management cluster.
  2. Install Cluster API core components.
  3. Install Cluster API Bootstrap Provider Talos (CABPT) and Cluster API Control Plane Provider Talos (CACPPT).
  4. Install Cluster API Provider Hetzner (CAPH) version [Insert CAPH Version Here, e.g., from Image Tag].
  5. Create a Kubernetes Secret in the management cluster with the Hetzner Cloud API token (see the command sketched after this list).
  6. Apply a manifest defining a CAPI Cluster and associated Hetzner/Talos infrastructure resources (see Configuration section below). This manifest includes a HetznerCluster resource with controlPlaneLoadBalancer.enabled: false and a placeholder controlPlaneEndpoint.
  7. Observe the logs of the caph-controller-manager pod in the caph-system namespace.
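
A sketch of step 5, using the secret name (hetzner) and key (hcloud) that hetznerSecretRef in the manifest below expects; HCLOUD_TOKEN is assumed to hold a read/write Hetzner Cloud API token:

# Create the secret that the HetznerCluster's hetznerSecretRef points to
kubectl -n default create secret generic hetzner \
  --from-literal=hcloud=$HCLOUD_TOKEN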

Expected behavior:

The CAPH controller should successfully reconcile the HetznerCluster resource, interact with the Hetzner Cloud API to provision the necessary infrastructure (like the private network and placement groups), and proceed without panicking.

Actual behavior:

The caph-controller-manager pod crashes repeatedly and enters CrashLoopBackOff. Logs show a panic: runtime error: invalid memory address or nil pointer dereference traceback related to load balancer logic. The HetznerCluster resource remains in a state indicating provisioning issues, and no infrastructure is provisioned on Hetzner Cloud.

Environment:

.env

export CLUSTER_NAME="my-cluster" # Choose a unique name for your cluster
export HCLOUD_REGION="fsn1" # Your desired server region (e.g., fsn1)
export HCLOUD_NETWORK_ZONE="eu-central" # Network Zone (must be one of the supported values from the error)
export HCLOUD_SSH_KEY="cluster" # The name of the SSH key in your Hetzner project
export HCLOUD_CONTROL_PLANE_MACHINE_TYPE="cax11" # Server type for control planes
export HCLOUD_WORKER_MACHINE_TYPE="cax21" # Server type for workers
export HCLOUD_TALOS_IMAGE_ID="ce4c980550dd2ab1b17bbf2b08801c7eb59418eafe8f279833297925d67c7515" # The Image ID you found in Step 2a
export CONTROL_PLANE_MACHINE_COUNT=1 # Number of control plane nodes (recommended 3 for HA)
export WORKER_MACHINE_COUNT=1 # Initial number of worker nodes (can be 0 for autoscaling later)
export KUBERNETES_VERSION="v1.31.8" # The Kubernetes version
export TALOS_VERSION="v1.10" # The Talos version compatible with your K8s version

hetzner-talos-cluster.yaml

# gitops/infrastructure/hetzner-talos-cluster.yaml

---
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: ${CLUSTER_NAME}
  namespace: default # Or your target namespace
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
        - 10.244.0.0/16 # Default for many CNIs, adjust if needed
    services:
      cidrBlocks:
        - 10.96.0.0/12 # Default for many CNIs, adjust if needed
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
    kind: TalosControlPlane
    name: ${CLUSTER_NAME}-controlplane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: HetznerCluster
    name: ${CLUSTER_NAME}
---
# HetznerCluster resource - using v1beta1
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: HetznerCluster
metadata:
  name: ${CLUSTER_NAME}
  namespace: default
spec:
  hcloudNetwork:
    enabled: true
    networkZone: ${HCLOUD_NETWORK_ZONE} # e.g., "eu-central"
    # cidrBlock: 10.0.0.0/16 # Optional: specify if you want a specific private network CIDR
    # subnetCidrBlock: 10.0.0.0/24 # Optional: specify if you want a specific subnet CIDR

  controlPlaneLoadBalancer:
    enabled: true
    region: ${HCLOUD_REGION} # not in docs but needed
    type: lb11 # Specify the Load Balancer type (e.g., lb11, lb21)

  sshKeys:
    hcloud:
      - name: ${HCLOUD_SSH_KEY}

  hetznerSecretRef: # Reference the secret containing Hetzner token
    name: hetzner # Name of the secret in the management cluster (default namespace)
    key:
      hcloudToken: hcloud # Assuming your secret key is named 'hcloud'

  hcloudPlacementGroups:
  - name: ${CLUSTER_NAME}-controlplane
    type: spread
  - name: ${CLUSTER_NAME}-worker
    type: spread

  controlPlaneRegions:
    - "${HCLOUD_REGION}" # Use your defined region here, must be an array

---
# Template for Control Plane Machines' infrastructure (HCloud specific) - using v1beta1
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: HCloudMachineTemplate
metadata:
  name: ${CLUSTER_NAME}-controlplane
  namespace: default
spec:
  template:
    spec:
      type: ${HCLOUD_CONTROL_PLANE_MACHINE_TYPE}
      imageName: ${HCLOUD_TALOS_IMAGE_ID} # Use the Image/Snapshot ID
      placementGroupName: ${CLUSTER_NAME}-controlplane
---
# Template for Worker Machines' infrastructure (HCloud specific) - using v1beta1
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: HCloudMachineTemplate
metadata:
  name: ${CLUSTER_NAME}-worker
  namespace: default
spec:
  template:
    spec:
      type: ${HCLOUD_WORKER_MACHINE_TYPE}
      imageName: ${HCLOUD_TALOS_IMAGE_ID} # Use the Image/Snapshot ID
      placementGroupName: ${CLUSTER_NAME}-worker
---
# Talos bootstrap and control plane configuration for Control Plane nodes
# Use v1alpha3 for TalosControlPlane
apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: TalosControlPlane
metadata:
  name: ${CLUSTER_NAME}-controlplane
  namespace: default
spec:
  replicas: ${CONTROL_PLANE_MACHINE_COUNT}
  version: ${KUBERNETES_VERSION}
  infrastructureTemplate:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: HCloudMachineTemplate
    name: ${CLUSTER_NAME}-controlplane
  controlPlaneConfig:
    controlplane:
      generateType: controlplane
      talosVersion: ${TALOS_VERSION}
      strategicPatches:
        - |
          cluster:
            externalCloudProvider:
              enabled: true
        - |
          cluster:
            network:
              cni: null
---
# MachineDeployment for Worker nodes - using v1beta1
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: ${CLUSTER_NAME}-worker-pool
  namespace: default
spec:
  clusterName: ${CLUSTER_NAME}
  replicas: ${WORKER_MACHINE_COUNT}
  selector:
    matchLabels: null
  template:
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
          kind: TalosConfigTemplate
          name: ${CLUSTER_NAME}-worker
      clusterName: ${CLUSTER_NAME}
      version: ${KUBERNETES_VERSION}
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: HCloudMachineTemplate
        name: ${CLUSTER_NAME}-worker
---
# Talos bootstrap configuration template for Worker nodes
# Use v1alpha3 for TalosConfigTemplate
apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
kind: TalosConfigTemplate
metadata:
  name: ${CLUSTER_NAME}-worker
  namespace: default
spec:
  template:
    spec:
      # These fields were correct as per the TalosConfigTemplate schema
      generateType: worker
      talosVersion: ${TALOS_VERSION}
      strategicPatches:
        - |
          cluster:
            externalCloudProvider:
              enabled: true
        - |
          cluster:
            network:
              cni: null
---
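
For completeness, a sketch of how the variables from the .env file above get substituted into this manifest and applied to the management cluster (assuming envsubst is used; clusterctl generate yaml works similarly):

# Load the variables (every line in .env already uses export)
source .env
# Substitute the ${...} placeholders and apply the rendered manifest
envsubst < gitops/infrastructure/hetzner-talos-cluster.yaml | kubectl apply -f -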

Logs:

{"level":"INFO","time":"2025-05-07T14:29:15.090Z","file":"controllers/hcloudmachinetemplate_controller.go:92","message":"HCloudMachineTemplate is missing ownerRef to cluster or cluster does not exist default/my-cluster-controlplane","controller":"hcloudmachinetemplate","controllerGroup":"infrastructure.cluster.x-k8s.io","controllerKind":"HCloudMachineTemplate","HCloudMachineTemplate":{"name":"my-cluster-controlplane","namespace":"default"},"namespace":"default","name":"my-cluster-controlplane","reconcileID":"67aa9fa6-e7de-4148-8a32-2240d1ca9ab8","HCloudMachineTemplate":{"name":"my-cluster-controlplane","namespace":"default"}}
{"level":"INFO","time":"2025-05-07T14:29:15.172Z","file":"controllers/hcloudmachinetemplate_controller.go:92","message":"HCloudMachineTemplate is missing ownerRef to cluster or cluster does not exist default/my-cluster-controlplane","controller":"hcloudmachinetemplate","controllerGroup":"infrastructure.cluster.x-k8s.io","controllerKind":"HCloudMachineTemplate","HCloudMachineTemplate":{"name":"my-cluster-controlplane","namespace":"default"},"namespace":"default","name":"my-cluster-controlplane","reconcileID":"f13c8e01-b69c-4731-ac9b-8dccc4fdc35a","HCloudMachineTemplate":{"name":"my-cluster-controlplane","namespace":"default"}}
{"level":"INFO","time":"2025-05-07T14:29:15.211Z","file":"controller/controller.go:110","message":"Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference","controller":"hetznercluster","controllerGroup":"infrastructure.cluster.x-k8s.io","controllerKind":"HetznerCluster","HetznerCluster":{"name":"my-cluster","namespace":"default"},"namespace":"default","name":"my-cluster","reconcileID":"85eab0d5-cbf7-4fb8-bae3-6f967bab82f7"}
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x144e678]

goroutine 385 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
        sigs.k8s.io/controller-runtime@<version>/pkg/internal/controller/controller.go:111 +0x19c
panic({0x16a3180?, 0x2bbdda0?})
        runtime/panic.go:791 +0x124
github.com/syself/cluster-api-provider-hetzner/pkg/services/hcloud/loadbalancer.createOptsFromSpec(0x400061d508)
        github.com/syself/cluster-api-provider-hetzner/pkg/services/hcloud/loadbalancer/loadbalancer.go:326 +0x1b8
github.com/syself/cluster-api-provider-hetzner/pkg/services/hcloud/loadbalancer.(*Service).createLoadBalancer(0x4000aa5828, {0x1c60e28, 0x40008152f0})
        github.com/syself/cluster-api-provider-hetzner/pkg/services/hcloud/loadbalancer/loadbalancer.go:290 +0x3c
github.com/syself/cluster-api-provider-hetzner/pkg/services/hcloud/loadbalancer.(*Service).Reconcile(0x4000aa5828, {0x1c60e28, 0x40008152f0})
        github.com/syself/cluster-api-provider-hetzner/pkg/services/hcloud/loadbalancer/loadbalancer.go:81 +0x1f0
github.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerClusterReconciler).reconcileNormal(0x400038a380, {0x1c60e28, 0x40008152f0}, 0x400033a690)
        github.com/syself/cluster-api-provider-hetzner/controllers/hetznercluster_controller.go:198 +0x260
github.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerClusterReconciler).Reconcile(0x400038a380, {0x1c60e28, 0x40008150b0}, {{{0x400000f4e0?, 0x5?}, {0x400000f4d0?, 0x400076dd08?}}})
        github.com/syself/cluster-api-provider-hetzner/controllers/hetznercluster_controller.go:173 +0x734
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x1c656e8?, {0x1c60e28?, 0x40008150b0?}, {{{0x400000f4e0?, 0xb?}, {0x400000f4d0?, 0x0?}}})
        sigs.k8s.io/controller-runtime@<version>/pkg/internal/controller/controller.go:114 +0x80
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0x4000622000, {0x1c60e60, 0x400024a2d0}, {0x177e7c0, 0x40001c7000})
        sigs.k8s.io/controller-runtime@<version>/pkg/internal/controller/controller.go:311 +0x2d0
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0x4000622000, {0x1c60e60, 0x400024a2d0})
        sigs.k8s.io/controller-runtime@<version>/pkg/internal/controller/controller.go:261 +0x158
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
        sigs.k8s.io/controller-runtime@<version>/pkg/internal/controller/controller.go:222 +0x70
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 161
        sigs.k8s.io/controller-runtime@<version>/pkg/internal/controller/controller.go:218 +0x3b8

Dual-0 · May 07 '25 14:05

I think I got it, controlPlaneEndpoint is needed...

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: HetznerCluster
metadata:
  name: ${CLUSTER_NAME}
  namespace: default
spec:
  hcloudNetwork:
  controlPlaneEndpoint:
    host: ""
    port: 6443
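
A quick way to verify the workaround after re-applying the manifest (a sketch; the namespace and cluster name match the example above):

# The controller should stop crash-looping
kubectl -n caph-system get pods
# The endpoint should now be set on the HetznerCluster spec
kubectl -n default get hetznercluster my-cluster -o jsonpath='{.spec.controlPlaneEndpoint}'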

Dual-0 · May 07 '25 17:05

Hi @Dual-0, yes, it is. But it actually should not be required - or at least in other parts of the code we don't enforce it. We have to check how to resolve this inconsistency. I'm surprised that this has never happened before; apparently you are the first one who tried without specifying the controlPlaneEndpoint! ;)

EDIT: controlPlaneLoadBalancer -> controlPlaneEndpoint

janiskemper · May 07 '25 17:05

We also ran into this, which caused quite a bit of hair-pulling :o Well, quite a bit of hair was lost because of version incompatibilities and some silly typos on our end ;)

We're using ClusterClass, trying to provision a very simple test cluster with the default kubeadm provisioner, but that triggered the panic.

The workaround from @Dual-0 worked like a charm. Thank you! 🙏

Is a new release of CAPH coming in the (near) future? The latest k8s version supported by CAPH v1.0.6 is 1.31.x, which will be EOL in a month.

BartVB · Sep 24 '25 20:09

@BartVB

> Is a new release of CAPH coming in the (near) future? The latest k8s version supported by CAPH v1.0.6 is 1.31.x, which will be EOL in a month.

Yes, we have some big PRs which we will merge and release soon:

:seedling: Provision baremetal via --baremetal-image-url-command by guettli · Pull Request #1679 · syself/cluster-api-provider-hetzner

:seedling: Hcloud: Provision hcloud machines with custom command (instead of Snapshots) by guettli · Pull Request #1647 · syself/cluster-api-provider-hetzner

The PR to update the CAPI and controller-runtime versions has already been merged to main.

guettli · Sep 25 '25 06:09

I created a PR so that you see an error instead of a panic: :seedling: Avoid panic if hetznercluster.spec.controlPlaneEndpoint is not set by guettli · Pull Request #1684 · syself/cluster-api-provider-hetzner

@BartVB @Dual-0, you would now see this error instead:

hetznercluster.spec.controlPlaneEndpoint is not set

Does this help? Anything else which could be improved (in the context of this current issue)?

guettli · Sep 25 '25 07:09