cluster-api-provider-hetzner
CAPH Controller panics during HetznerCluster reconciliation
Description:
When attempting to provision a Kubernetes cluster on Hetzner Cloud using Cluster API (CAPI) with Cluster API Provider Hetzner (CAPH), the CAPH controller (caph-controller-manager) enters a CrashLoopBackOff state shortly after the HetznerCluster resource is applied.
The controller logs show a persistent panic (runtime error: invalid memory address or nil pointer dereference) originating in the load-balancer code (github.com/syself/cluster-api-provider-hetzner/pkg/services/hcloud/loadbalancer.createOptsFromSpec).
This panic prevents the controller from reconciling the HetznerCluster object and subsequently provisioning any infrastructure resources on Hetzner Cloud. The CAPH webhook service also becomes unavailable as a result, leading to webhook errors when applying/updating resources.
Steps to reproduce:
- Set up a local Kubernetes cluster (using Kind) to serve as the Cluster API management cluster.
- Install Cluster API core components.
- Install Cluster API Bootstrap Provider Talos (CABPT) and Cluster API Control Plane Provider Talos (CACPPT).
- Install Cluster API Provider Hetzner (CAPH) version [Insert CAPH Version Here, e.g., from Image Tag].
- Create a Kubernetes Secret in the management cluster with the Hetzner Cloud API token.
- Apply a manifest defining a CAPI Cluster and associated Hetzner/Talos infrastructure resources (see Configuration section below). This manifest includes a HetznerCluster resource with controlPlaneLoadBalancer.enabled: false and a placeholder controlPlaneEndpoint.
- Observe the logs of the caph-controller-manager pod in the caph-system namespace (a shell sketch of these steps follows below).
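For reference, a minimal shell sketch of the steps above. The cluster name, secret name and key are assumptions chosen to match the hetznerSecretRef in the manifest below; kind, clusterctl, kubectl and envsubst are assumed to be installed.

# Steps 1-4: management cluster plus CAPI core, Talos and Hetzner providers
kind create cluster --name caph-management
clusterctl init --bootstrap talos --control-plane talos --infrastructure hetzner

# Step 5: Hetzner Cloud API token secret (name and key must match hetznerSecretRef)
kubectl create secret generic hetzner --from-literal=hcloud=<your-hcloud-api-token>

# Step 6: render the variables from .env and apply the manifest
source .env
envsubst < hetzner-talos-cluster.yaml | kubectl apply -f -

# Step 7: follow the controller logs
kubectl -n caph-system logs deploy/caph-controller-manager -f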
Expected behavior:
The CAPH controller should successfully reconcile the HetznerCluster resource, interact with the Hetzner Cloud API to provision the necessary infrastructure (like the private network and placement groups), and proceed without panicking.
Actual behavior:
The caph-controller-manager pod crashes repeatedly and enters CrashLoopBackOff. Logs show a panic: runtime error: invalid memory address or nil pointer dereference traceback related to load balancer logic. The HetznerCluster resource remains in a state indicating provisioning issues, and no infrastructure is provisioned on Hetzner Cloud.
Environment:
.env
export CLUSTER_NAME="my-cluster" # Choose a unique name for your cluster
export HCLOUD_REGION="fsn1" # Your desired server region (e.g., fsn1)
export HCLOUD_NETWORK_ZONE="eu-central" # Network Zone (must be one of the supported values from the error)
export HCLOUD_SSH_KEY="cluster" # The name of the SSH key in your Hetzner project
export HCLOUD_CONTROL_PLANE_MACHINE_TYPE="cax11" # Server type for control planes
export HCLOUD_WORKER_MACHINE_TYPE="cax21" # Server type for workers
export HCLOUD_TALOS_IMAGE_ID="ce4c980550dd2ab1b17bbf2b08801c7eb59418eafe8f279833297925d67c7515" # The Image ID you found in Step 2a
export CONTROL_PLANE_MACHINE_COUNT=1 # Number of control plane nodes (recommended 3 for HA)
export WORKER_MACHINE_COUNT=1 # Initial number of worker nodes (can be 0 for autoscaling later)
export KUBERNETES_VERSION="v1.31.8" # The Kubernetes version
export TALOS_VERSION="v1.10" # The Talos version compatible with your K8s version
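"Step 2a", where the image ID is found, is not part of this report. As a hedged aside, one way to list snapshot IDs is the hcloud CLI (assuming it is installed and HCLOUD_TOKEN is exported); the exact lookup used for this Talos snapshot may differ:

# Lists snapshots in the project so the ID can be copied into HCLOUD_TALOS_IMAGE_ID
hcloud image list --type snapshot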
hetzner-talos-cluster.yaml
# gitops/infrastructure/hetzner-talos-cluster.yaml
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: ${CLUSTER_NAME}
  namespace: default # Or your target namespace
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
        - 10.244.0.0/16 # Default for many CNIs, adjust if needed
    services:
      cidrBlocks:
        - 10.96.0.0/12 # Default for many CNIs, adjust if needed
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
    kind: TalosControlPlane
    name: ${CLUSTER_NAME}-controlplane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: HetznerCluster
    name: ${CLUSTER_NAME}
---
# HetznerCluster resource - using v1beta1
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: HetznerCluster
metadata:
  name: ${CLUSTER_NAME}
  namespace: default
spec:
  hcloudNetwork:
    enabled: true
    networkZone: ${HCLOUD_NETWORK_ZONE} # e.g., "eu-central"
    # cidrBlock: 10.0.0.0/16 # Optional: specify if you want a specific private network CIDR
    # subnetCidrBlock: 10.0.0.0/24 # Optional: specify if you want a specific subnet CIDR
  controlPlaneLoadBalancer:
    enabled: true
    region: ${HCLOUD_REGION} # not in docs but needed
    type: lb11 # Specify the Load Balancer type (e.g., lb11, lb21)
  sshKeys:
    hcloud:
      - name: ${HCLOUD_SSH_KEY}
  hetznerSecretRef: # Reference the secret containing Hetzner token
    name: hetzner # Name of the secret in the management cluster (default namespace)
    key:
      hcloudToken: hcloud # Assuming your secret key is named 'hcloud'
  hcloudPlacementGroups:
    - name: ${CLUSTER_NAME}-controlplane
      type: spread
    - name: ${CLUSTER_NAME}-worker
      type: spread
  controlPlaneRegions:
    - "${HCLOUD_REGION}" # Use your defined region here, must be an array
---
# Template for Control Plane Machines' infrastructure (HCloud specific) - using v1beta1
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: HCloudMachineTemplate
metadata:
  name: ${CLUSTER_NAME}-controlplane
  namespace: default
spec:
  template:
    spec:
      type: ${HCLOUD_CONTROL_PLANE_MACHINE_TYPE}
      imageName: ${HCLOUD_TALOS_IMAGE_ID} # Use the Image/Snapshot ID
      placementGroupName: ${CLUSTER_NAME}-controlplane
---
# Template for Worker Machines' infrastructure (HCloud specific) - using v1beta1
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: HCloudMachineTemplate
metadata:
  name: ${CLUSTER_NAME}-worker
  namespace: default
spec:
  template:
    spec:
      type: ${HCLOUD_WORKER_MACHINE_TYPE}
      imageName: ${HCLOUD_TALOS_IMAGE_ID} # Use the Image/Snapshot ID
      placementGroupName: ${CLUSTER_NAME}-worker
---
# Talos bootstrap and control plane configuration for Control Plane nodes
# Use v1alpha3 for TalosControlPlane
apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: TalosControlPlane
metadata:
  name: ${CLUSTER_NAME}-controlplane
  namespace: default
spec:
  replicas: ${CONTROL_PLANE_MACHINE_COUNT}
  version: ${KUBERNETES_VERSION}
  infrastructureTemplate:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: HCloudMachineTemplate
    name: ${CLUSTER_NAME}-controlplane
  controlPlaneConfig:
    controlplane:
      generateType: controlplane
      talosVersion: ${TALOS_VERSION}
      strategicPatches:
        - |
          cluster:
            externalCloudProvider:
              enabled: true
        - |
          cluster:
            network:
              cni: null
---
# MachineDeployment for Worker nodes - using v1beta1
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: ${CLUSTER_NAME}-worker-pool
  namespace: default
spec:
  clusterName: ${CLUSTER_NAME}
  replicas: ${WORKER_MACHINE_COUNT}
  selector:
    matchLabels: null
  template:
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
          kind: TalosConfigTemplate
          name: ${CLUSTER_NAME}-worker
      clusterName: ${CLUSTER_NAME}
      version: ${KUBERNETES_VERSION}
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: HCloudMachineTemplate
        name: ${CLUSTER_NAME}-worker
---
# Talos bootstrap configuration template for Worker nodes
# Use v1alpha3 for TalosConfigTemplate
apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
kind: TalosConfigTemplate
metadata:
  name: ${CLUSTER_NAME}-worker
  namespace: default
spec:
  template:
    spec:
      # These fields were correct as per the TalosConfigTemplate schema
      generateType: worker
      talosVersion: ${TALOS_VERSION}
      strategicPatches:
        - |
          cluster:
            externalCloudProvider:
              enabled: true
        - |
          cluster:
            network:
              cni: null
---
Logs:
{"level":"INFO","time":"2025-05-07T14:29:15.090Z","file":"controllers/hcloudmachinetemplate_controller.go:92","message":"HCloudMachineTemplate is missing ownerRef to cluster or cluster does not exist default/my-cluster-controlplane","controller":"hcloudmachinetemplate","controllerGroup":"infrastructure.cluster.x-k8s.io","controllerKind":"HCloudMachineTemplate","HCloudMachineTemplate":{"name":"my-cluster-controlplane","namespace":"default"},"namespace":"default","name":"my-cluster-controlplane","reconcileID":"67aa9fa6-e7de-4148-8a32-2240d1ca9ab8","HCloudMachineTemplate":{"name":"my-cluster-controlplane","namespace":"default"}}
{"level":"INFO","time":"2025-05-07T14:29:15.172Z","file":"controllers/hcloudmachinetemplate_controller.go:92","message":"HCloudMachineTemplate is missing ownerRef to cluster or cluster does not exist default/my-cluster-controlplane","controller":"hcloudmachinetemplate","controllerGroup":"infrastructure.cluster.x-k8s.io","controllerKind":"HCloudMachineTemplate","HCloudMachineTemplate":{"name":"my-cluster-controlplane","namespace":"default"},"namespace":"default","name":"my-cluster-controlplane","reconcileID":"f13c8e01-b69c-4731-ac9b-8dccc4fdc35a","HCloudMachineTemplate":{"name":"my-cluster-controlplane","namespace":"default"}}
{"level":"INFO","time":"2025-05-07T14:29:15.211Z","file":"controller/controller.go:110","message":"Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference","controller":"hetznercluster","controllerGroup":"infrastructure.cluster.x-k8s.io","controllerKind":"HetznerCluster","HetznerCluster":{"name":"my-cluster","namespace":"default"},"namespace":"default","name":"my-cluster","reconcileID":"85eab0d5-cbf7-4fb8-bae3-6f967bab82f7"}
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x144e678]
goroutine 385 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
sigs.k8s.io/controller-runtime@<version>/pkg/internal/controller/controller.go:111 +0x19c
panic({0x16a3180?, 0x2bbdda0?})
runtime/panic.go:791 +0x124
github.com/syself/cluster-api-provider-hetzner/pkg/services/hcloud/loadbalancer.createOptsFromSpec(0x400061d508)
github.com/syself/cluster-api-provider-hetzner/pkg/services/hcloud/loadbalancer/loadbalancer.go:326 +0x1b8
github.com/syself/cluster-api-provider-hetzner/pkg/services/hcloud/loadbalancer.(*Service).createLoadBalancer(0x4000aa5828, {0x1c60e28, 0x40008152f0})
github.com/syself/cluster-api-provider-hetzner/pkg/services/hcloud/loadbalancer/loadbalancer.go:290 +0x3c
github.com/syself/cluster-api-provider-hetzner/pkg/services/hcloud/loadbalancer.(*Service).Reconcile(0x4000aa5828, {0x1c60e28, 0x40008152f0})
github.com/syself/cluster-api-provider-hetzner/pkg/services/hcloud/loadbalancer/loadbalancer.go:81 +0x1f0
github.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerClusterReconciler).reconcileNormal(0x400038a380, {0x1c60e28, 0x40008152f0}, 0x400033a690)
github.com/syself/cluster-api-provider-hetzner/controllers/hetznercluster_controller.go:198 +0x260
github.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerClusterReconciler).Reconcile(0x400038a380, {0x1c60e28, 0x40008150b0}, {{{0x400000f4e0?, 0x5?}, {0x400000f4d0?, 0x400076dd08?}}})
github.com/syself/cluster-api-provider-hetzner/controllers/hetznercluster_controller.go:173 +0x734
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x1c656e8?, {0x1c60e28?, 0x40008150b0?}, {{{0x400000f4e0?, 0xb?}, {0x400000f4d0?, 0x0?}}})
sigs.k8s.io/controller-runtime@<version>/pkg/internal/controller/controller.go:114 +0x80
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0x4000622000, {0x1c60e60, 0x400024a2d0}, {0x177e7c0, 0x40001c7000})
sigs.k8s.io/controller-runtime@<version>/pkg/internal/controller/controller.go:311 +0x2d0
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0x4000622000, {0x1c60e60, 0x400024a2d0})
sigs.k8s.io/controller-runtime@<version>/pkg/internal/controller/controller.go:261 +0x158
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
sigs.k8s.io/controller-runtime@<version>/pkg/internal/controller/controller.go:222 +0x70
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 161
sigs.k8s.io/controller-runtime@<version>/pkg/internal/controller/controller.go:218 +0x3b8
I think I got it, controlPlaneEndpoint is needed...
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: HetznerCluster
metadata:
  name: ${CLUSTER_NAME}
  namespace: default
spec:
  hcloudNetwork:
    # ...unchanged from the manifest above
  controlPlaneEndpoint:
    host: ""
    port: 6443
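For completeness, a hypothetical way to confirm the workaround took effect (standard kubectl commands, not taken from the original thread):

kubectl -n caph-system get pods                     # controller should leave CrashLoopBackOff
kubectl get hetznercluster ${CLUSTER_NAME} -o wide  # resource should start reconciling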
Hi @Dual-0, yes it is. But it actually should not be required; at least in other parts of the code we don't enforce that. We have to check how to solve this inconsistency. I'm surprised that this has never happened before; apparently you are the first one who tried without specifying the controlPlaneEndpoint! ;)
EDIT: controlPlaneLoadBalancer -> controlPlaneEndpoint
We also ran into this, which caused quite a bit of hair pulling :o Well, quite a bit of hair was lost because of version incompatibilities and some silly typos on our end ;)
We're using ClusterClass, trying to provision a very simple test cluster with the default kubeadm provisioner, but that triggered the same panic.
The workaround from @Dual-0 worked like a charm. Thank you! 🙏
Is a new release of CAPH in the (near) future? The last k8s version that's supported by CAPH v1.0.6 is 1.31.x which will be EOL in a month.
@BartVB
Is a new release of CAPH in the (near) future? The last k8s version that's supported by CAPH v1.0.6 is 1.31.x which will be EOL in a month.
Yes, we have some big PRs that we will merge and release soon. The PR to update the CAPI and controller-runtime versions is already merged to main.
I created a PR so that you see an error instead of a panic: ":seedling: Avoid panic if hetznercluster.spec.controlPlaneEndpoint is not set" by guettli (syself/cluster-api-provider-hetzner#1684)
@BartVB @Dual-0 now you would see that error:
hetznercluster.spec.controlPlaneEndpoint is not set
Does this help? Anything else which could be improved (in the context of this current issue)?