sidero
ClusterAPI Machine stuck in "Pending" indefinitely
After a full reinstallation of Sidero Metal with CAPI etc., I have the problem that my machine doesn't boot into Talos.
The first PXE boot worked perfectly, discovery worked, and the BMC entry is present on the server (and works, tested with ipmitool). However, now that I have applied the cluster manifests, my machine doesn't boot (neither when I boot it manually nor via IPMI). It seems that Sidero doesn't use the available server (see output below) since it's not allocated.
Any ideas?
$ kubectl get machine
NAME                 CLUSTER     NODENAME   PROVIDERID   PHASE     AGE     VERSION
cluster-0-cp-t6zhc   cluster-0                           Pending   6m38s   v1.28.1
$ kubectl get server
NAME                                   HOSTNAME   ACCEPTED   CORDONED   ALLOCATED   CLEAN   POWER   AGE
00000000-0000-0000-0000-000000000000   (none)     true                              true    off     21m
$ kubectl get serverclass
NAME   AVAILABLE                                  IN USE   AGE
any    ["00000000-0000-0000-0000-000000000000"]   []       30m
$ kubectl get serverbindings
No resources found
I also found this log line:
2023-10-16T09:41:41Z INFO controllers.MetalMachine.machine=cluster-0-cp-t6zhc.cluster=cluster-0 Bootstrap secret is not available yet {"metalmachine": {"name":"cluster-0-cp-95bmt","namespace":"default"}}
There's no way we can guess this. As always in CAPI, it makes sense to inspect all states of all resources.
clusterctl describe cluster provides a nice overview.
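A minimal sketch of what "inspect all states of all resources" could look like for this setup. The function name `capi_overview` is hypothetical, the default cluster name `cluster-0` is taken from the output above, and the resource list assumes the Sidero, CABPT and CACPPT CRDs are installed under their usual plural names:

```shell
# List the CAPI and Sidero resources in one pass, then print the condition tree.
# capi_overview is a hypothetical helper; pass the cluster name as $1.
capi_overview() {
  cluster="${1:-cluster-0}"
  # Namespaced and cluster-scoped kinds can be listed together with --all-namespaces.
  kubectl get clusters,machines,metalmachines,servers,serverclasses,serverbindings,talosconfigs,taloscontrolplanes \
    --all-namespaces
  # Condensed readiness tree for the whole cluster.
  clusterctl describe cluster "$cluster"
}

# Usage (needs access to the management cluster):
#   capi_overview cluster-0
```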
cabpt-controller-manager-5687d76d6f-55xg6 manager 2023-10-16T10:36:15Z INFO Starting Controller {"controller": "talosconfig", "controllerGroup": "bootstrap.cluster.x-k8s.io", "controllerKind": "TalosConfig"}
cabpt-controller-manager-5687d76d6f-55xg6 manager W1016 10:36:15.626507 1 reflector.go:533] /.cache/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: failed to list *v1beta1.MachinePool: machinepools.cluster.x-k8s.io is forbidden: User "system:serviceaccount:cabpt-system:default" cannot list resource "machinepools" in API group "cluster.x-k8s.io" at the cluster scope
After looking through the logs, this looks like it might be an RBAC issue.
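One way to confirm an RBAC suspicion like this is to ask the API server directly, impersonating the service account named in the warning. This is only a sketch: `can_sa` is a hypothetical helper, its defaults are copied from the log line, and running it requires impersonation rights on the management cluster.

```shell
# Check whether a service account may perform a verb on a resource cluster-wide.
# can_sa is a hypothetical helper; namespace and account default to the
# values from the "forbidden" log line.
can_sa() {
  verb="$1"
  resource="$2"
  namespace="${3:-cabpt-system}"
  account="${4:-default}"
  kubectl auth can-i "$verb" "$resource" --all-namespaces \
    --as="system:serviceaccount:${namespace}:${account}"
}

# Usage:
#   can_sa list machinepools.cluster.x-k8s.io
#   can_sa watch machinepools.cluster.x-k8s.io
```

If this prints "no" for `list`, the reflector error is expected for as long as the controller tries to watch MachinePools.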
clusterctl describe cluster cluster-0 outputs the following:
NAME                                                          READY  SEVERITY  REASON                          SINCE  MESSAGE
Cluster/cluster-0                                             False  Error     BootstrapTemplateCloningFailed  16m    Failed to create bootstrap configuration: Internal error occurred: failed calling webhook "vtaloscon ...
├─ClusterInfrastructure - MetalCluster/cluster-0
└─ControlPlane - TalosControlPlane/cluster-0-cp               False  Error     BootstrapTemplateCloningFailed  16m    Failed to create bootstrap configuration: Internal error occurred: failed calling webhook "vtaloscon ...
  └─Machine/cluster-0-cp-szn6c                                False  Info      WaitingForInfrastructure        15m    0 of 2 completed
    ├─BootstrapConfig - TalosConfig/cluster-0-cp-f5jj9
    └─MachineInfrastructure - MetalMachine/cluster-0-cp-r8df7
Looks like the failure is in calling a webhook; the MachinePools error is probably a different issue.
Either way, as it works in the Sidero integration tests, something is up with your setup (?)
Hmm, it's a fresh setup, and what I believe to be the same setup worked with 0.5.8 and 0.6.0 (with the firewall configured to block ports 67 and 68) 🤔 I'm running clusterctl version 1.5.2 and Kubernetes version 1.27.6.
I was just looking at the cabpt-controller-manager because it appears to crash every few minutes
E1016 11:00:17.100532 1 reflector.go:148] /.cache/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: Failed to watch *v1beta1.MachinePool: failed to list *v1beta1.MachinePool: machinepools.cluster.x-k8s.io is forbidden: User "system:serviceaccount:cabpt-system:default" cannot list resource "machinepools" in API group "cluster.x-k8s.io" at the cluster scope
2023-10-16T11:01:08Z ERROR Could not wait for Cache to sync {"controller": "talosconfig", "controllerGroup": "bootstrap.cluster.x-k8s.io", "controllerKind": "TalosConfig", "error": "failed to wait for talosconfig caches to sync: timed out waiting for cache to be synced for Kind *v1alpha3.TalosConfig"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.1
/.cache/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:202
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
/.cache/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:207
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
/.cache/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:233
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
/.cache/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:219
Info: I added this rule to the cabpt-manager-role ClusterRole, and now it appears to work:
rules:
- apiGroups:
  - '*'
  resources:
  - '*'
  verbs:
  - '*'
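The wildcard rule grants the controller access to everything in the cluster. A narrower alternative might be to append only the permission the log complains about. This is a sketch under the assumption that read access to `machinepools` in the `cluster.x-k8s.io` group is all that is missing (the thread does not confirm this); `grant_machinepool_read` is a hypothetical helper, and a provider upgrade may overwrite the patched role.

```shell
# Append a read-only machinepools rule to the ClusterRole using a JSON patch.
# grant_machinepool_read is a hypothetical helper; the role name defaults to
# the one mentioned above.
grant_machinepool_read() {
  role="${1:-cabpt-manager-role}"
  kubectl patch clusterrole "$role" --type=json -p '[
    {"op": "add", "path": "/rules/-", "value": {
      "apiGroups": ["cluster.x-k8s.io"],
      "resources": ["machinepools"],
      "verbs": ["get", "list", "watch"]
    }}
  ]'
}

# Usage:
#   grant_machinepool_read cabpt-manager-role
```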
To me it seems like an RBAC issue, though it's not yet clear why. One possibility is that Cluster API changed the apiGroups for some resources, but I guess that would have been noted as a breaking change.
Just from looking at kubectl api-resources, though, MachinePools is served from cluster.x-k8s.io/v1beta1, but the ClusterRole says exp.cluster.x-k8s.io 🤔
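The mismatch can be put side by side with something like the following. It is a sketch: `compare_machinepool_groups` is a hypothetical helper, and it only prints what the cluster reports without interpreting it.

```shell
# Show which API group serves machinepools and which group the ClusterRole grants.
# compare_machinepool_groups is a hypothetical helper.
compare_machinepool_groups() {
  role="${1:-cabpt-manager-role}"
  echo "API resources named machinepools:"
  # "-o name" prints entries as <resource>.<group>
  kubectl api-resources -o name | grep '^machinepools\.' || echo "(none)"
  echo "Rules in ClusterRole ${role} that mention machinepools:"
  kubectl get clusterrole "$role" \
    -o jsonpath='{range .rules[*]}{.apiGroups}{" "}{.resources}{"\n"}{end}' \
    | grep machinepools || echo "(none)"
}

# Usage:
#   compare_machinepool_groups cabpt-manager-role
```

If the first list shows `machinepools.cluster.x-k8s.io` while the rule only names `exp.cluster.x-k8s.io`, the role does not cover the group the controller is listing.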
Note: I also have the ClusterAPI Provider for Azure installed, maybe there is a conflict in apiGroups? 🤔