
ClusterAPI Machine stuck in "Pending" indefinitely

Open lieberlois opened this issue 2 years ago • 5 comments

After a full reinstallation of Sidero Metal with CAPI etc., I have the problem that my machine doesn't boot into Talos.

The first PXE boot worked perfectly, discovery etc. worked, and the BMC entry is present in the server (and works, tested with ipmitool). However, now that I have applied the cluster manifests, my machine doesn't boot (neither when I boot it manually nor via IPMI). It seems like Sidero isn't using the (available) server (see output below), since it's not allocated.

Any ideas?

$ kubectl get machine
NAME                 CLUSTER     NODENAME   PROVIDERID   PHASE     AGE     VERSION
cluster-0-cp-t6zhc   cluster-0                           Pending   6m38s   v1.28.1

$ kubectl get server
NAME                                   HOSTNAME   ACCEPTED   CORDONED   ALLOCATED   CLEAN   POWER   AGE
00000000-0000-0000-0000-000000000000   (none)     true                              true    off     21m

$ kubectl get serverclass
NAME   AVAILABLE                                  IN USE   AGE
any    ["00000000-0000-0000-0000-000000000000"]   []       30m

$ kubectl get serverbindings
No resources found

I also found this log line:

2023-10-16T09:41:41Z    INFO    controllers.MetalMachine.machine=cluster-0-cp-t6zhc.cluster=cluster-0   Bootstrap secret is not available yet   {"metalmachine": {"name":"cluster-0-cp-95bmt","namespace":"default"}}

lieberlois avatar Oct 16 '23 09:10 lieberlois

There's no way we can guess this. As always in CAPI, it makes sense to inspect all states of all resources.

clusterctl status (or something like that) provides a nice overview

smira avatar Oct 16 '23 10:10 smira

cabpt-controller-manager-5687d76d6f-55xg6 manager 2023-10-16T10:36:15Z  INFO    Starting Controller     {"controller": "talosconfig", "controllerGroup": "bootstrap.cluster.x-k8s.io", "controllerKind": "TalosConfig"}
cabpt-controller-manager-5687d76d6f-55xg6 manager W1016 10:36:15.626507       1 reflector.go:533] /.cache/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: failed to list *v1beta1.MachinePool: machinepools.cluster.x-k8s.io is forbidden: User "system:serviceaccount:cabpt-system:default" cannot list resource "machinepools" in API group "cluster.x-k8s.io" at the cluster scope

After looking through the logs, this seems like it might be an RBAC issue?

clusterctl describe cluster cluster-0 outputs the following:

NAME                                                           READY  SEVERITY  REASON                          SINCE  MESSAGE                                                                                                  
Cluster/cluster-0                                              False  Error     BootstrapTemplateCloningFailed  16m    Failed to create bootstrap configuration: Internal error occurred: failed calling webhook "vtaloscon ...  
├─ClusterInfrastructure - MetalCluster/cluster-0                                                                                                                                                                                 
└─ControlPlane - TalosControlPlane/cluster-0-cp                False  Error     BootstrapTemplateCloningFailed  16m    Failed to create bootstrap configuration: Internal error occurred: failed calling webhook "vtaloscon ...  
  └─Machine/cluster-0-cp-szn6c                                 False  Info      WaitingForInfrastructure        15m    0 of 2 completed                                                                                          
    ├─BootstrapConfig - TalosConfig/cluster-0-cp-f5jj9                                                                                                                                                                           
    └─MachineInfrastructure - MetalMachine/cluster-0-cp-r8df7 

lieberlois avatar Oct 16 '23 10:10 lieberlois

Looks like it's the failure to call a webhook; the MachinePools error is probably a different issue.

Either way, as it works in the Sidero integration tests, something is up with your setup (?)

smira avatar Oct 16 '23 10:10 smira

Mhm, it's a fresh setup, and (what I think is) the same setup worked with 0.5.8 and 0.6.0 (with the firewall configured to block ports 67 and 68) 🤔 I'm running clusterctl version 1.5.2 and K8s version 1.27.6.

I was just looking at the cabpt-controller-manager because it appears to crash every few minutes:

E1016 11:00:17.100532       1 reflector.go:148] /.cache/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: Failed to watch *v1beta1.MachinePool: failed to list *v1beta1.MachinePool: machinepools.cluster.x-k8s.io is forbidden: User "system:serviceaccount:cabpt-system:default" cannot list resource "machinepools" in API group "cluster.x-k8s.io" at the cluster scope
2023-10-16T11:01:08Z    ERROR   Could not wait for Cache to sync        {"controller": "talosconfig", "controllerGroup": "bootstrap.cluster.x-k8s.io", "controllerKind": "TalosConfig", "error": "failed to wait for talosconfig caches to sync: timed out waiting for cache to be synced for Kind *v1alpha3.TalosConfig"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.1
        /.cache/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:202
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
        /.cache/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:207
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
        /.cache/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:233
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
        /.cache/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:219

lieberlois avatar Oct 16 '23 11:10 lieberlois

Info: I added this rule to the cabpt-manager-role ClusterRole, and now it appears to work:

rules:
- apiGroups:
  - '*'
  resources:
  - '*'
  verbs:
  - '*'

To me it seems like an RBAC issue, though it's not yet clear why. One possibility might be that ClusterAPI changed the apiGroups for some resources, but I guess that would be noted as a breaking change.

Just from looking at kubectl api-resources though, MachinePools is from cluster.x-k8s.io/v1beta1, but the ClusterRole says exp.cluster.x-k8s.io 🤔
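
For comparison with the wildcard rule above, a narrower rule covering only what the log complains about might look like the following. This is a sketch, not a confirmed fix: the API group and resource are taken from the "forbidden" message, and the verbs get, list and watch are an assumption about what the controller's cache needs.

```yaml
# Hypothetical extra entry for the rules of the cabpt-manager-role ClusterRole.
# apiGroup and resource come from the "forbidden" log message;
# the verbs are an assumption (list and watch appear in the log, get is added for completeness).
rules:
- apiGroups:
  - cluster.x-k8s.io
  resources:
  - machinepools
  verbs:
  - get
  - list
  - watch
```

Unlike the '*' rule, this would leave the rest of the role untouched, so if the controller still fails with it, the remaining errors would point at other missing permissions.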

Note: I also have the ClusterAPI Provider for Azure installed, maybe there is a conflict in apiGroups? 🤔

lieberlois avatar Oct 16 '23 11:10 lieberlois