cluster-api-provider-proxmox

Worker never starts

Open rhjensen79 opened this issue 1 year ago • 8 comments

/kind bug

What steps did you take and what happened: When I apply cappx-test.yaml, two VMs get created and resized. The master starts and seems to run all the way through. The worker never starts, so the deployment never finishes.

When I describe the cluster, it says provisioned, so I'm guessing it thinks everything is OK.

What did you expect to happen: I expect a full cluster, with 1 master and 1 worker, being provisioned.

Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]

Environment:

I'm using the latest version, 0.3.5.

  • Cluster-api-provider-proxmox version:
    clusterctl version: &version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.1", GitCommit:"ef04465b2ba76214eea570e27e8146c96412e32a", GitTreeState:"clean", BuildDate:"2024-04-23T17:05:53Z", GoVersion:"go1.21.9", Compiler:"gc", Platform:"darwin/arm64"}

  • Proxmox VE version: pve-manager/8.2.2/9355359cd7afbae4

  • Kubernetes version: (use kubectl version): Client Version: v1.30.0 Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3 Server Version: v1.29.2

Logs From cappx-controller-manager

I0426 09:08:46.032786       1 qemu.go:21] "Reconciling QEMU" controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cappx-test-wj9zl" namespace="default" name="cappx-test-wj9zl" reconcileID=d07991f6-3318-4eb0-b6e7-e689bee83fe8
I0426 09:08:46.032794       1 qemu.go:58] "getting qemu from vmid" controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cappx-test-wj9zl" namespace="default" name="cappx-test-wj9zl" reconcileID=d07991f6-3318-4eb0-b6e7-e689bee83fe8
I0426 09:08:46.032883       1 scheduler.go:173] "Start Running Scheduler" Name="qemu-scheduler"
I0426 09:08:46.032912       1 scheduler.go:196] "getting next qemu from scheduling queue" Name="qemu-scheduler"
E0426 09:08:49.143170       1 qemu.go:26] "failed to get qemu" err="401 - 401 permission denied - invalid PVE ticket - " controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cappx-test-wj9zl" namespace="default" name="cappx-test-wj9zl" reconcileID=d07991f6-3318-4eb0-b6e7-e689bee83fe8
E0426 09:08:49.143232       1 reconcile.go:27] "failed to create/get instance" err="401 - 401 permission denied - invalid PVE ticket - " controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cappx-test-wj9zl" namespace="default" name="cappx-test-wj9zl" reconcileID=d07991f6-3318-4eb0-b6e7-e689bee83fe8
E0426 09:08:49.143251       1 proxmoxmachine_controller.go:160] "Reconcile error" err="401 - 401 permission denied - invalid PVE ticket - " controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cappx-test-wj9zl" namespace="default" name="cappx-test-wj9zl" reconcileID=d07991f6-3318-4eb0-b6e7-e689bee83fe8
E0426 09:08:49.144324       1 controller.go:324] "Reconciler error" err="401 - 401 permission denied - invalid PVE ticket - " controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cappx-test-wj9zl" namespace="default" name="cappx-test-wj9zl" reconcileID=d07991f6-3318-4eb0-b6e7-e689bee83fe8
I0426 09:09:36.364952       1 scheduler.go:175] "Stop Running Scheduler" Name="qemu-scheduler"
I0426 09:09:46.032184       1 scheduler.go:175] "Stop Running Scheduler" Name="qemu-scheduler"
I0426 09:14:07.077886       1 proxmoxmachine_controller.go:144] "Reconciling ProxmoxMachine" controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cappx-test-md-0-fwh9p-49pl4" namespace="default" name="cappx-test-md-0-fwh9p-49pl4" reconcileID=3bc683b1-d9c9-4036-801f-4d9e79a5d3c3
I0426 09:14:10.107725       1 reconcile.go:24] "Reconciling instance" controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cappx-test-md-0-fwh9p-49pl4" namespace="default" name="cappx-test-md-0-fwh9p-49pl4" reconcileID=3bc683b1-d9c9-4036-801f-4d9e79a5d3c3
I0426 09:14:10.108082       1 reconcile.go:105] "instance does not have providerID yet" controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cappx-test-md-0-fwh9p-49pl4" namespace="default" name="cappx-test-md-0-fwh9p-49pl4" reconcileID=3bc683b1-d9c9-4036-801f-4d9e79a5d3c3
I0426 09:14:10.108105       1 reconcile.go:89] "instance wasn't found. new instance will be created" controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cappx-test-md-0-fwh9p-49pl4" namespace="default" name="cappx-test-md-0-fwh9p-49pl4" reconcileID=3bc683b1-d9c9-4036-801f-4d9e79a5d3c3
I0426 09:14:10.108147       1 qemu.go:21] "Reconciling QEMU" controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cappx-test-md-0-fwh9p-49pl4" namespace="default" name="cappx-test-md-0-fwh9p-49pl4" reconcileID=3bc683b1-d9c9-4036-801f-4d9e79a5d3c3
I0426 09:14:10.108184       1 qemu.go:58] "getting qemu from vmid" controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cappx-test-md-0-fwh9p-49pl4" namespace="default" name="cappx-test-md-0-fwh9p-49pl4" reconcileID=3bc683b1-d9c9-4036-801f-4d9e79a5d3c3
I0426 09:14:10.108240       1 scheduler.go:173] "Start Running Scheduler" Name="qemu-scheduler"
I0426 09:14:10.108362       1 scheduler.go:196] "getting next qemu from scheduling queue" Name="qemu-scheduler"
E0426 09:14:13.132285       1 qemu.go:26] "failed to get qemu" err="401 - 401 permission denied - invalid PVE ticket - " controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cappx-test-md-0-fwh9p-49pl4" namespace="default" name="cappx-test-md-0-fwh9p-49pl4" reconcileID=3bc683b1-d9c9-4036-801f-4d9e79a5d3c3
E0426 09:14:13.132332       1 reconcile.go:27] "failed to create/get instance" err="401 - 401 permission denied - invalid PVE ticket - " controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cappx-test-md-0-fwh9p-49pl4" namespace="default" name="cappx-test-md-0-fwh9p-49pl4" reconcileID=3bc683b1-d9c9-4036-801f-4d9e79a5d3c3
E0426 09:14:13.132348       1 proxmoxmachine_controller.go:160] "Reconcile error" err="401 - 401 permission denied - invalid PVE ticket - " controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cappx-test-md-0-fwh9p-49pl4" namespace="default" name="cappx-test-md-0-fwh9p-49pl4" reconcileID=3bc683b1-d9c9-4036-801f-4d9e79a5d3c3
E0426 09:14:13.134523       1 controller.go:324] "Reconciler error" err="401 - 401 permission denied - invalid PVE ticket - " controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cappx-test-md-0-fwh9p-49pl4" namespace="default" name="cappx-test-md-0-fwh9p-49pl4" reconcileID=3bc683b1-d9c9-4036-801f-4d9e79a5d3c3
I0426 09:14:16.832442       1 proxmoxmachine_controller.go:144] "Reconciling ProxmoxMachine" controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cappx-test-wj9zl" namespace="default" name="cappx-test-wj9zl" reconcileID=8022b2d7-4f4e-41bc-9085-33c4b233bcd3
I0426 09:14:19.891301       1 reconcile.go:24] "Reconciling instance" controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cappx-test-wj9zl" namespace="default" name="cappx-test-wj9zl" reconcileID=8022b2d7-4f4e-41bc-9085-33c4b233bcd3
I0426 09:14:19.891383       1 reconcile.go:105] "instance does not have providerID yet" controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cappx-test-wj9zl" namespace="default" name="cappx-test-wj9zl" reconcileID=8022b2d7-4f4e-41bc-9085-33c4b233bcd3
I0426 09:14:19.891412       1 reconcile.go:89] "instance wasn't found. new instance will be created" controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cappx-test-wj9zl" namespace="default" name="cappx-test-wj9zl" reconcileID=8022b2d7-4f4e-41bc-9085-33c4b233bcd3
I0426 09:14:19.891435       1 qemu.go:21] "Reconciling QEMU" controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cappx-test-wj9zl" namespace="default" name="cappx-test-wj9zl" reconcileID=8022b2d7-4f4e-41bc-9085-33c4b233bcd3
I0426 09:14:19.891470       1 qemu.go:58] "getting qemu from vmid" controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cappx-test-wj9zl" namespace="default" name="cappx-test-wj9zl" reconcileID=8022b2d7-4f4e-41bc-9085-33c4b233bcd3
I0426 09:14:19.891862       1 scheduler.go:173] "Start Running Scheduler" Name="qemu-scheduler"
I0426 09:14:19.891984       1 scheduler.go:196] "getting next qemu from scheduling queue" Name="qemu-scheduler"
E0426 09:14:22.915171       1 qemu.go:26] "failed to get qemu" err="401 - 401 permission denied - invalid PVE ticket - " controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cappx-test-wj9zl" namespace="default" name="cappx-test-wj9zl" reconcileID=8022b2d7-4f4e-41bc-9085-33c4b233bcd3
E0426 09:14:22.915244       1 reconcile.go:27] "failed to create/get instance" err="401 - 401 permission denied - invalid PVE ticket - " controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cappx-test-wj9zl" namespace="default" name="cappx-test-wj9zl" reconcileID=8022b2d7-4f4e-41bc-9085-33c4b233bcd3
E0426 09:14:22.915348       1 proxmoxmachine_controller.go:160] "Reconcile error" err="401 - 401 permission denied - invalid PVE ticket - " controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cappx-test-wj9zl" namespace="default" name="cappx-test-wj9zl" reconcileID=8022b2d7-4f4e-41bc-9085-33c4b233bcd3
E0426 09:14:22.916430       1 controller.go:324] "Reconciler error" err="401 - 401 permission denied - invalid PVE ticket - " controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cappx-test-wj9zl" namespace="default" name="cappx-test-wj9zl" reconcileID=8022b2d7-4f4e-41bc-9085-33c4b233bcd3
I0426 09:15:10.107757       1 scheduler.go:175] "Stop Running Scheduler" Name="qemu-scheduler"
I0426 09:15:19.894443       1 scheduler.go:175] "Stop Running Scheduler" Name="qemu-scheduler"
I0426 09:25:08.500440       1 proxmoxmachine_controller.go:144] "Reconciling ProxmoxMachine" controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cappx-test-md-0-fwh9p-49pl4" namespace="default" name="cappx-test-md-0-fwh9p-49pl4" reconcileID=0a4c7924-da5e-4521-9275-837c686c795a
I0426 09:25:11.531027       1 reconcile.go:24] "Reconciling instance" controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cappx-test-md-0-fwh9p-49pl4" namespace="default" name="cappx-test-md-0-fwh9p-49pl4" reconcileID=0a4c7924-da5e-4521-9275-837c686c795a
I0426 09:25:11.531104       1 reconcile.go:105] "instance does not have providerID yet" controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cappx-test-md-0-fwh9p-49pl4" namespace="default" name="cappx-test-md-0-fwh9p-49pl4" reconcileID=0a4c7924-da5e-4521-9275-837c686c795a
I0426 09:25:11.531130       1 reconcile.go:89] "instance wasn't found. new instance will be created" controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cappx-test-md-0-fwh9p-49pl4" namespace="default" name="cappx-test-md-0-fwh9p-49pl4" reconcileID=0a4c7924-da5e-4521-9275-837c686c795a
I0426 09:25:11.531167       1 qemu.go:21] "Reconciling QEMU" controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cappx-test-md-0-fwh9p-49pl4" namespace="default" name="cappx-test-md-0-fwh9p-49pl4" reconcileID=0a4c7924-da5e-4521-9275-837c686c795a
I0426 09:25:11.531230       1 qemu.go:58] "getting qemu from vmid" controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cappx-test-md-0-fwh9p-49pl4" namespace="default" name="cappx-test-md-0-fwh9p-49pl4" reconcileID=0a4c7924-da5e-4521-9275-837c686c795a
I0426 09:25:11.531454       1 scheduler.go:173] "Start Running Scheduler" Name="qemu-scheduler"
I0426 09:25:11.531528       1 scheduler.go:196] "getting next qemu from scheduling queue" Name="qemu-scheduler"
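
With a log this repetitive, it can help to collapse the error lines into one count per machine and error string. The following is a minimal sketch for that, assuming the klog line layout seen above (severity letter, MMDD, time, PID, file:line, quoted message, then key="value" pairs); the function name summarize_errors is made up for illustration and is not part of cappx.

```python
import re
from collections import Counter

# klog header: optional container prefix, severity letter + MMDD, time, PID,
# source file:line, then the quoted message and trailing key="value" pairs.
LINE_RE = re.compile(
    r'^(?:\w+\s+)?(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>[\d:.]+)\s+\d+ '
    r'(?P<src>[\w.]+:\d+)\] "(?P<msg>[^"]*)"(?P<rest>.*)$'
)
KV_RE = re.compile(r'(\w+)="([^"]*)"')


def summarize_errors(lines):
    """Count error-severity lines per (ProxmoxMachine, err) pair."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group("sev") != "E":
            continue  # skip info lines and anything that is not klog-shaped
        fields = dict(KV_RE.findall(m.group("rest")))
        key = (fields.get("ProxmoxMachine", "?"), fields.get("err", "?"))
        counts[key] += 1
    return counts
```

Feeding it the saved controller log (for example, the lines of a file written from the pod's log output) gives a quick view of which machines are hitting the 401 and how often.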

rhjensen79 avatar Apr 26 '24 09:04 rhjensen79

Your error says "401 - 401 permission denied - invalid PVE ticket -". This error comes from the Proxmox API, so could you check whether you gave the correct permissions to your Proxmox user?

sp-yduck avatar Apr 26 '24 09:04 sp-yduck

your error says 401 - 401 permission denied - invalid PVE ticket - this error comes from proxmox api. so could you check if you gave correct permissions to your proxmox user ?

I'm running this with my root account root@pam

I can see that proxmox_secret and proxmox_tokenid are empty in the cappx-test.yaml file. Are they required? Note that I have tried filling them in with an ID and token that should have enough permissions, but I can't find anywhere whether a username and password are enough.

Do you know?

rhjensen79 avatar Apr 26 '24 10:04 rhjensen79

I can see in the that the proxmox_secret and proxmox_tokenid is empty in the cappx-test.yaml file. Are they required ?

No, they are not required. Previously we were trying to support secret/tokenid, but right now it's not supported due to a limitation on the Proxmox side.

I'm running this with my root account root@pam

Hmm, that's interesting. I've never seen this kind of error before. Could you try restarting the cappx controller pod and see how it goes? cappx rotates the token (which is used for API auth) automatically, so there might be something wrong, or a bug, in that rotation logic.
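
To illustrate the kind of rotation logic being discussed: a client that authenticates with a ticket can recover from an expired or invalid one by logging in again and retrying the call once. The sketch below is a generic illustration only, not cappx's actual code; TicketSession, Unauthorized, and the login/request callables are all hypothetical names.

```python
class Unauthorized(Exception):
    """Raised by the (hypothetical) API call when the server answers 401."""


class TicketSession:
    """Caches a ticket and refreshes it once when a call is rejected."""

    def __init__(self, login):
        self._login = login  # callable that returns a fresh ticket string
        self._ticket = None

    def call(self, request):
        """Run request(ticket); on a 401, log in again and retry exactly once."""
        if self._ticket is None:
            self._ticket = self._login()
        try:
            return request(self._ticket)
        except Unauthorized:
            # The cached ticket was rejected: fetch a new one and retry.
            # A second Unauthorized propagates to the caller.
            self._ticket = self._login()
            return request(self._ticket)
```

If a stale ticket is cached and never refreshed on a 401, every reconcile would keep failing until the process restarts, which would be consistent with a pod restart helping. That is a guess about the failure mode, not a confirmed diagnosis.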

sp-yduck avatar Apr 26 '24 10:04 sp-yduck

I'm running it in a kind cluster and have redeployed it several times. I will start a new deployment, restart the cappx controller pod when it gets stuck, and then update here.

rhjensen79 avatar Apr 26 '24 10:04 rhjensen79

A restart kicked off the worker VM. It seems it is not going all the way through, but I have to test more on Sunday/Monday. Restarting the pod did trigger something, though. @sp-yduck

rhjensen79 avatar Apr 26 '24 11:04 rhjensen79

I am having the same issue.

After restarting the controller, it is still looking for the machine.

simcax avatar Apr 27 '24 08:04 simcax

@simcax are you seeing the same 401 error in the log?

sp-yduck avatar May 01 '24 00:05 sp-yduck

@sp-yduck I don't see a 401 error. What I see in the capi-kubeadm-control-plane pod is that the machine does not have a "corresponding node yet":

I0510 20:54:53.481369       1 controller.go:432] "Scaling up control plane" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/cluster07" namespace="default" name="cluster07" reconcileID="5cb4c3d0-400f-469b-8081-07ddc932db59" Cluster="default/cluster07" Desired=3 Existing=1
I0510 20:54:53.481413       1 scale.go:204] "Waiting for control plane to pass preflight checks" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/cluster07" namespace="default" name="cluster07" reconcileID="5cb4c3d0-400f-469b-8081-07ddc932db59" Cluster="default/cluster07" failures="Machine cluster07-tkkcj does not have a corresponding Node yet (Machine.status.nodeRef not set)"
I0510 20:54:53.481589       1 recorder.go:104] "Waiting for control plane to pass preflight checks to continue reconciliation: Machine cluster07-tkkcj does not have a corresponding Node yet (Machine.status.nodeRef not set)" logger="events" type="Warning" object={"kind":"KubeadmControlPlane","namespace":"default","name":"cluster07","uid":"8baac75a-2063-45ee-b408-82076d283cfc","apiVersion":"controlplane.cluster.x-k8s.io/v1beta1","resourceVersion":"2479837"} reason="ControlPlaneUnhealthy"

In the cappx-controller-manager, it doesn't get any further than "Stop Running Scheduler":

manager I0510 20:14:59.416404       1 reconcile.go:44] "updating instance config status" controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cluster07-tkkcj" namespace="default" name="cluster07-tkkcj" reconcileID=07fdc0e5-dc6d-492b-9df2-da58f14fe358
manager I0510 20:14:59.416440       1 proxmoxmachine_controller.go:169] "ProxmoxMachine instance is running" controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/cluster07-tkkcj" namespace="default" name="cluster07-tkkcj" reconcileID=07fdc0e5-dc6d-492b-9df2-da58f14fe358 bios-uuid="044e1968-c078-4c45-b58c-922736bee08c"
manager I0510 20:15:10.041358       1 scheduler.go:175] "Stop Running Scheduler" Name="qemu-scheduler" qemu="cluster07-tkkcj"
manager I0510 20:15:27.895312       1 scheduler.go:175] "Stop Running Scheduler" Name="qemu-scheduler"
manager I0510 20:15:34.416653       1 scheduler.go:175] "Stop Running Scheduler" Name="qemu-scheduler"
manager I0510 20:15:40.417357       1 scheduler.go:175] "Stop Running Scheduler" Name="qemu-scheduler"
manager I0510 20:15:46.416751       1 scheduler.go:175] "Stop Running Scheduler" Name="qemu-scheduler"
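
The "Machine.status.nodeRef not set" message in the control plane log can be checked directly against the Machine objects. Below is a minimal sketch, assuming the JSON list produced by something like kubectl get machines -o json has been loaded into a dict; the function name machines_without_node is made up for illustration.

```python
def machines_without_node(machine_list):
    """Return names of Machines whose status.nodeRef is missing or empty."""
    missing = []
    for item in machine_list.get("items", []):
        status = item.get("status") or {}
        if not status.get("nodeRef"):
            missing.append(item["metadata"]["name"])
    return missing
```

A Machine that stays in this list while its VM is reported as running suggests the node never joined the cluster, so the next place to look would be inside the VM itself rather than in the controller.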

simcax avatar May 10 '24 21:05 simcax