cluster-api-provider-harvester

New cluster using Talos is not progressing beyond Machines in the Provisioning stage.

dhaugli opened this issue 8 months ago · 5 comments

What happened:

The cluster is not coming up: the Harvester load balancer is never created, and the machines never leave the Provisioning state. The machines are provisioned in Harvester and get IPs from my network, and I can attach a console to them, though since they run Talos there is not much to see there.

Screenshot of the console of one of the Talos control plane VMs:

Screenshot 2024-06-06 232557

caph-provider logs:

 ERROR   failed to patch HarvesterMachine        {"controller": "harvestermachine", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HarvesterMachine", "HarvesterMachine": {"name":"capi-mgmt-p-01-zzmph","namespace":"cluster-capi-mgmt-p-01"}, "namespace": "cluster-capi-mgmt-p-01", "name": "capi-mgmt-p-01-zzmph", "reconcileID": "7ec120a6-8a1e-40b1-98dd-3597ce44ca1c", "machine": "cluster-capi-mgmt-p-01/capi-mgmt-p-01-7shhp", "cluster": "cluster-capi-mgmt-p-01/capi-mgmt-p-01", "error": "HarvesterMachine.infrastructure.cluster.x-k8s.io \"capi-mgmt-p-01-zzmph\" is invalid: ready: Required value", "errorCauses": [{"error": "HarvesterMachine.infrastructure.cluster.x-k8s.io \"capi-mgmt-p-01-zzmph\" is invalid: ready: Required value"}]}
github.com/rancher-sandbox/cluster-api-provider-harvester/controllers.(*HarvesterMachineReconciler).Reconcile.func1
        /workspace/controllers/harvestermachine_controller.go:121
github.com/rancher-sandbox/cluster-api-provider-harvester/controllers.(*HarvesterMachineReconciler).Reconcile
        /workspace/controllers/harvestermachine_controller.go:198
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:118
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:314
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:265
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:226
2024-06-06T19:58:10Z    ERROR   Reconciler error        {"controller": "harvestermachine", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HarvesterMachine", "HarvesterMachine": {"name":"capi-mgmt-p-01-zzmph","namespace":"cluster-capi-mgmt-p-01"}, "namespace": "cluster-capi-mgmt-p-01", "name": "capi-mgmt-p-01-zzmph", "reconcileID": "7ec120a6-8a1e-40b1-98dd-3597ce44ca1c", "error": "HarvesterMachine.infrastructure.cluster.x-k8s.io \"capi-mgmt-p-01-zzmph\" is invalid: ready: Required value", "errorCauses": [{"error": "HarvesterMachine.infrastructure.cluster.x-k8s.io \"capi-mgmt-p-01-zzmph\" is invalid: ready: Required value"}]}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:324
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:265
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:226
These two log entries keep repeating:
 2024-06-06T19:58:10Z    INFO    Reconciling HarvesterMachine ...        {"controller": "harvestermachine", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HarvesterMachine", "HarvesterMachine": {"name":"capi-mgmt-p-01-zzmph","namespace":"cluster-capi-mgmt-p-01"}, "namespace": "cluster-capi-mgmt-p-01", "name": "capi-mgmt-p-01-zzmph", "reconcileID": "dc815768-5306-42cc-91c0-be802d85bc82"}
2024-06-06T19:58:10Z    INFO    Waiting for ProviderID to be set on Node resource in Workload Cluster ...       {"controller": "harvestermachine", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HarvesterMachine", "HarvesterMachine": {"name":"capi-mgmt-p-01-zzmph","namespace":"cluster-capi-mgmt-p-01"}, "namespace": "cluster-capi-mgmt-p-01", "name": "capi-mgmt-p-01-zzmph", "reconcileID": "dc815768-5306-42cc-91c0-be802d85bc82", "machine": "cluster-capi-mgmt-p-01/capi-mgmt-p-01-7shhp", "cluster": "cluster-capi-mgmt-p-01/capi-mgmt-p-01"}

capt-controller-manager logs:

I0606 19:58:08.737945       1 taloscontrolplane_controller.go:176] "controllers/TalosControlPlane: successfully updated control plane status" namespace="cluster-capi-mgmt-p-01" talosControlPlane="capi-mgmt-p-01" cluster="capi-mgmt-p-01"
I0606 19:58:08.739615       1 controller.go:327] "Warning: Reconciler returned both a non-zero result and a non-nil error. The result will always be ignored if the error is non-nil and the non-nil error causes reqeueuing with exponential backoff. For more details, see: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#Reconciler" controller="taloscontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="TalosControlPlane" TalosControlPlane="cluster-capi-mgmt-p-01/capi-mgmt-p-01" namespace="cluster-capi-mgmt-p-01" name="capi-mgmt-p-01" reconcileID="b0b79408-8a41-43df-91ef-07fe7d36fa7c"
E0606 19:58:08.739746       1 controller.go:329] "Reconciler error" err="at least one machine should be provided" controller="taloscontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="TalosControlPlane" TalosControlPlane="cluster-capi-mgmt-p-01/capi-mgmt-p-01" namespace="cluster-capi-mgmt-p-01" name="capi-mgmt-p-01" reconcileID="b0b79408-8a41-43df-91ef-07fe7d36fa7c"
I0606 19:58:08.749008       1 taloscontrolplane_controller.go:189] "reconcile TalosControlPlane" controller="taloscontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="TalosControlPlane" TalosControlPlane="cluster-capi-mgmt-p-01/capi-mgmt-p-01" namespace="cluster-capi-mgmt-p-01" name="capi-mgmt-p-01" reconcileID="c37dc309-f8fb-42c7-a375-5faceb9019b9" cluster="capi-mgmt-p-01"
I0606 19:58:09.190175       1 scale.go:33] "controllers/TalosControlPlane: scaling up control plane" Desired=3 Existing=1
I0606 19:58:09.213294       1 taloscontrolplane_controller.go:152] "controllers/TalosControlPlane: attempting to set control plane status"
I0606 19:58:09.220900       1 taloscontrolplane_controller.go:564] "controllers/TalosControlPlane: failed to get kubeconfig for the cluster" error="failed to create cluster accessor: error creating client for remote cluster \"cluster-capi-mgmt-p-01/capi-mgmt-p-01\": error getting rest mapping: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://10.0.0.113:6443/api/v1?timeout=10s\": tls: failed to verify certificate: x509: certificate is valid for 10.0.0.3, 127.0.0.1, ::1, 10.0.0.5, 10.53.0.1, not 10.0.0.113"
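One detail in that last log line seems significant: the controller reaches the workload cluster at https://10.0.0.113:6443, but the API server certificate is only valid for 10.0.0.3, 127.0.0.1, ::1, 10.0.0.5 and 10.53.0.1. If 10.0.0.113 is supposed to be the control plane endpoint, I assume that address would have to end up in the API server cert SANs, e.g. by extending the certSANs patch in controlplane.yaml below to something like this (a sketch only; 10.0.0.113 is taken from the log above, and I don't know yet why the endpoint landed on that address):

- op: add
  path: /cluster/apiServer/certSANs
  value:
    - 127.0.0.1
    - 10.0.0.113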

cabpt-talos-bootstrap logs (I don't know if this is relevant):

I0606 19:58:09.206570       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-npzm4: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.224117       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-npzm4: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.243118       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-npzm4: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.280372       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-npzm4: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.341804       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-df9f2: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.352557       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-df9f2: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.439369       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-df9f2: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.480714       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-df9f2: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.539945       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-df9f2: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.548156       1 secrets.go:174] "controllers/TalosConfig: handling bootstrap data for " owner="capi-mgmt-p-01-n48cx"
I0606 19:58:09.717884       1 secrets.go:174] "controllers/TalosConfig: handling bootstrap data for " owner="capi-mgmt-p-01-n48cx"
I0606 19:58:09.720944       1 secrets.go:174] "controllers/TalosConfig: handling bootstrap data for " owner="capi-mgmt-p-01-7shhp"
I0606 19:58:09.756344       1 talosconfig_controller.go:223] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-npzm4/owner-name=capi-mgmt-p-01-n48cx: ignoring an already ready config"
I0606 19:58:09.765995       1 secrets.go:243] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-npzm4/owner-name=capi-mgmt-p-01-n48cx: updating talosconfig" endpoints=null secret="capi-mgmt-p-01-talosconfig"

What did you expect to happen: I expected the caph provider to create the LB and then proceed with creating the cluster.

How to reproduce it:

I added the Talos providers (bootstrap and control plane) and, of course, the Harvester infrastructure provider.

I then applied the following four manifests, plus the Harvester identity secret.
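For reference, the identity secret is just the Harvester kubeconfig wrapped in an opaque Secret, roughly like this (contents redacted; I am assuming here that kubeconfig is the data key the provider expects):

apiVersion: v1
kind: Secret
metadata:
  name: trollit-harvester-secret
  namespace: cluster-capi-mgmt-p-01
data:
  kubeconfig: <base64-encoded Harvester kubeconfig>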

cluster.yaml:

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: capi-mgmt-p-01
  namespace: cluster-capi-mgmt-p-01
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
        - 172.16.0.0/20
    services:
      cidrBlocks:
        - 172.16.16.0/20
    serviceDomain: cluster.local
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
    kind: TalosControlPlane
    name: capi-mgmt-p-01
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
    kind: HarvesterCluster
    name: capi-mgmt-p-01

harvester-cluster.yaml:

apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
kind: HarvesterCluster
metadata:
  name: capi-mgmt-p-01
  namespace: cluster-capi-mgmt-p-01
spec:
  targetNamespace: cluster-capi-mgmt-p-01
  loadBalancerConfig:
    ipamType: pool
    ipPoolRef: k8s-api
  server: https://10.0.0.3
  identitySecret: 
    name: trollit-harvester-secret
    namespace: cluster-capi-mgmt-p-01
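The k8s-api referenced by ipPoolRef is an IPPool that already exists on the Harvester side, roughly like the sketch below (the range values here are placeholders, and I may be misremembering details of the loadbalancer.harvesterhci.io schema):

apiVersion: loadbalancer.harvesterhci.io/v1beta1
kind: IPPool
metadata:
  name: k8s-api
spec:
  ranges:
    - subnet: 10.0.0.0/24
      rangeStart: 10.0.0.100
      rangeEnd: 10.0.0.200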

harvester-machinetemplate.yaml:

apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
kind: HarvesterMachineTemplate
metadata:
  name: capi-mgmt-p-01
  namespace: cluster-capi-mgmt-p-01
spec:
  template: 
    spec:
      cpu: 2
      memory: 8Gi
      sshUser: ubuntu
      sshKeyPair: default/david
      networks:
      -  cluster-capi-mgmt-p-01/capi-mgmt-network
      volumes:
      - volumeType: image 
        imageName: harvester-public/talos-1.7.4-metalqemu
        volumeSize: 50Gi
        bootOrder: 0

controlplane.yaml:

apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: TalosControlPlane
metadata:
  name: capi-mgmt-p-01
  namespace: cluster-capi-mgmt-p-01
spec:
  version: "v1.30.0"
  replicas: 3
  infrastructureTemplate:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
    kind: HarvesterMachineTemplate
    name: capi-mgmt-p-01
  controlPlaneConfig:
    controlplane:
      generateType: controlplane
      talosVersion: v1.7.4
      configPatches:
        - op: add
          path: /cluster/network
          value:
            cni:
              name: none

        - op: add
          path: /cluster/proxy
          value:
            disabled: true

        - op: add
          path: /cluster/network/podSubnets
          value:
            - 172.16.0.0/20

        - op: add
          path: /cluster/network/serviceSubnets
          value:
            - 172.16.16.0/20

        - op: add
          path: /machine/kubelet/extraArgs
          value:
            cloud-provider: external

        - op: add
          path: /machine/kubelet/nodeIP
          value:
            validSubnets:
              - 10.0.0.0/24

        - op: add
          path: /cluster/discovery
          value:
            enabled: false

        - op: add
          path: /machine/features/kubePrism
          value:
            enabled: true

        - op: add
          path: /cluster/apiServer/certSANs
          value:
            - 127.0.0.1

        - op: add
          path: /cluster/apiServer/extraArgs
          value:
            anonymous-auth: true

Anything else you would like to add:

I have tried switching the load balancer config from dhcp to an ipPoolRef with a pre-configured IP pool; that did not work either. I suspect the root cause is that the LB is never provisioned in the first place.


Environment:

  • Talos control plane provider version: 0.5.5
  • Talos bootstrap provider version: 0.6.4
  • Harvester cluster API provider version: 0.1.2
  • Harvester version (installed on my HP server): 1.3.0
  • OS (e.g. from /etc/os-release):

dhaugli · Jun 06 '24 21:06