cluster-api-bootstrap-provider-talos
Question/issue around Talos bootstrap with Cluster API & vSphere infrastructure (CAPV)
Greetings,
We've been playing with Talos Linux and Cluster API to automate the management of our clusters, and are currently facing some questions/issues around the bootstrap process using the vSphere infrastructure provider.
Versions / Environment
- Kubernetes: 1.27.5
- Talos: 1.5.2 (OVA)
- Cluster API Infrastructure: vSphere 1.8.1
- Cluster API Bootstrap: Talos 0.6.2
- Cluster API CP: Talos 0.5.3
- VMware ESXi 7.0.3
Description
According to the Talos VMware documentation, we have to install a custom talos-vmtools with some dedicated Talos config.
This makes sense; however, my concern is the following:
To bootstrap the cluster via Cluster API, and especially via the CACPPT controller, the CAPV controller must retrieve the IP address of the VM through the vCenter API. However, that IP is only available once the VMTools have been installed and configured successfully. Unfortunately, installing the VMTools requires the Talos bootstrap to be done already, because they are deployed as a DaemonSet. This is a chicken-and-egg problem.
Our current workaround is to bootstrap the cluster manually using the IP addresses handed out by DHCP. However, this is quite a pain, as we want to automate everything via GitOps: we will manage quite a lot of permanent clusters, but also some ephemeral ones.
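For readers unfamiliar with the manual workaround, it can be sketched roughly as follows. This is only an illustration under assumptions that are not stated in the original post: the generated talosconfig is assumed to live in a Secret named `<cluster>-talosconfig` under the key `talosconfig`, and the node IP in the example is a placeholder for whatever address DHCP handed out. The helper names `fetch_talosconfig` and `bootstrap_cmd` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of the manual workaround: bootstrap one control plane node through
# its DHCP-assigned IP, because CAPV never reports an address for it.

# Fetch the generated talosconfig from the management cluster.
# ASSUMPTION: Secret name "<cluster>-talosconfig" and key "talosconfig".
fetch_talosconfig() {
  local cluster="$1" namespace="$2" out="$3"
  kubectl get secret "${cluster}-talosconfig" -n "$namespace" \
    -o jsonpath='{.data.talosconfig}' | base64 -d > "$out"
}

# Print (do not run) the talosctl command that bootstraps the given node.
bootstrap_cmd() {
  local node_ip="$1" talosconfig="$2"
  printf 'talosctl --talosconfig %s --endpoints %s --nodes %s bootstrap\n' \
    "$talosconfig" "$node_ip" "$node_ip"
}

# Example usage (172.30.11.21 is a placeholder IP, not from the post):
#   fetch_talosconfig observability-cluster-poc cluster-api-system ./talosconfig
#   bootstrap_cmd 172.30.11.21 ./talosconfig   # review the command, then run it
```

Printing the command rather than executing it keeps the manual step reviewable; the bootstrap should be issued against a single control plane node only.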
Do you have any insights or recommendations for achieving this goal within the VMware ecosystem?
Reproduce Steps
The following steps can be performed to easily reproduce the issue:
- Create a transient cluster that will be used to spawn the first permanent management cluster via Cluster API.
The cluster can either be created directly on vSphere or kind/k3d/...
- Initialize Cluster API components on the transient cluster with clusterctl, installing CAPV, CABPT and CACPPT:
clusterctl init \
--infrastructure vsphere:v1.8.1 \
--bootstrap talos:v0.6.2 \
--control-plane talos:v0.5.3 \
--target-namespace cluster-api-system
- Create the permanent management cluster with the following minimal manifests:
---
apiVersion: v1
kind: Secret
metadata:
  name: observability-cluster-poc
  namespace: cluster-api-system
stringData:
  password: REDACTED
  username: REDACTED
---
apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
kind: TalosConfigTemplate
metadata:
  name: observability-cluster-poc-md-0
  namespace: cluster-api-system
spec:
  template:
    spec:
      configPatches:
        - op: add
          path: /machine/network
          value:
            interfaces:
              - dhcp: true
                dhcpOptions:
                  routeMetric: 1
                interface: eth0
              - dhcp: true
                dhcpOptions:
                  routeMetric: 10
                interface: eth1
        - op: add
          path: /machine/install
          value:
            extraKernelArgs:
              - net.ifnames=0
        - op: add
          path: /cluster/network/cni
          value:
            name: none
        - op: add
          path: /cluster/proxy
          value:
            disabled: true
        - op: add
          path: /machine/features/kubePrism
          value:
            enabled: true
            port: 7445
        - op: replace
          path: /cluster/controlPlane
          value:
            endpoint: https://172.30.11.10:6443
        - op: add
          path: /machine/certSANs
          value:
            - 172.30.11.10
        - op: add
          path: /machine/time
          value:
            disabled: false
            servers:
              - 172.30.110.1
        - op: replace
          path: /cluster/extraManifests
          value:
            - https://raw.githubusercontent.com/mologie/talos-vmtoolsd/master/deploy/unstable.yaml
        - op: add
          path: /machine/kubelet/extraArgs
          value:
            cloud-provider: external
      generateType: worker
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  labels:
    cluster.x-k8s.io/cluster-name: observability-cluster-poc
  name: observability-cluster-poc
  namespace: cluster-api-system
spec:
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
    kind: TalosControlPlane
    name: observability-cluster-poc
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: VSphereCluster
    name: observability-cluster-poc
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  labels:
    cluster.x-k8s.io/cluster-name: observability-cluster-poc
  name: observability-cluster-poc-md-0
  namespace: cluster-api-system
spec:
  clusterName: observability-cluster-poc
  replicas: 3
  selector:
    matchLabels: {}
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      labels:
        cluster.x-k8s.io/cluster-name: observability-cluster-poc
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
          kind: TalosConfigTemplate
          name: observability-cluster-poc-md-0
      clusterName: observability-cluster-poc
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: VSphereMachineTemplate
        name: observability-cluster-poc-worker
      version: v1.27.5
---
apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: TalosControlPlane
metadata:
  name: observability-cluster-poc
  namespace: cluster-api-system
spec:
  controlPlaneConfig:
    controlplane:
      configPatches:
        - op: add
          path: /machine/network
          value:
            interfaces:
              - dhcp: true
                dhcpOptions:
                  routeMetric: 1
                interface: eth0
                vip:
                  ip: 172.30.11.10
              - dhcp: true
                dhcpOptions:
                  routeMetric: 10
                interface: eth1
        - op: add
          path: /machine/install
          value:
            extraKernelArgs:
              - net.ifnames=0
        - op: add
          path: /cluster/network/cni
          value:
            name: none
        - op: add
          path: /cluster/proxy
          value:
            disabled: true
        - op: add
          path: /machine/features/kubePrism
          value:
            enabled: true
            port: 7445
        - op: replace
          path: /cluster/controlPlane
          value:
            endpoint: https://172.30.11.10:6443
        - op: add
          path: /machine/certSANs
          value:
            - 172.30.11.10
        - op: add
          path: /cluster/coreDNS
          value:
            disabled: true
        - op: add
          path: /machine/time
          value:
            disabled: false
            servers:
              - 172.30.110.1
        - op: replace
          path: /cluster/extraManifests
          value:
            - https://raw.githubusercontent.com/mologie/talos-vmtoolsd/master/deploy/unstable.yaml
        - op: add
          path: /machine/kubelet/extraArgs
          value:
            cloud-provider: external
      generateType: controlplane
  infrastructureTemplate:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: VSphereMachineTemplate
    name: observability-cluster-poc
  replicas: 3
  rolloutStrategy:
    rollingUpdate:
      maxSurge: 1
    type: RollingUpdate
  version: v1.27.6
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereCluster
metadata:
  name: observability-cluster-poc
  namespace: cluster-api-system
spec:
  controlPlaneEndpoint:
    host: 172.30.11.10
    port: 6443
  identityRef:
    kind: Secret
    name: observability-cluster-poc
  server: REDACTED
  thumbprint: REDACTED
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereMachineTemplate
metadata:
  name: observability-cluster-poc
  namespace: cluster-api-system
spec:
  template:
    spec:
      cloneMode: linkedClone
      customVMXKeys:
        disk.EnableUUID: "true"
      datacenter: REDACTED
      datastore: REDACTED
      diskGiB: 25
      folder: cluster-api-vms
      memoryMiB: 8192
      network:
        devices:
          - dhcp4: true
            dhcp4Overrides:
              routeMetric: 1
            networkName: PLATFORM-PRODUCTION-OBSERVABILITY
          - dhcp4: true
            dhcp4Overrides:
              routeMetric: 10
            networkName: PRODUCTION
      numCPUs: 2
      os: Linux
      powerOffMode: hard
      resourcePool: Cluster-API-POC
      server: REDACTED
      storagePolicyName: ""
      tagIDs:
        - urn:vmomi:InventoryServiceTag:0fe8eb41-7a8f-47b3-a9fe-0d288ec787dd:GLOBAL
        - urn:vmomi:InventoryServiceTag:4495a9ce-727a-4814-b067-682b52130cad:GLOBAL
      template: talos-linux-1.5.2
      thumbprint: REDACTED
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereMachineTemplate
metadata:
  name: observability-cluster-poc-worker
  namespace: cluster-api-system
spec:
  template:
    spec:
      cloneMode: linkedClone
      customVMXKeys:
        disk.EnableUUID: "true"
      datacenter: REDACTED
      datastore: REDACTED
      diskGiB: 25
      folder: cluster-api-vms
      memoryMiB: 8192
      network:
        devices:
          - dhcp4: true
            dhcp4Overrides:
              routeMetric: 1
            networkName: PLATFORM-PRODUCTION-OBSERVABILITY
          - dhcp4: true
            dhcp4Overrides:
              routeMetric: 10
            networkName: PRODUCTION
      numCPUs: 2
      os: Linux
      powerOffMode: hard
      resourcePool: Cluster-API-POC
      server: REDACTED
      storagePolicyName: ""
      tagIDs:
        - urn:vmomi:InventoryServiceTag:0fe8eb41-7a8f-47b3-a9fe-0d288ec787dd:GLOBAL
        - urn:vmomi:InventoryServiceTag:4495a9ce-727a-4814-b067-682b52130cad:GLOBAL
      template: talos-linux-1.5.2
      thumbprint: REDACTED
- Once the VMs are created, confirm that the bootstrap cannot occur: VMTools cannot be installed before the bootstrap, and the bootstrap cannot reach the VMs because vCenter reports no IP addresses for them.
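One way to confirm the last step from the management cluster is to list which Machine objects still have no address. The sketch below is only an illustration: it relies on the standard `status.addresses` field of Cluster API Machine objects, and the helper names `list_machine_addresses` and `machines_without_address` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: find Cluster API Machines for which no address has been reported.

# Print "<machine-name><TAB><addresses>" for every Machine in a namespace.
list_machine_addresses() {
  local namespace="$1"
  kubectl get machines -n "$namespace" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[*].address}{"\n"}{end}'
}

# Read that tab-separated listing on stdin and keep only the machine names
# whose address column is empty.
machines_without_address() {
  awk -F'\t' '$2 == "" { print $1 }'
}

# Example usage:
#   list_machine_addresses cluster-api-system | machines_without_address
```

In the situation described here, every control plane machine would be expected to appear in the output, which matches the "no addresses were found for node" error in the CACPPT logs.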
Useful outputs/content
Talos console:
vSphere machine (no IP due to VMtools not being installable at this point in time):
CACPPT logs:
2023-10-20T06:56:47Z INFO reconcile TalosControlPlane {"controller": "taloscontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "TalosControlPlane", "TalosControlPlane": {"name":"observability-cluster-poc","namespace":"cluster-api-system"}, "namespace": "cluster-api-system", "name": "observability-cluster-poc", "reconcileID": "be96027a-b052-4819-bd53-8215a326733f", "cluster": "observability-cluster-poc"}
2023-10-20T06:56:47Z INFO controllers.TalosControlPlane bootstrap failed, retrying in 20 seconds {"namespace": "cluster-api-system", "talosControlPlane": "observability-cluster-poc", "error": "no addresses were found for node \"observability-cluster-poc-bzpgr\""}
2023-10-20T06:56:47Z INFO controllers.TalosControlPlane attempting to set control plane status
2023-10-20T06:56:57Z INFO controllers.TalosControlPlane failed to get kubeconfig for the cluster {"error": "failed to create cluster accessor: error creating client for remote cluster \"cluster-api-system/observability-cluster-poc\": error getting rest mapping: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://172.30.11.10:6443/api/v1?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)", "errorVerbose": "failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://172.30.11.10:6443/api/v1?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)\nerror creating client for remote cluster \"cluster-api-system/observability-cluster-poc\": error getting rest mapping\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).createClient\n\t/.cache/mod/sigs.k8s.io/[email protected]/controllers/remote/cluster_cache_tracker.go:396\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).newClusterAccessor\n\t/.cache/mod/sigs.k8s.io/[email protected]/controllers/remote/cluster_cache_tracker.go:299\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).getClusterAccessor\n\t/.cache/mod/sigs.k8s.io/[email protected]/controllers/remote/cluster_cache_tracker.go:273\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).GetClient\n\t/.cache/mod/sigs.k8s.io/[email 
protected]/controllers/remote/cluster_cache_tracker.go:180\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).updateStatus\n\t/src/controllers/taloscontrolplane_controller.go:562\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).Reconcile.func1\n\t/src/controllers/taloscontrolplane_controller.go:155\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).Reconcile\n\t/src/controllers/taloscontrolplane_controller.go:184\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/.cache/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/.cache/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/.cache/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/.cache/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:226\nruntime.goexit\n\t/toolchain/go/src/runtime/asm_amd64.s:1598\nfailed to create cluster accessor\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).getClusterAccessor\n\t/.cache/mod/sigs.k8s.io/[email protected]/controllers/remote/cluster_cache_tracker.go:275\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).GetClient\n\t/.cache/mod/sigs.k8s.io/[email 
protected]/controllers/remote/cluster_cache_tracker.go:180\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).updateStatus\n\t/src/controllers/taloscontrolplane_controller.go:562\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).Reconcile.func1\n\t/src/controllers/taloscontrolplane_controller.go:155\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).Reconcile\n\t/src/controllers/taloscontrolplane_controller.go:184\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/.cache/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/.cache/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/.cache/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/.cache/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:226\nruntime.goexit\n\t/toolchain/go/src/runtime/asm_amd64.s:1598"}
2023-10-20T06:56:57Z INFO controllers.TalosControlPlane successfully updated control plane status {"namespace": "cluster-api-system", "talosControlPlane": "observability-cluster-poc", "cluster": "observability-cluster-poc"}
2023-10-20T06:56:57Z INFO reconcile TalosControlPlane {"controller": "taloscontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "TalosControlPlane", "TalosControlPlane": {"name":"observability-cluster-poc","namespace":"cluster-api-system"}, "namespace": "cluster-api-system", "name": "observability-cluster-poc", "reconcileID": "2bb6e4b1-8a51-4c48-b463-eb6b0a915de8", "cluster": "observability-cluster-poc"}
2023-10-20T06:56:57Z INFO controllers.TalosControlPlane bootstrap failed, retrying in 20 seconds {"namespace": "cluster-api-system", "talosControlPlane": "observability-cluster-poc", "error": "no addresses were found for node \"observability-cluster-poc-bzpgr\""}
2023-10-20T06:56:57Z INFO controllers.TalosControlPlane attempting to set control plane status
Thanks in advance for your help and insights.
It was discussed in community Slack, but it didn't quite go that far.
VMware users need to reimplement vmtoolsd as a Talos system extension (and an extension service), so that it always runs with the machine.
Another option is to make Talos itself report IPs, if we can do that without pulling in all the VMware libraries.
Hi everyone, I face the same problem right now. Are there any updates or instructions to follow to work around this?
Also interested in seeing a fix for this issue, thanks.
I found a way to deploy: build a Talos image with vmtoolsd installed by default using the Talos Image Factory, then use that image as the baseline template for the deployment. Please check here: https://factory.talos.dev/.
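For anyone trying the Image Factory route, the flow can be sketched as below. Treat it as a rough illustration rather than verified instructions: the extension name `siderolabs/vmtoolsd-guest-agent`, the `/schematics` endpoint and the `vmware-amd64.ova` URL layout are assumptions based on the Image Factory's documented conventions, and whether the extension is available for a given Talos version needs to be checked on the factory site. The helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: request a Talos VMware image that bundles the vmtoolsd extension.

# Write a schematic that asks for the guest agent extension.
# ASSUMPTION: the extension is published as siderolabs/vmtoolsd-guest-agent.
write_schematic() {
  cat > "$1" <<'EOF'
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/vmtoolsd-guest-agent
EOF
}

# Build the OVA download URL for a schematic ID and a Talos version.
# ASSUMPTION: URL layout /image/<schematic-id>/<version>/vmware-amd64.ova.
factory_ova_url() {
  local schematic_id="$1" talos_version="$2"
  printf 'https://factory.talos.dev/image/%s/%s/vmware-amd64.ova\n' \
    "$schematic_id" "$talos_version"
}

# Example usage:
#   write_schematic schematic.yaml
#   curl -sS -X POST --data-binary @schematic.yaml https://factory.talos.dev/schematics
#   # The JSON response contains the schematic "id"; pass it on:
#   factory_ova_url "<schematic-id>" "<talos-version>"
```

The resulting OVA would then be imported into vSphere and referenced from the `template` field of the VSphereMachineTemplate, so that vmtoolsd is already running when CAPV looks up the VM's IP address.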