Cilium does not get installed with Terraform
Bug Report
I am trying to deploy a Talos cluster on AWS via Terraform. As soon as I disable kube-proxy and set Cilium to be installed as the default CNI, the deployment gets stuck. The only way I have found to install Cilium is to create the Talos cluster with kube-proxy enabled and then install Cilium as a post-installation step.
I'm using the following Terraform template, with a few modifications where we create a new VPC: https://github.com/isovalent/terraform-aws-talos
But as soon as we deploy the environment, it gets stuck because Cilium does not get installed. As soon as we comment out
cni = {
name = "none"
The installation continues and we get a Talos Cluster up and running.
kubectl describe nodes ip-10-0-4-40
Name: ip-10-0-4-40
Roles: control-plane
Labels: beta.kubernetes.io/arch=amd64
Annotations: node.alpha.kubernetes.io/ttl: 0
talos.dev/owned-labels: ["node-role.kubernetes.io/control-plane"]
talos.dev/owned-taints: ["node-role.kubernetes.io/control-plane"]
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Mon, 25 Mar 2024 13:46:37 +0100
Taints: node-role.kubernetes.io/control-plane:NoSchedule
Unschedulable: false
HolderIdentity: ip-10-0-4-40
AcquireTime: <unset>
RenewTime: Tue, 26 Mar 2024 09:52:56 +0100
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Tue, 26 Mar 2024 09:51:14 +0100 Mon, 25 Mar 2024 13:46:37 +0100 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Tue, 26 Mar 2024 09:51:14 +0100 Mon, 25 Mar 2024 13:46:37 +0100 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Tue, 26 Mar 2024 09:51:14 +0100 Mon, 25 Mar 2024 13:46:37 +0100 KubeletHasSufficientPID kubelet has sufficient PID available
Ready False Tue, 26 Mar 2024 09:51:14 +0100 Mon, 25 Mar 2024 13:46:37 +0100 KubeletNotReady container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
Hostname: ip-10-0-4-40
cpu: 2
ephemeral-storage: 49932Mi
hugepages-2Mi: 0
memory: 1959156Ki
pods: 110
cpu: 1950m
ephemeral-storage: 46853311615
hugepages-2Mi: 0
memory: 1660148Ki
pods: 110
System Info:
Machine ID: 6da8f3b88cc8a2fd83019be63f98e76c
System UUID: ec251856-1324-369c-330a-ef457cdcd067
Boot ID: 60826cf0-98de-44ba-8b34-bfff15e66521
Kernel Version: 6.1.80-talos
OS Image: Talos (v1.6.6)
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.7.13
Kubelet Version: v1.29.2
Kube-Proxy Version: v1.29.2
Non-terminated Pods: (3 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system kube-apiserver-ip-10-0-4-40 200m (10%) 0 (0%) 512Mi (31%) 0 (0%) 20h
kube-system kube-controller-manager-ip-10-0-4-40 50m (2%) 0 (0%) 256Mi (15%) 0 (0%) 20h
kube-system kube-scheduler-ip-10-0-4-40 10m (0%) 0 (0%) 64Mi (3%) 0 (0%) 20h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 260m (13%) 0 (0%)
memory 832Mi (51%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events: <none>
- Talos version: 1.6.6
- Kubernetes version: 1.29.2
- Platform: AWS
It looks like CSRs are enabled, so you need to approve the CSRs manually.
You may be right, but when I try to approve the pending CSRs I get "No resources found":
$ kubectl get csr
csr-2zqft 70s kubernetes.io/kubelet-serving system:node:ip-10-0-5-40 <none> Pending
csr-497p4 31m kubernetes.io/kubelet-serving system:node:ip-10-0-4-194 <none> Pending
csr-4ftvr 31m kubernetes.io/kubelet-serving system:node:ip-10-0-5-124 <none> Pending
csr-6q6sd 46m kubernetes.io/kube-apiserver-client-kubelet system:bootstrap:0dfxfl <none> Approved,Issued
csr-82kgd 70s kubernetes.io/kubelet-serving system:node:ip-10-0-4-65 <none> Pending
csr-8hdns 46m kubernetes.io/kube-apiserver-client-kubelet system:bootstrap:0dfxfl <none> Approved,Issued
csr-b7x77 46m kubernetes.io/kubelet-serving system:node:ip-10-0-6-78 <none> Pending
csr-c9j4n 46m kubernetes.io/kubelet-serving system:node:ip-10-0-4-194 <none> Pending
csr-fwfp8 16m kubernetes.io/kubelet-serving system:node:ip-10-0-6-78 <none> Pending
csr-gbvfs 16m kubernetes.io/kubelet-serving system:node:ip-10-0-5-40 <none> Pending
csr-jsn69 46m kubernetes.io/kubelet-serving system:node:ip-10-0-4-65 <none> Pending
csr-k6xss 16m kubernetes.io/kubelet-serving system:node:ip-10-0-4-194 <none> Pending
csr-l99bw 31m kubernetes.io/kubelet-serving system:node:ip-10-0-6-78 <none> Pending
csr-mfbmf 31m kubernetes.io/kubelet-serving system:node:ip-10-0-5-40 <none> Pending
csr-mwr26 46m kubernetes.io/kube-apiserver-client-kubelet system:bootstrap:0dfxfl <none> Approved,Issued
csr-s8wd5 46m kubernetes.io/kube-apiserver-client-kubelet system:bootstrap:0dfxfl <none> Approved,Issued
csr-sfpgn 46m kubernetes.io/kube-apiserver-client-kubelet system:bootstrap:0dfxfl <none> Approved,Issued
csr-shhdn 16m kubernetes.io/kubelet-serving system:node:ip-10-0-4-65 <none> Pending
csr-t6zst 72s kubernetes.io/kubelet-serving system:node:ip-10-0-4-194 <none> Pending
csr-t7nqd 46m kubernetes.io/kubelet-serving system:node:ip-10-0-5-124 <none> Pending
csr-twcg6 71s kubernetes.io/kubelet-serving system:node:ip-10-0-6-78 <none> Pending
csr-vtnvl 46m kubernetes.io/kubelet-serving system:node:ip-10-0-5-40 <none> Pending
csr-x52sh 73s kubernetes.io/kubelet-serving system:node:ip-10-0-5-124 <none> Pending
csr-xs6xd 31m kubernetes.io/kubelet-serving system:node:ip-10-0-4-65 <none> Pending
csr-xxhwq 16m kubernetes.io/kubelet-serving system:node:ip-10-0-5-124 <none> Pending
$ kubectl certificate approve csr-2zqft
No resources found
error: the server doesn't have a resource type "certificatesigningrequests"
I'm not sure about the CSR not being found; the error seems super weird.
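For reference, here is a rough sketch of how the pending kubelet-serving CSRs from the listing above could be approved in one go. The function names are made up for illustration, and it assumes the default column layout of `kubectl get csr` (name first, signer third, condition last). It only deals with the pending CSRs and does not install a CNI, and you should only approve requests from nodes you trust.

```shell
# Print the names of Pending kubelet-serving CSRs.
# Reads the output of `kubectl get csr --no-headers` on stdin.
# (Hypothetical helper name; relies on the default column order.)
pending_kubelet_csrs() {
    awk '$3 == "kubernetes.io/kubelet-serving" && $NF == "Pending" { print $1 }'
}

# Approve every CSR selected by the filter above, one at a time.
approve_pending_kubelet_csrs() {
    kubectl get csr --no-headers | pending_kubelet_csrs |
    while read -r name; do
        kubectl certificate approve "$name"
    done
}

# Usage (not run here):
#   approve_pending_kubelet_csrs
```

Filtering on the signer name leaves the already issued `kube-apiserver-client-kubelet` requests alone.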
As far as I'm aware this is a "known" issue when using an alternative CNI: https://www.talos.dev/v1.6/kubernetes-guides/network/deploying-cilium/#method-1-helm-install
After applying the machine config and bootstrapping Talos will appear to hang on phase 18/19 with the message: retrying error: node not ready. This happens because nodes in Kubernetes are only marked as ready once the CNI is up. As there is no CNI defined, the boot process is pending and will reboot the node to retry after 10 minutes, this is expected behavior.
So you have to manually set up a CNI of your choice.
EDIT: Or host the template file and use it in a patch.
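A minimal sketch of both routes, in case it helps. The wrapper function `cilium_helm` is a made-up name, and the Helm values are my understanding of what the linked guide describes for a cluster without kube-proxy, so treat them as assumptions and check them against the guide for your Cilium version. In particular, `k8sServiceHost=localhost` with port `7445` assumes KubePrism is enabled on the nodes.

```shell
# Run a helm subcommand ("install" or "template") for the Cilium chart with
# values intended for Talos without kube-proxy. The values are assumptions
# based on the Talos Cilium guide; verify them before use.
cilium_helm() {
    action="${1:?usage: cilium_helm install|template}"
    helm "$action" cilium cilium/cilium \
        --namespace kube-system \
        --set ipam.mode=kubernetes \
        --set kubeProxyReplacement=true \
        --set securityContext.capabilities.ciliumAgent="{CHOWN,KILL,NET_ADMIN,NET_RAW,IPC_LOCK,SYS_ADMIN,SYS_RESOURCE,DAC_OVERRIDE,FOWNER,SETGID,SETUID}" \
        --set securityContext.capabilities.cleanCiliumState="{NET_ADMIN,SYS_ADMIN,SYS_RESOURCE}" \
        --set cgroup.autoMount.enabled=false \
        --set cgroup.hostRoot=/sys/fs/cgroup \
        --set k8sServiceHost=localhost \
        --set k8sServicePort=7445
}

# Usage (not run here):
#   helm repo add cilium https://helm.cilium.io/
#   helm repo update
#   cilium_helm install                 # manual install while the nodes wait for a CNI
#   cilium_helm template > cilium.yaml  # render a manifest to host and use in a patch
```

The `install` form has to be run while the nodes are still waiting for a CNI. The `template` form only renders the manifest, which you would then host somewhere the nodes can reach.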