Talos service in default namespace keeps old node IPs as ready and serving
Bug Report
Description
Hello!
Thanks for this great and secure OS for Kubernetes.
I'm currently evaluating Talos Linux with Cluster API in a vSphere environment. After some rolling updates, I checked the talos service and its associated EndpointSlices: the old node IPs had not been removed and are still reported as "ready" and "serving" (with "terminating" set to false).
I discovered the issue because, after I installed talos-backup, its job kept failing due to the high number of old controlplane nodes.
Thank you!
Logs
EndpointSlices associated with the talos service:
k get endpointslices.discovery.k8s.io talos-ipv4 -o yaml
addressType: IPv4
apiVersion: discovery.k8s.io/v1
endpoints:
- addresses:
  - 192.168.1.191
  conditions:
    ready: true
    serving: true
    terminating: false
- addresses:
  - 192.168.1.63
  conditions:
    ready: true
    serving: true
    terminating: false
- addresses:
  - 192.168.1.192
  conditions:
    ready: true
    serving: true
    terminating: false
- addresses:
  - 192.168.1.64
  conditions:
    ready: true
    serving: true
    terminating: false
- addresses:
  - 192.168.1.193
  conditions:
    ready: true
    serving: true
    terminating: false
- addresses:
  - 192.168.1.194
  conditions:
    ready: true
    serving: true
    terminating: false
- addresses:
  - 192.168.1.95
  conditions:
    ready: true
    serving: true
    terminating: false
[...]
kind: EndpointSlice
[...]
The controlplane nodes:
k get nodes -o wide
NAME                 STATUS   ROLES           AGE   VERSION   INTERNAL-IP     EXTERNAL-IP     OS-IMAGE          KERNEL-VERSION   CONTAINER-RUNTIME
controlplane-fvbtq   Ready    control-plane   78m   v1.34.0   192.168.1.95    192.168.1.95    Talos (v1.11.5)   6.12.57-talos    containerd://2.1.5
controlplane-nnqcd   Ready    control-plane   86m   v1.34.0   192.168.1.193   192.168.1.193   Talos (v1.11.5)   6.12.57-talos    containerd://2.1.5
controlplane-p4bdg   Ready    control-plane   82m   v1.34.0   192.168.1.194   192.168.1.194   Talos (v1.11.5)   6.12.57-talos    containerd://2.1.5
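To make the mismatch explicit, here is a minimal sketch that diffs the two outputs above. The addresses are copied by hand from the EndpointSlice and from the INTERNAL-IP column. The function name `stale_endpoints` is only illustrative and is not part of Talos or kubectl. The slice output is truncated with [...], so the endpoint list may be incomplete.

```python
# Addresses listed in the talos-ipv4 EndpointSlice, copied from the output
# above. The YAML is truncated with [...], so this list may be incomplete.
endpoint_ips = [
    "192.168.1.191",
    "192.168.1.63",
    "192.168.1.192",
    "192.168.1.64",
    "192.168.1.193",
    "192.168.1.194",
    "192.168.1.95",
]

# INTERNAL-IP column of `k get nodes -o wide` (the three live controlplane nodes).
node_ips = {"192.168.1.95", "192.168.1.193", "192.168.1.194"}


def stale_endpoints(endpoints, nodes):
    """Return endpoint addresses that match no current node, keeping order."""
    return [ip for ip in endpoints if ip not in nodes]


stale = stale_endpoints(endpoint_ips, node_ips)
print(stale)
# ['192.168.1.191', '192.168.1.63', '192.168.1.192', '192.168.1.64']
```

Four of the seven addresses shown belong to nodes that no longer exist, yet all of them are still marked ready and serving.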
Environment
- Talos version: v1.11.5
- Kubernetes version: 1.34.0-1.34.2
- Platform: Linux
Talos Linux relies on discovery service data to populate this list.
If you re-use the same secrets for a cluster in which some nodes are changing IPs, you might hit stale discovery data, which will disappear automatically in about 30 minutes.
You can use talosctl get members to look into it further.
All right, thanks for your quick reply. The secrets bootstrap process should be handled by Cluster API and the Talos Controlplane or Bootstrap provider, correct?
talosctl get members returns the same set of nodes as the k get nodes command.
If talosctl get members returns the same result, please grab a support bundle (talosctl support) and create a bug report.