talos
`talosctl reset` hangs quite often, even if the node has already shut down
Bug Report
Description
When running talosctl reset either with or without --system-labels-to-wipe STATE --system-labels-to-wipe EPHEMERAL, it happens quite often (4/6 times for me) that the command never finishes.
The machine has already shut down, and I can safely run kubectl delete node, too.
It's just the reset command that somehow still expects some feedback from the node.
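As a stopgap for scripted resets, the call can be bounded with a timeout so that a hang does not block the rest of the script. The following is only a minimal sketch in Python: run_bounded is a hypothetical helper, not part of talosctl, and it does nothing about the root cause of the hang.

```python
import subprocess
import sys


def run_bounded(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it ran longer than timeout_s seconds."""
    try:
        # subprocess.run kills the child process and raises TimeoutExpired on timeout.
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        print(f"gave up after {timeout_s}s: {' '.join(cmd)}", file=sys.stderr)
        return None


# Example call (NODE-IP is a placeholder, the 300 s limit is an arbitrary assumption):
# run_bounded(
#     ["talosctl", "reset", "-n", "NODE-IP",
#      "--system-labels-to-wipe", "STATE",
#      "--system-labels-to-wipe", "EPHEMERAL"],
#     timeout_s=300,
# )
```

A None result only means the command did not return in time; whether the node actually finished the reset still has to be checked separately, for example on the cloud console.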
Logs
◲ watching nodes: [<ip-redacted>]
* <ip-redacted>: phase: unmountSystem action: STOP
FYI: I had different phases displayed in other cases.
Environment
- Talos version: 1.10.3
- Kubernetes version: 1.33.1
- Platform: hcloud
Please provide console logs captured while the reset command is running (talosctl support would not help us here, since the machine is getting rebooted). Hcloud might provide a way to access machine logs via their cloud console.
Well, since the reset wipes the logs on disk ... and hcloud sadly doesn't store them elsewhere automatically... The best I could do is screenshots and maybe a video.
But now, I've already finished the upgrade, so I think I'll close this for now. If someone else has the same problem, they can reopen.
A symptom like this usually means that you are using some form of VIP for the endpoints in the talosconfig. Don't do this; use direct controlplane IPs instead, and talosctl will handle the rest.
To make sure:
Do I need the endpoints for anything, or could I also just omit them completely?
Currently, I'm using them like this, but now that you say it, the endpoints field might be completely unnecessary...
The relevant part of my talosconfig:
contexts:
    example:
        endpoints:
            - IP-A-redacted
            - IP-B-redacted
        nodes:
            - IP-A-redacted
            - IP-B-redacted
EDIT: From the docs:
endpoints are the communication endpoints to which the client directly talks. These can be load balancers, DNS hostnames, a list of IPs, etc. If multiple endpoints are specified, the client will automatically load balance and fail over between them. It is recommended that these point to the set of control plane nodes, either directly or through a load balancer.
In my case it's a list of IPs. I don't fully understand yet why I should not set the endpoints in my case.
Doesn't that break other stuff? For example, without endpoints in the talosconfig, talosctl version and talosctl dashboard -n IP-A-redacted both return: error constructing client: failed to determine endpoints.
I don't fully understand yet why I should not set the endpoints in my case.
I don't understand your question. You should set the endpoints, and make sure those are set to machine IPs. Endpoints are required.
Floating IPs/VIPs, etc. would behave like you described - the communication would appear to hang on operations which mutate Talos state, like reset, Kubernetes upgrades, etc.
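One part of this advice can be checked mechanically: whether every endpoint is an IP literal rather than a hostname or load balancer name. A rough sketch in Python, where non_ip_endpoints is a hypothetical helper and the addresses are made up. It cannot tell whether an address is a floating IP or VIP, since those are valid IP literals too, and it assumes endpoints are written without a port.

```python
import ipaddress


def non_ip_endpoints(endpoints):
    """Return the endpoints that are not plain IPv4/IPv6 literals."""
    flagged = []
    for endpoint in endpoints:
        try:
            ipaddress.ip_address(endpoint)
        except ValueError:
            # Not parseable as an IP address, so most likely a hostname.
            flagged.append(endpoint)
    return flagged


# Made-up endpoint list for illustration only.
print(non_ip_endpoints(["10.0.0.1", "cp.example.com", "fd00::1"]))
```

Anything the helper returns is worth a second look; an empty result still says nothing about whether the listed IPs belong to individual machines.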
I see - That was exactly what I meant.
Well, since I'm not using floating IPs/VIPs, I don't think they are the reason for the issues I'm facing 😄
I'm having the same issue. I've hardly ever seen talosctl reset not hang. We don't use VIPs etc. Network config is essentially just dhcp: true.
Could the issue occur when using the hostname to resolve the node instead of the IP directly (via the machine.features.hostDNS.resolveMemberNames feature)?
Example of the server having completed reset and shut down but the command being stuck: