`talosctl reset` hangs quite often, even if the node is shutdown already

Open thetillhoff opened this issue 5 months ago • 6 comments
Bug Report

Description

When running talosctl reset, either with or without --system-labels-to-wipe STATE --system-labels-to-wipe EPHEMERAL, the command quite often never finishes (4 out of 6 times for me). The machine has already shut down, and I can safely run kubectl delete node, too. It's just the reset command that somehow still expects some feedback from the node.

Logs

 ◲ watching nodes: [<ip-redacted>]
    * <ip-redacted>: phase: unmountSystem action: STOP

FYI: I had different phases displayed in other cases.

Environment

  • Talos version: 1.10.3
  • Kubernetes version: 1.33.1
  • Platform: hcloud

thetillhoff avatar May 29 '25 19:05 thetillhoff

Please provide console logs captured while the reset command is issued (talosctl support would not help us here since the machine is getting rebooted). Hcloud might provide a way to access machine logs via their cloud console.

frezbo avatar May 30 '25 08:05 frezbo

Well, since the reset wipes the logs on disk ... and hcloud sadly doesn't store them elsewhere automatically... The best I could provide are screenshots and maybe a video.

But now, I've already finished the upgrade, so I think I'll close this for now. If someone else has the same problem, they can reopen.

thetillhoff avatar May 30 '25 11:05 thetillhoff

A symptom like this one usually means that you use some form of VIP for the endpoints in the talosconfig. Don't do this; use direct controlplane IPs instead, and talosctl will handle the rest.

smira avatar Jun 02 '25 09:06 smira

To make sure: Do I need the endpoints for anything, or could I also just omit them completely? Currently, I'm using them like this, but now that you say it, the endpoints field might be completely unnecessary...

The relevant part of my talosconfig:

contexts:
    example:
        endpoints:
            - IP-A-redacted
            - IP-B-redacted
        nodes:
            - IP-A-redacted
            - IP-B-redacted

EDIT: From the docs:

endpoints are the communication endpoints to which the client directly talks. These can be load balancers, DNS hostnames, a list of IPs, etc. If multiple endpoints are specified, the client will automatically load balance and fail over between them. It is recommended that these point to the set of control plane nodes, either directly or through a load balancer.

In my case it's a list of IPs. I don't fully understand yet why I should not set the endpoints in my case.

Doesn't that break other stuff? For example, without endpoints in the talosconfig, talosctl version and talosctl dashboard -n IP-A-redacted both fail with: error constructing client: failed to determine endpoints.

thetillhoff avatar Jun 02 '25 17:06 thetillhoff

I don't fully understand yet why I should not set the endpoints in my case.

I don't understand your question. You should set the endpoints, and make sure those are set to machine IPs. Endpoints are required.

Floating IPs/VIPs, etc. would behave like you described - the communication would appear to hang on operations which mutate Talos state, like reset, Kubernetes upgrades, etc.

smira avatar Jun 03 '25 08:06 smira
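
As a rough illustration of the advice above, here is a small Python sketch that sanity-checks one talosconfig context. It is not part of talosctl: the function name and the sample addresses are hypothetical, and the check is only a heuristic. It flags endpoints that are not IP literals (for example DNS names) and endpoints that do not also appear under nodes, which may point to a VIP or load balancer. It cannot prove that a given IP is or is not a VIP.

```python
import ipaddress


def suspicious_endpoints(context):
    """Return endpoints that are not IP literals or are not listed under nodes.

    `context` is one entry from the `contexts` map of a talosconfig,
    already parsed into a dict (parsing the YAML is left out here).
    """
    nodes = set(context.get("nodes", []))
    flagged = []
    for endpoint in context.get("endpoints", []):
        try:
            ipaddress.ip_address(endpoint)
            is_ip = True
        except ValueError:
            is_ip = False
        # Heuristic: a hostname, or an address that is not one of the
        # listed machines, may be a VIP or load balancer.
        if not is_ip or endpoint not in nodes:
            flagged.append(endpoint)
    return flagged


# Hypothetical sample data, not taken from the thread.
context = {
    "endpoints": ["10.0.0.2", "10.0.0.3", "cp.example.internal", "10.0.0.100"],
    "nodes": ["10.0.0.2", "10.0.0.3"],
}
print(suspicious_endpoints(context))
```

A config like the one quoted earlier, where endpoints and nodes list the same machine IPs, would produce an empty list.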

I see - That was exactly what I meant.

Well, since I'm not using floating IPs/VIPs, I don't think they are the reason for the issues I'm facing 😄

thetillhoff avatar Jun 03 '25 10:06 thetillhoff

I'm having the same issue. I've hardly ever seen talosctl reset not hang. We don't use VIPs etc. Network config is essentially just dhcp: true.

Could the issue occur when using the hostname to resolve the node instead of the IP directly (via the machine.features.hostDNS.resolveMemberNames feature)?

Example of the server having completed reset and shut down but the command being stuck:

Image

dbackeus avatar Jul 23 '25 08:07 dbackeus
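
The thread does not identify a root cause for the non-VIP cases, so the following is only a workaround sketch, not a fix: wrap the command in a client-side deadline so that automation does not block forever. It uses Python's standard subprocess module; the helper name run_with_deadline is hypothetical, and the sleeping child process is a stand-in so the example runs without a cluster.

```python
import subprocess
import sys


def run_with_deadline(cmd, seconds):
    """Run cmd; return its exit code, or None if it exceeded `seconds`."""
    try:
        return subprocess.run(cmd, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising,
        # so no stuck client is left behind.
        return None


# Stand-in for a hanging command. For the case in this thread the list
# would instead hold the talosctl reset invocation, for example:
# ["talosctl", "reset", "-n", "<node-ip>",
#  "--system-labels-to-wipe", "STATE", "--system-labels-to-wipe", "EPHEMERAL"]
hanging = [sys.executable, "-c", "import time; time.sleep(30)"]

if run_with_deadline(hanging, 1) is None:
    print("command did not return in time; check the node's console before continuing")
```

Killing the client only stops the local wait. Whether the reset actually completed on the node still has to be confirmed separately, for example from the machine's console.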