
[Bug]: agent fails to connect to Websocket if DNS lookup fails (once) - does not retry?

Open luckman212 opened this issue 4 months ago • 6 comments

Component

Agent

Description

This morning I logged into my Beszel hub and saw that one node was disconnected (WebSocket). On inspection, I found this error in the agent log: WARN WebSocket connection failed err="dial tcp: lookup beszel.acme.foo: i/o timeo>

In the text below, beszel.acme.foo is the redacted public DNS name of my hub:

# service beszel-agent status
● beszel-agent.service - Beszel Agent Service
     Loaded: loaded (/etc/systemd/system/beszel-agent.service; enabled; preset: enabled)
     Active: active (running) since Mon 2025-10-27 21:44:01 EDT; 11h ago
 Invocation: ffd023a7a991495095e225954a379289
   Main PID: 1163 (beszel-agent)
      Tasks: 12 (limit: 75983)
     Memory: 12.3M (peak: 15.7M)
        CPU: 2.505s
     CGroup: /system.slice/beszel-agent.service
             └─1163 /opt/beszel-agent/beszel-agent

Oct 27 21:44:01 pve02 systemd[1]: Started beszel-agent.service - Beszel Agent Service.
Oct 27 21:44:01 pve02 beszel-agent[1163]: 2025/10/27 21:44:01 INFO Data directory path=/var/lib/beszel-agent
Oct 27 21:44:01 pve02 beszel-agent[1163]: 2025/10/27 21:44:01 INFO Detected root device name=dm-1
Oct 27 21:44:01 pve02 beszel-agent[1163]: 2025/10/27 21:44:01 INFO Detected network interface name=tailscale0 sent=48 recv=86
Oct 27 21:44:06 pve02 beszel-agent[1163]: 2025/10/27 21:44:06 WARN WebSocket connection failed err="dial tcp: lookup beszel.acme.foo: i/o timeo>
Oct 27 21:44:06 pve02 beszel-agent[1163]: 2025/10/27 21:44:06 INFO Starting SSH server addr=:45876 network=tcp
Oct 27 21:44:08 pve02 beszel-agent[1163]: 2025/10/27 21:44:08 INFO SSH connected addr=10.101.101.10:42864
Oct 27 21:44:08 pve02 beszel-agent[1163]: 2025/10/27 21:44:08 INFO SSH connection established
Oct 27 21:49:13 pve02 beszel-agent[1163]: 2025/10/27 21:49:13 INFO SSH connected addr=10.101.101.10:60632

This probably occurred during a normal maintenance period when I was rebooting servers and DNS might have been down for a few minutes. Simply bouncing the agent (service beszel-agent restart) got it working again. I think the agent should periodically retry failed DNS lookups if the connection has failed...
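
To illustrate the kind of retry being requested, here is a minimal sketch in Python. It is not Beszel's code (the agent is not written in Python), and the names retry_with_backoff and dial_hub, the port 443, and the delay values are all assumptions made for the example. The idea is to treat a failed DNS lookup like any other connection failure and try again with an increasing delay instead of giving up after the first attempt.

```python
import socket
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    attempt: Callable[[], T],
    *,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    max_attempts: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call attempt() until it succeeds, doubling the wait after each failure.

    socket.gaierror (a failed DNS lookup) and the usual connection errors are
    all subclasses of OSError, so one except clause covers both cases.
    """
    delay = base_delay
    tries = 0
    while True:
        tries += 1
        try:
            return attempt()
        except OSError:
            # Give up only if the caller set a limit; otherwise keep retrying.
            if max_attempts is not None and tries >= max_attempts:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)


def dial_hub() -> socket.socket:
    # Hypothetical stand-in for the WebSocket dial; the port is an assumption.
    return socket.create_connection(("beszel.acme.foo", 443), timeout=5)
```

Calling retry_with_backoff(dial_hub) would keep trying through a short DNS outage like the one described above, waiting 1s, 2s, 4s and so on, capped at 60s between attempts.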

Category

Other

OS / Architecture

Linux/amd64 (Debian 13 trixie)

Beszel version

0.15.0

Installation method

Docker

Perhaps somewhat related

  • https://github.com/henrygd/beszel/issues/1124
  • https://github.com/henrygd/beszel/issues/601

luckman212, Oct 28 '25 13:10

Thanks for reporting.

To clarify, did it fall back to connect successfully via SSH, or was it fully disconnected?

If it does connect via SSH then it stops attempting to connect via WebSocket. I may change the LISTEN env var so defining LISTEN="" disables the SSH server.

henrygd, Oct 28 '25 16:10

It was red in the UI (disconnected) but I can't remember if that was before or after I blocked SSH using iptables on the Hub container. I think before!

luckman212, Oct 28 '25 17:10

> This probably occurred during a normal maintenance period when I was rebooting servers and DNS might have been down for a few minutes. Simply bouncing the agent (service beszel-agent restart) got it working again. I think the agent should periodically retry failed DNS lookups if the connection has failed.

I've run into something similar and thought I'd mention it here in case it helps someone in the future.

All of my remote agents connect to my hub using Tailscale. This works great except sometimes during a reboot an agent queries DNS before Tailscale has fully come up and is able to answer DNS queries for its domain, so the agent gives up and starts the SSH server and the hub connects the "old way".

I believe I have worked around this with changes to two systemd override files. First, tell the agent to wait until Tailscale starts:

sudo systemctl edit beszel-agent.service

[Unit]
After=tailscaled.service

Then use this little trick I stole from someone else to prevent Tailscale from reporting it has started to systemd before it's really ready to provide service:

sudo systemctl edit tailscaled.service

[Service]
ExecStartPost=timeout 60s bash -c 'until tailscale status --peers=false; do sleep 1; done'
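
The same wait-until-ready idea can be aimed at the hub's hostname itself rather than at Tailscale. The following is only a sketch under assumptions: wait_for_dns is a made-up helper, not part of Beszel or Tailscale, and the 60-second timeout simply mirrors the value in the override above. It polls until the name resolves or the deadline passes.

```python
import socket
import time
from typing import Callable


def wait_for_dns(
    hostname: str,
    timeout: float = 60.0,
    interval: float = 1.0,
    resolve: Callable[[str], object] = socket.gethostbyname,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once hostname resolves, or False if timeout seconds pass first."""
    deadline = clock() + timeout
    while True:
        try:
            resolve(hostname)
            return True
        except OSError:
            # socket.gaierror is an OSError subclass: the name is not resolvable yet.
            if clock() >= deadline:
                return False
            sleep(interval)
```

A small script that calls wait_for_dns with the hub's hostname and exits non-zero on False could, in principle, be run before the agent starts, so the first WebSocket attempt is only made once the name resolves.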

davidemyers, Oct 30 '25 20:10

That's a nice trick, but thinking about this further: if we use a hostname for the hub rather than a static IP address, is that DNS name periodically looked up again in case it changes or points to a dynamic IP? Doing so would probably also solve the original problem, and it should be implemented anyway, since many people run Beszel in homelab environments where static IPs are uncommon.
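
As a rough illustration of what periodic re-resolution could look like, here is a hedged Python sketch. It says nothing about how the Beszel agent actually handles DNS; AddressWatcher and resolve_ips are hypothetical names invented for this example. The watcher remembers the last set of addresses a hostname resolved to and reports when a later lookup returns a different set, which a client could use as a signal to reconnect.

```python
import socket
from typing import Callable, FrozenSet, Optional


def resolve_ips(hostname: str) -> FrozenSet[str]:
    """Resolve hostname to the set of IP addresses it currently maps to."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return frozenset(info[4][0] for info in infos)


class AddressWatcher:
    """Tracks the addresses behind a hostname across repeated lookups."""

    def __init__(
        self,
        hostname: str,
        resolve: Callable[[str], FrozenSet[str]] = resolve_ips,
    ) -> None:
        self.hostname = hostname
        self._resolve = resolve
        self.current: Optional[FrozenSet[str]] = None

    def check(self) -> bool:
        """Re-resolve; return True if the addresses changed since the last success."""
        try:
            latest = self._resolve(self.hostname)
        except OSError:
            # Lookup failed: keep the last known addresses and try again next time.
            return False
        changed = self.current is not None and latest != self.current
        self.current = latest
        return changed
```

A caller would invoke check() on a timer (every few minutes, say) and drop and re-establish its connection whenever it returns True. The first successful lookup only records a baseline and is not reported as a change.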

luckman212, Oct 30 '25 20:10

In the case of Tailscale, IP addresses are effectively static so I could have hard-coded the hub address to work around my problem, but that feels inelegant.

Seems to me using a static IP (or DHCP reservation) for the system where the hub runs would be a best practice.

davidemyers, Oct 30 '25 20:10

I have static IPs for my DMZ and for the Docker host behind the Traefik proxy; I'm not talking about internal LAN IPs. I mean WAN IPs, for agents located outside the hub's network (I don't want to, or can't, install Tailscale on every device either).

luckman212, Oct 30 '25 22:10