Linux Mesh Agent hangs post-TLS handshake during WebSocket upgrade (Worked Previously)

Open PoptopL opened this issue 6 months ago • 3 comments

The MeshCentral agent on a specific Ubuntu 25.04 machine, which previously worked correctly, now fails to connect. The meshagent.service reports as running (or the agent can be run manually). However, the device never appears as connected in the MeshCentral server console. When run manually, the agent prints "Connecting to wss://[YOUR_MESH_SERVER_HOSTNAME]:443/agent.ashx" and then hangs indefinitely. Interrupting with Ctrl+C causes it to print "Connected." and then exit, but this does not reflect a true, stable connection.

However, other devices on the same local network (including other Linux devices) connect successfully to the same MeshCentral server. Furthermore, this specific agent on this machine used to work without issue and then suddenly stopped connecting, despite no known manual configuration changes to the client machine or the agent installation before the issue arose.

One thing of note is that occasionally, with no reliable reproducibility, stopping the service leads to the device showing up on MeshCentral, but only as offline. No other information is gathered for it, and the logs simply say, "Added device [name] to device group [x]". Another symptom is that I am unable to stop the MeshCentral agent service gracefully (only by killing and disabling it), but I believe this is due to the agent being stuck connecting to the WSS URI and not responding to the request to exit.

I am unsure how to reproduce this error: my attempts to reproduce it on anything other than the problematic machine have not succeeded.

Other Info

  1. DNS Resolution and TCP Connectivity:

    • ping [YOUR_MESH_SERVER_HOSTNAME] resolves correctly.
    • telnet [YOUR_MESH_SERVER_HOSTNAME] 443 connects successfully (TCP layer OK).
  2. TLS Handshake:

    • openssl s_client -connect [YOUR_MESH_SERVER_HOSTNAME]:443 -servername [YOUR_MESH_SERVER_HOSTNAME] completes successfully with Verify return code: 0 (ok).
    • Server certificate is valid (Let's Encrypt) and trusted by the client system.
    • ca-certificates package is up-to-date. System time is correct.
  3. Local Firewall (ufw): Inactive.

  4. VPN/Proxy: Issue persists identically with VPN/proxy software completely disabled. VPN is not the cause.

  5. Agent Reinstallation: Multiple forceful removals (service files, all known agent directories: /opt/meshagent/, /usr/local/mesh/, /usr/local/mesh_services/, /var/opt/meshagent/) and reinstallations using the official installer script. Issue remains.

  6. Agent Configuration (.msh file): Correctly contains MeshServer=wss://[YOUR_MESH_SERVER_HOSTNAME]:443/agent.ashx, and is nearly identical except for relevant agent information to other working mesh agents on other devices.

  7. Systemd Service Configuration: Runs agent as root. StandardOutput was initially null, changed to journal, but logs then only showed the same "Connecting to..." message followed by the hang.

  8. strace ./meshagent (summary):

    • Completes extensive system/hardware information gathering via child processes.
    • Agent attempts to openat() several .js files (e.g., linux-gnome-helpers.js) resulting in ENOENT; understood to be non-critical for native agent core function.
    • Successfully resolves server hostname.
    • Prints "Connecting to..." message.
    • Establishes TCP connection (non-blocking connect() returns EINPROGRESS, later confirmed by pselect6 socket becoming writable).
    • Successfully completes TLS handshake (sends Client Hello, exchanges TLS records with server).
    • Sends a final block of data (presumed WebSocket upgrade request / initial application data).
    • Hangs at this point, likely in pselect6() or ppoll() waiting for a response on the socket.
    • Ctrl+C (SIGINT) interrupts this wait, triggering a cleanup sequence that misleadingly prints "Connected." before exit.
  9. DMI Information: Agent reads DMI info (e.g., "NO Asset Tag" for board asset tag) successfully before the network hang. I don't believe this would impact anything, but issues #141 and #272 lead me to think otherwise, or that some piece of agent information may be causing the problem (except that in my case it does not happen continuously).
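
For anyone who wants to repeat the check from item 8 outside the agent, here is a minimal Python sketch that opens the TLS connection and sends a plain WebSocket upgrade request with a timeout. It is not the agent's actual handshake: the headers are the generic RFC 6455 ones, and `build_upgrade_request` and `probe_upgrade` are names made up for this example. The server may well reject a non-agent client, but any HTTP status line at all (even an error) versus silence until the timeout would help show whether the hang is specific to the agent or to this machine's path to the server.

```python
import base64
import os
import socket
import ssl


def build_upgrade_request(host: str, path: str = "/agent.ashx") -> bytes:
    """Build a generic RFC 6455 upgrade request (NOT the agent's exact request)."""
    key = base64.b64encode(os.urandom(16)).decode("ascii")
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Upgrade: websocket",
        "Connection: Upgrade",
        f"Sec-WebSocket-Key: {key}",
        "Sec-WebSocket-Version: 13",
        "",  # blank line terminates the HTTP headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def probe_upgrade(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Return the server's HTTP status line, or a note on which stage stalled."""
    context = ssl.create_default_context()  # verifies the certificate and hostname
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host) as tls:
                tls.settimeout(timeout)
                tls.sendall(build_upgrade_request(host))
                try:
                    reply = tls.recv(4096)
                except socket.timeout:
                    return "TLS OK, but no reply to the upgrade request before timeout"
    except OSError as exc:  # includes ssl.SSLError and connect/handshake timeouts
        return f"TCP connect, TLS handshake, or send failed: {exc!r}"
    if not reply:
        return "server closed the connection without replying"
    # First line of the response, e.g. an HTTP status line.
    return reply.split(b"\r\n", 1)[0].decode("latin-1")
```

Calling `probe_upgrade("[YOUR_MESH_SERVER_HOSTNAME]")` on the affected machine and on a working one, then comparing the returned strings, should be enough. A timeout only on the affected machine would point away from the agent binary itself.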

PoptopL avatar May 29 '25 04:05 PoptopL

Let's Encrypt

Can you verify whether the SSL certificate is RSA or ECDSA? This seems to be very common at the moment, but ECDSA isn't supported! The certificate must be RSA!

si458 avatar May 29 '25 07:05 si458

The certificate is RSA, as shown in the openssl s_client output:

Peer signature type: RSA-PSS
Server public key is 4096 bit

However, the issue I am having only affects one specific device. I have other Ubuntu machines on the same network that connect just fine.
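
In case it helps others answer the RSA/ECDSA question quickly, below is a small Python sketch that pulls the two relevant lines out of saved `openssl s_client` output. It is only a heuristic under stated assumptions: it relies on the "Peer signature type" and "Server public key is ... bit" lines appearing as they do in the output quoted here (other OpenSSL versions may print them differently or not at all), and `summarize_s_client` is a made-up helper name.

```python
import re


def summarize_s_client(output: str) -> dict:
    """Extract key-type hints from the text printed by `openssl s_client`."""
    sig = re.search(r"^Peer signature type:\s*(\S+)", output, re.MULTILINE)
    bits = re.search(r"^Server public key is (\d+) bit", output, re.MULTILINE)
    sig_type = sig.group(1) if sig else None

    # Classify by the prefix of the signature type; anything else is left vague.
    if sig_type is None:
        family = "unknown"
    elif sig_type.upper().startswith("RSA"):
        family = "RSA"
    elif sig_type.upper().startswith("ECDSA"):
        family = "ECDSA"
    else:
        family = "other"

    return {
        "signature_type": sig_type,
        "key_bits": int(bits.group(1)) if bits else None,
        "key_family": family,
    }


sample = "Peer signature type: RSA-PSS\nServer public key is 4096 bit\n"
print(summarize_s_client(sample))
```

For the two lines I quoted, this reports an RSA key family with 4096 bits, which matches what I read from the output by hand.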

PoptopL avatar May 29 '25 13:05 PoptopL

Is this still an issue, or can it be closed?

si458 avatar Jun 21 '25 10:06 si458