[Bug Report] Nodes Keep Disconnecting
Describe the bug:
The remote nodes periodically drop off. Sometimes they reconnect, and sometimes I need to manually restart the node's service or the UI.
- OpenSnitch version: 1.7.2 Node and GUI
- OS: Debian
- OS version: Both Sid and Stable
- Window Manager: I3 and Headless
- Kernel version: 6.17.7+deb14+1-amd64 and 6.12.48+deb13-amd64
To Reproduce:
Steps to reproduce the behavior: install the GUI and configure a remote node to connect to it. Leave the VMs connected and powered on.
Post error logs:
When checking the service that has not crashed, I get lots of the following errors:
sudo service opensnitch status
opensnitchd[1204]: Timed out while sending packet to queue channel 1313700632
opensnitchd[1204]: Timed out while sending packet to queue channel 1313700632
opensnitchd[1204]: Timed out while sending packet to queue channel 1313700632
cat /var/log/opensnitchd.log
WAR Error while pinging UI service: rpc error: code = DeadlineExceeded desc = context deadline exceeded, state: READY
WAR Error while pinging UI service: rpc error: code = DeadlineExceeded desc = context deadline exceeded, state: READY
WAR Error while pinging UI service: rpc error: code = DeadlineExceeded desc = context deadline exceeded, state: READY
WAR Error while pinging UI service: rpc error: code = DeadlineExceeded desc = context deadline exceeded, state: READY
WAR Error while pinging UI service: rpc error: code = DeadlineExceeded desc = context deadline exceeded, state: READY
ERR Subscribing to GUI rpc error: code = DeadlineExceeded desc = context deadline exceeded
ERR Connection to the UI service lost.
IMP UI connected, dispathing queued alerts: 0
getting notifications: rpc error: code = Unavailable desc = transport is closing <nil>
Connection to the UI service lost.
UI connected, dispathing queued alerts: 0
getting notifications: rpc error: code = Unavailable desc = transport is closing <nil>
Connection to the UI service lost.
UI connected, dispathing queued alerts: 0
Screenshots:
Additional context:
hi @zero77 ,
I cannot tell what's going on with these logs, I'll have to try to reproduce it.
"Error while pinging UI service" could mean that the GUI is receiving too many events, Change "Refresh interval" to something greater than 0, and see if the situation improves.
"Subscribing to GUI rpc error:" There's a hardcoded timeout of 10 seconds when connecting to the GUI, which should be more than enough. So I'd monitor what's the GUI doing. Is it stuck for some reason, consuming all the CPU?
Most of the time it is using very little CPU, though I have not checked its CPU usage at the moment the nodes drop off. I don't always notice that the nodes have dropped off until I go into the UI.
I have set the refresh rate to two seconds, but I can increase it. Let me know if you want more logs or for me to do any testing.
ok, I'll try to reproduce it, maybe we have some deadlocks when working with multiple nodes.
I think I've reproduced this behaviour. It seems that when the system loses and regains network connectivity, we subscribe correctly to the GUI, but the channel to send data to the GUI is not re-established. That explains the "Error while pinging UI service" errors.
I'll try to fix this behaviour.
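The gist of the fix is that, after the connection recovers, we need to re-open not only the subscription but also the stream used to push data to the GUI. A rough sketch of the idea in Go (the helper names are hypothetical, not the real daemon code):

package sketch

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// watchAndResubscribe waits for connectivity changes and, every time the
// connection returns to READY, re-opens BOTH the subscription and the
// event stream. Forgetting the second step is the bug described above.
func watchAndResubscribe(ctx context.Context, conn *grpc.ClientConn) {
	for {
		state := conn.GetState()
		if state == connectivity.Ready {
			// reSubscribe(ctx, conn)       // hypothetical helper
			// reOpenEventStream(ctx, conn) // hypothetical helper
		}
		// Block until the state changes or the context is cancelled.
		if !conn.WaitForStateChange(ctx, state) {
			return
		}
	}
}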
Thank you for looking into this. Just to clarify, is this the same issue for the remote nodes as well as the node running on the same device as the UI?
Also, would it be possible to implement some kind of retry in case something like this happens again in the future?
I don't think so. What's the issue with the node running on the same device? I've got 7 remote VMs with the daemon running, and a laptop with the GUI + a daemon.
On the other hand, yes, we have several retry mechanisms, mainly the ones provided by the gRPC library. On top of that, when the connection is stuck in a transient state, we force a disconnection so that a reconnection is triggered.
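For reference, this is the kind of machinery the gRPC library provides out of the box; the values below are just examples, not OpenSnitch's actual settings:

package sketch

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/backoff"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

// dialWithRetries shows gRPC's built-in keepalive pings (to detect dead
// connections) and exponential backoff between redial attempts.
func dialWithRetries(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(addr,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                20 * time.Second, // ping the server when idle
			Timeout:             5 * time.Second,  // wait this long for the ack
			PermitWithoutStream: true,
		}),
		grpc.WithConnectParams(grpc.ConnectParams{
			Backoff: backoff.DefaultConfig, // exponential backoff on redials
		}),
	)
}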
The issue I reproduced was with v1.7.2 (compiled on Arch), and it happened when the GUI stopped responding (for example when I suspend the laptop, but the remote nodes keep running). However, I've tested it with the latest sources, v1.8.0 (compiled on my machine), and I can no longer reproduce it no matter what I do.
So I was wondering if the gRPC versions used by the GUI and the daemons have something to do with this behaviour.
opensnitchd[1204]: Timed out while sending packet to queue channel 1313700632
By the way, these errors are normal if they're sporadic. If they're continuous then it indicates an issue with the daemon.
It's the same issue with both the local and remote nodes, but the remote nodes have this problem a lot more regularly. Did all the nodes disconnect together? Did you find that the nodes would reconnect automatically when you opened the laptop lid, or did you also have to restart the daemon or UI to fix it?
Because I am finding that all nodes have this problem, but not all at once.
I've found 2 issues:
- Sometimes manually changing the Server Address in the default-config.json file doesn't work, and you have to change it one more time. This is not related to this bug report.
- After opening the laptop lid and resuming the system, which is where the GUI is running, the nodes reconnect, but it seems that we're not accepting the connection on the GUI, so we don't update the status of the nodes.
But I haven't reproduced exactly what you describe. Could you set the log level to DEBUG on at least one of those machines, to see the connection behaviour to the GUI in more detail?
In any case, I'm using this opportunity to improve the management of multiple nodes.
I turned on debugging and waited until the node disconnected. These are just some of the logs showing the same error, as the log file quickly grew to over 100 MB with debugging enabled.
https://pastebin.com/CXNEVXxX
Unfortunately there isn't much information in the logs :/ I'll have to keep trying to reproduce it.
I wasn't able to identify when exactly the node disconnected. So I just provided the logs that had the same errors as last time. I can have another look through to see if I can get any more information from the logs.
I reproduced the warning Error while pinging UI service ... DeadlineExceeded in an LXC container with low resources (1 CPU core + 512 MB of RAM).
The thing is, when sending the events to the GUI, we're not only measuring how long the post to the GUI takes, but also how long it takes to collect the stats before sending them. So in some scenarios this can push us past the deadline. We should only measure the network post.
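In rough Go pseudocode (function names are hypothetical), the difference looks like this:

package sketch

import (
	"context"
	"time"
)

type stats struct{ /* ... */ }

func collectStats() stats                          { return stats{} }
func postToGUI(ctx context.Context, s stats) error { return nil }

// Before: collecting the stats eats into the RPC's time budget.
func sendStatsBefore(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, time.Second)
	defer cancel()
	s := collectStats() // counted against the deadline
	return postToGUI(ctx, s)
}

// After: collect first, then apply the deadline only to the network post.
func sendStatsAfter(ctx context.Context) error {
	s := collectStats() // not counted against the deadline
	ctx, cancel := context.WithTimeout(ctx, time.Second)
	defer cancel()
	return postToGUI(ctx, s)
}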
Let's see if I can fix at least the warning I've reproduced.
After some debugging, these errors may occur if the GUI is under load, for example when it's deleting old records from the DB in the background. The GUI seems to stop responding for a short time; not much, but enough to trigger the daemon's 1-second deadline.
This behaviour can also be reproduced by forcing the GUI to execute CPU-intensive tasks, such as monitoring listening sockets (Netstat tab) or monitoring a PID (Process dialog). Again, those tasks block the GUI only briefly, but it's enough to exceed the daemon's deadline.
If there's more than one node connected to the GUI, I guess this behaviour will occur more often. It's not related to how many stats the daemon has in memory.
Given that these warnings seem to be normal, at least in this scenario, maybe we could change the log level to Debug.
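To illustrate with a hypothetical client interface (not the daemon's generated one): the ping carries a 1-second deadline on the daemon side, so any pause on the GUI side longer than that surfaces as DeadlineExceeded even though the connection itself is healthy.

package sketch

import (
	"context"
	"time"
)

// uiClient stands in for the generated gRPC client; Ping is hypothetical.
type uiClient interface {
	Ping(ctx context.Context) error
}

// pingGUI gives the ping a 1-second budget: if the GUI is blocked for longer
// (DB cleanup, Netstat tab, Process dialog), the call fails with
// DeadlineExceeded even though the transport is fine.
func pingGUI(ctx context.Context, c uiClient) error {
	ctx, cancel := context.WithTimeout(ctx, time.Second)
	defer cancel()
	return c.Ping(ctx)
}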
Except for the first post, I have been looking at the logs on the node. Did you find the GUI logs more useful, and if so, is it worth saving them to a log file as well?
Did you manage to identify in the log when exactly the node disconnected and didn't reconnect? If so, I can have a look at that part in more detail, as my log is very large and not very helpful so far.
Unfortunately the GUI doesn't have logs, it only prints some errors or exceptions. I've been debugging it manually.
Did you manage to identify in the log when exactly the node disconnected and didn't reconnect?
Not yet. I've configured 6 VMs, 1 running the GUI, and so far I haven't seen any disconnection in 3 days. These VMs don't have much activity, though.
By the way, write down the PIDs of the daemons running on the nodes, and when you see another disconnection, check whether the PIDs are the same. One reason for the disconnections could be that the daemons are dying for some reason. I doubt it, but it's not impossible.
Use systemctl status opensnitch to check the daemon's uptime after a disconnection. If it's dying, the uptime should be low, like a few minutes.
Active: active (running) since Thu 2025-12-04 21:39:54 UTC; 19h ago
^^^^^^^^^
I've been looking through the logs a bit more, and any disconnects in the logs tend to be followed by errors, although I am not sure whether these errors correspond to the times the node stays disconnected afterwards.
https://pastebin.com/1v82jvwG
As for the PID change, I am still monitoring it, but when I last checked, the daemon's active time was two days, so I don't think it's dying.
When looking through the settings, I noticed that "Default action when the GUI is disconnected" had been set to deny.
I have now changed it to allow, to see if that makes any difference.
Not sure if it's needed, but the node configuration is the following:
sudo cat /etc/opensnitchd/default-config.json
{
"Server": {
"Address": "Server-IP:50051",
"Authentication": {
"Type": "simple",
"TLSOptions": {
"CACert": "",
"ServerCert": "",
"ClientCert": "",
"ClientKey": "",
"SkipVerify": false,
"ClientAuthType": "no-client-cert"
}
},
"LogFile": "/var/log/opensnitchd.log"
},
"DefaultAction": "deny",
"DefaultDuration": "once",
"InterceptUnknown": false,
"ProcMonitorMethod": "ebpf",
"LogLevel": 0,
"LogUTC": false,
"LogMicro": false,
"Firewall": "iptables",
"FwOptions": {
"ConfigPath": "/etc/opensnitchd/system-fw.json",
"MonitorInterval": "15s",
"QueueBypass": true
},
"Rules": {
"Path": "/etc/opensnitchd/rules/",
"EnableChecksums": false
},
"Ebpf": {
"EventsWorkers": 8,
"QueueEventsSize": 0
},
"Stats": {
"MaxEvents": 250,
"MaxStats": 25,
"Workers": 6
},
"Internal": {
"GCPercent": 100,
"FlushConnsOnStart": true
}
}
"Firewall": "iptables", ok, this is interesting. Since we added nftables support I think I've never used iptables again, except on some old servers. Why are you using iptables? compatibility, serve configuration uses iptables maybe?
Anyway, I've configured a machine with iptables and the icmp rule, to see if it causes the problem.
I've reproduced an issue with 6 nodes, where the server ends up not accepting new requests, new nodes, etc. I've explained it in the commit above. The pings fail by timeout until one of the nodes disconnects.
How many nodes do you have, @zero77? It'd be worth testing the opensnitch-ui from the latest commit: https://github.com/evilsocket/opensnitch/blob/c356c82ac79907c82d1631b2e584a0e883754942/ui/bin/opensnitch-ui
like this: ./opensnitch-ui --max-workers 100
@gustavo-iniguez-goya Thank you for having a look into this and for the pull request.
I used to have more than six nodes, but I am now down to three. After this and https://github.com/evilsocket/opensnitch/issues/1438 have been fixed, I will add more again. I am using iptables for now because of the older packages in the Debian stable repository.
Can the workers be changed in the config, or only via the command line? Also, is there a new release with this fix coming out soon, or do I need to build from source?
The workers will be configurable from the GUI. But if you want to test it, just download the opensnitch-ui from the link above, which is from the latest commit. Direct link: https://raw.githubusercontent.com/evilsocket/opensnitch/c356c82ac79907c82d1631b2e584a0e883754942/ui/bin/opensnitch-ui
Then close the current GUI, and launch the new opensnitch-ui from the command line like this: ./opensnitch-ui --max-workers 30.
No problem with using iptables, by the way. I've tested it and it was not causing this issue.
The new release is imminent and will contain all these fixes. It'll be more multi-node friendly.
I closed the old UI and tested the new one, and got the following error and a crash.
python3 opensnitch-ui --max-workers 30
Loaded network aliases from /usr/lib/python3/dist-packages/opensnitch/utils/network_aliases/network_aliases.json
~ OpenSnitch GUI - 1.7.2 ~
protobuf: 5.29.4 - grpc: 1.51.1
--------------------------------------------------
QT_AUTO_SCREEN_SCALE_FACTOR: True
gRPC Max Message Length: 8MiB
Bytes: 8388608
[server] addr: [::]:50051
Setting synchronous = NORMAL
schema version: 3
db schema is up to date
QWidget: Must construct a QApplication before a QWidget
Aborted (core dumped) python3 opensnitch-ui --max-workers 30
Oops, ok, never mind. I'll release the new version soon, which I think should fix this issue.
@zero77 v1.8.0 is out. Let's see if it fixes these issues:
https://github.com/evilsocket/opensnitch/releases/tag/v1.8.0
Thank you. I have updated the UI and all of the nodes and I will watch them to see if they still disconnect.