tunneldigger
The client can get stuck in a high-frequency retry loop despite working brokers
Sometimes a client seems to be stuck in a high-frequency retry loop, establishing a new connection every 2-5s and immediately abandoning it. https://github.com/wlanslovenija/tunneldigger/issues/143 is resolved, which means the servers no longer suffer heavy load from such a loop, but the question remains what is going on with those clients.
Unfortunately, so far I was not able to acquire a log from one of the affected nodes. @kaechele you said you also experienced this problem; were/are you able to get a logfile from the problematic node?
Unfortunately, Dr. Jung (:tada:), I have been unable to capture a log in this situation. But given the current state of the client, I would not be surprised if there is some case in which the state machine goes berserk.
Thanks. :) (However you found out about that.^^)
Yeah, I have fairly low confidence in that state machine code... but without a log I also don't really know where to start debugging this.
We have been facing this issue for some months now, but we weren't able to get access to one of the affected routers until now. We observed that this symptom always starts after the "daily" reboot and affects around 70-80 routers out of 1000.
When this happens, you see the Tunneldigger broker instantly running the "down" hook right after creating the tunnel. This is what we see on the Tunneldigger client on the router.
Sat Jun 11 13:09:25 2022 daemon.debug td-client: [broker2.ff-en.de:10122] Broker usage: 4587
Sat Jun 11 13:09:25 2022 daemon.info td-client: Selected broker1.ff-en.de:10122 as the best broker.
Sat Jun 11 13:09:25 2022 daemon.info td-client: [broker1.ff-en.de:10122] Tunnel successfully established.
Sat Jun 11 13:09:25 2022 daemon.notice netifd: Interface 'mesh_vpn' is enabled
Sat Jun 11 13:09:25 2022 daemon.notice netifd: Network device 'mesh-vpn' link is up
Sat Jun 11 13:09:25 2022 daemon.notice netifd: Interface 'mesh_vpn' has link connectivity
Sat Jun 11 13:09:25 2022 daemon.notice netifd: Interface 'mesh_vpn' is setting up now
Sat Jun 11 13:09:25 2022 daemon.notice netifd: Interface 'mesh_vpn' is now up
Sat Jun 11 13:09:26 2022 kern.info kernel: [293673.162513] batman_adv: bat0: Adding interface: mesh-vpn
Sat Jun 11 13:09:26 2022 kern.info kernel: [293673.168110] batman_adv: bat0: The MTU of interface mesh-vpn is too small (1420) to handle the transport of batman-adv packets. Packets going over this interface will be fragmented on layer2 which could impact the performance. Setting the MTU to 1532 would solve the problem.
Sat Jun 11 13:09:26 2022 kern.info kernel: [293673.192795] batman_adv: bat0: Interface activated: mesh-vpn
Sat Jun 11 13:09:26 2022 daemon.err td-client: [broker1.ff-en.de:10122] Unable to create local L2TP tunnel!
Sat Jun 11 13:09:26 2022 daemon.err td-client: [broker1.ff-en.de:10122] Connection lost.
Sat Jun 11 13:09:26 2022 kern.info kernel: [293673.348378] batman_adv: bat0: Interface deactivated: mesh-vpn
Sat Jun 11 13:09:26 2022 daemon.notice netifd: Network device 'mesh-vpn' link is down
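For what it's worth, "Unable to create local L2TP tunnel!" suggests the client failed to set up the kernel side of the tunnel, which could happen if a leftover kernel tunnel is still occupying the tunnel/session IDs, e.g. from another instance that is already running. A quick diagnostic sketch for an affected router (assuming the installed ip utility has L2TP support; BusyBox ip may not):
ip l2tp show tunnel              # any stale kernel L2TP tunnels left behind?
ip l2tp show session             # ...and their sessions
ps w | grep '[t]unneldigger'     # how many tunneldigger entries are running right now?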
But more interesting is this "ps w" output.
root@FF-WTT-Citymanagement-Wetter:~# ps wwwww
PID USER VSZ STAT COMMAND
1 root 1568 S /sbin/procd
2 root 0 SW [kthreadd]
4 root 0 IW< [kworker/0:0H]
6 root 0 IW< [mm_percpu_wq]
7 root 0 SW [ksoftirqd/0]
43 root 0 SW [oom_reaper]
90 root 0 IW< [writeback]
92 root 0 SW [kcompactd0]
93 root 0 IW< [crypto]
95 root 0 IW< [kblockd]
116 root 0 IW< [watchdogd]
129 root 0 IW [kworker/0:1]
138 root 0 SW [kswapd0]
197 root 0 SW [spi0]
292 root 0 IW< [ipv6_addrconf]
300 root 0 IW< [kworker/0:1H]
379 root 0 SWN [jffs2_gcd_mtd4]
446 root 1216 S /sbin/ubusd
447 root 924 S /sbin/askfirst /usr/libexec/login.sh
464 root 1028 S /sbin/urngd
635 root 0 IW< [cfg80211]
659 root 0 IW< [bat_events]
769 root 1296 S /sbin/logd -S 64
821 dnsmasq 1348 S /usr/sbin/dnsmasq -C /var/etc/dnsmasq.conf.cfg01411c -k -x /var/run/dnsmasq/dnsmasq.cfg01411c.pid
836 root 0 IW [kworker/0:2]
855 root 1080 S /usr/sbin/dropbear -F -P /var/run/dropbear.1.pid -p 22 -K 300 -T 3
1022 root 932 S /usr/sbin/gluon-arp-limiter
1069 root 1764 S /sbin/netifd
1148 root 932 S /usr/sbin/uradvd -i local-node -p 2a11:6c6:6000:feed::/64
1388 root 1212 S udhcpc -p /var/run/udhcpc-br-wan.pid -s /lib/netifd/dhcp.script -f -t 0 -i br-wan -x hostname:FF-WTT-Citymanagement-Wetter -C -O 121
1391 root 1052 S odhcp6c -s /lib/netifd/dhcpv6.script -t120 br-wan
1396 root 1052 S odhcp6c -s /lib/netifd/dhcpv6.script -L -m30 -t120 br-client
1531 root 1124 S /usr/sbin/gluon-radv-filterd -i br-client -c RADV_FILTER -t 20
1587 root 932 S /usr/sbin/micrond /usr/lib/micron.d
1664 root 928 S /usr/sbin/sse-multiplexd
1760 root 1364 S /usr/sbin/uhttpd -f -h /lib/gluon/status-page/www -r FF-WTT-Citymanagement-Wetter -x /cgi-bin -t 60 -T 30 -k 20 -A 1 -n 32 -N 100 -R -p 0.0.0.0:80 -p [::]:80
1930 root 1352 S /usr/sbin/dnsmasq -x /var/run/gluon-wan-dnsmasq.pid -u root -i lo -p 54 -h -r /var/gluon/wan-dnsmasq/resolv.conf
1941 root 1068 S /usr/bin/tunneldigger -u d46e0ed65cbc -i mesh-vpn -t 1 -b broker1.ff-en.de:10122 -b broker2.ff-en.de:10122 -I br-wan -a
1950 nobody 1040 S /usr/bin/tunneldigger -u d46e0ed65cbc -i mesh-vpn -t 1 -b broker1.ff-en.de:10122 -b broker2.ff-en.de:10122 -I br-wan -a
1951 nobody 1040 S /usr/bin/tunneldigger -u d46e0ed65cbc -i mesh-vpn -t 1 -b broker1.ff-en.de:10122 -b broker2.ff-en.de:10122 -I br-wan -a
2049 root 1584 S /usr/sbin/hostapd -s -P /var/run/wifi-phy0.pid -B /var/run/hostapd-phy0.conf
2073 root 1216 S< /usr/sbin/ntpd -n -N -S /usr/sbin/ntpd-hotplug -p 2a11:6c6:6000:feed:ff::1 -p 2a11:6c6:6000:feed:ff::2 -p 91.212.181.1
4095 root 0 IW [kworker/u2:0]
4204 root 2324 S /usr/bin/respondd -d /usr/lib/respondd -p 1001 -g ff02::2:1001 -i mesh0 -g ff05::2:1001 -i br-client -t 10
4519 root 1148 S /usr/sbin/dropbear -F -P /var/run/dropbear.1.pid -p 22 -K 300 -T 3
4563 root 1220 S -ash
4979 root 1068 S /usr/bin/tunneldigger -u d46e0ed65cbc -i mesh-vpn -t 1 -b broker1.ff-en.de:10122 -b broker2.ff-en.de:10122 -I br-wan -a
4980 nobody 1040 S /usr/bin/tunneldigger -u d46e0ed65cbc -i mesh-vpn -t 1 -b broker1.ff-en.de:10122 -b broker2.ff-en.de:10122 -I br-wan -a
4981 nobody 1040 S /usr/bin/tunneldigger -u d46e0ed65cbc -i mesh-vpn -t 1 -b broker1.ff-en.de:10122 -b broker2.ff-en.de:10122 -I br-wan -a
5038 root 1212 R ps wwwww
28583 root 0 IW [kworker/u2:3]
32214 root 0 IW [kworker/u2:2]
root@FF-WTT-Citymanagement-Wetter:~#
It looks like it gets started twice, and this will last until the next reboot (eventually). I suspect that some changes to how the Tunneldigger watchdog kills the process introduced this?
Kind regards, Matthias
> It looks like it gets started twice
Oh, yeah that would not be good. So this is an interesting lead indeed. In fact there are 6 entries for tunneldigger on your list? On one of my devices I see 3, so I guess that is the normal number. (Is this one line per thread? And then the main process is forking and we still see the parent process? Or something like that?)
Which watchdog changes are you talking about? We have our own separate watchdog at https://git.hacksaar.de/FreifunkSaar/gluon-ffsaar/-/tree/master/gluon-ffsaar-watchdog since the "built-in" one has proven to be insufficient, but that one does a full device reboot, so that can't be it. (In fact as a work-around I could make that watchdog initiate a reboot when it sees more than 3 instances of /usr/bin/tunneldigger...)
This would have to be a bug in `/etc/init.d/tunneldigger`, I think? That just uses /sbin/start-stop-daemon though, so it'd be a bug in there.
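One way to answer the thread-vs-process question above is to compare the Pid, Tgid and PPid of the tunneldigger entries: separate (forked) processes each have Pid == Tgid, while threads of one process share a Tgid. A hedged sketch for the BusyBox shell on such a router, assuming pgrep is available on the image:
# For each tunneldigger entry, print its PID, thread group ID and parent PID.
for pid in $(pgrep tunneldigger); do
    grep -E '^(Pid|Tgid|PPid)' /proc/$pid/status
    echo "---"
done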
> Oh, yeah that would not be good. So this is an interesting lead indeed. In fact there are 6 entries for tunneldigger on your list? On one of my devices I see 3, so I guess that is the normal number. (Is this one line per thread? And then the main process is forking and we still see the parent process? Or something like that?)
Indeed, it's the main process and one thread for each broker in the broker list.
> Which watchdog changes are you talking about? We have our own separate watchdog at https://git.hacksaar.de/FreifunkSaar/gluon-ffsaar/-/tree/master/gluon-ffsaar-watchdog since the "built-in" one has proven to be insufficient, but that one does a full device reboot, so that can't be it. (In fact as a work-around I could make that watchdog initiate a reboot when it sees more than 3 instances of /usr/bin/tunneldigger...)
Well, I stopped building firmware for our community long ago and someone else is doing it now. I am referring to the old watchdog that was inside tunneldigger before it was accepted into the Gluon repo.
It included a "killall", which I assume covered cases like this; the newer one only touches matching PIDs, as it seems.
> This would have to be a bug in `/etc/init.d/tunneldigger`, I think? That just uses /sbin/start-stop-daemon though, so it'd be a bug in there.
You mean this? https://github.com/freifunk-gluon/packages/blob/master/net/tunneldigger/files/tunneldigger.init#L79
I don't know how routers end up with two instances of tunneldigger, but it could be a starting point to look into why it happens randomly after the daily reboots.
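For context, here is a hedged illustration of the start/stop pattern such an init script typically uses (not the actual Gluon init script; the exact flags are assumptions, the pidfile path and client arguments are taken from the output above). If the stop step only signals the PID recorded in the pidfile, a stale pidfile or a second start racing the first after the daily reboot could leave an extra tunneldigger running:
# Start: background the client and record its PID in a pidfile.
start-stop-daemon -S -q -b -m -p /var/run/tunneldigger.mesh-vpn.pid \
    -x /usr/bin/tunneldigger -- -u d46e0ed65cbc -i mesh-vpn -t 1 \
    -b broker1.ff-en.de:10122 -b broker2.ff-en.de:10122 -I br-wan -a
# Stop: only the PID from that file is signalled; any other instance survives.
start-stop-daemon -K -q -p /var/run/tunneldigger.mesh-vpn.pid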
> Indeed, it's the main process and one thread for each broker in the broker list.
That's not what I see. We have 4 brokers and I see 3 processes total.
# ps | grep tunneldigger
1972 root 1068 S /usr/bin/tunneldigger -u 30b5c26e9d6b -i mesh-vpn -t 1 -b gw1.saar.freifunk.net:10001 -b gw2.saar.freifunk.net:10001 -b gw3.saar.freifunk.net:10001 -b gw4.saar.freifunk.net:10001 -I br-wan -a
1973 nobody 1040 S /usr/bin/tunneldigger -u 30b5c26e9d6b -i mesh-vpn -t 1 -b gw1.saar.freifunk.net:10001 -b gw2.saar.freifunk.net:10001 -b gw3.saar.freifunk.net:10001 -b gw4.saar.freifunk.net:10001 -I br-wan -a
1974 nobody 1040 S /usr/bin/tunneldigger -u 30b5c26e9d6b -i mesh-vpn -t 1 -b gw1.saar.freifunk.net:10001 -b gw2.saar.freifunk.net:10001 -b gw3.saar.freifunk.net:10001 -b gw4.saar.freifunk.net:10001 -I br-wan -a
OK, then this assumption was wrong. Today I have more than 100 routers affected by this problem, and the only way to keep the gateways running while they get hammered is to block them with a rate limiter.
This is how it looks on one of the gateways.
I think I will modify that tunneldigger watchdog in Gluon and add a killall to see if that fixes this mess.
Hey @RalfJung, I just wanted to let you know that we have overcome this issue by patching the tunneldigger watchdog that ships with Gluon. We couldn't investigate deeper and needed a quick solution, because lots of our refugee installations were hit by this problem. It seems that `killall -KILL tunneldigger` was our saviour here.
Kind regards
#!/usr/bin/lua
local uci = require('simple-uci').cursor()

-- Stop tunneldigger, make sure no stray instance survives the stop, then start it again.
local function restart_tunneldigger()
    os.execute('logger -t tunneldigger-watchdog "Restarting Tunneldigger."')
    os.execute('/etc/init.d/tunneldigger stop')
    os.execute('sleep 1')
    os.execute('killall -KILL tunneldigger')
    os.execute('rm -f /var/run/tunneldigger.mesh-vpn.pid')
    os.execute('sleep 5')
    os.execute('/etc/init.d/tunneldigger start')
end

-- Return the PID recorded in the pidfile, or nil if the file does not exist.
local function read_pid_file()
    local pid_file = io.open('/var/run/tunneldigger.mesh-vpn.pid', 'r')
    if not pid_file then
        return nil
    end
    local pid = pid_file:read('*l')
    pid_file:close()
    return pid
end

-- Check whether batman-adv sees any originators via the mesh-vpn interface.
local function has_mesh_vpn_neighbours()
    local handle = io.popen('batctl o', 'r')
    if not handle then
        return false
    end
    for line in handle:lines() do
        if line:find('mesh%-vpn') then
            handle:close()
            return true
        end
    end
    handle:close()
    return false
end

if uci:get_bool('tunneldigger', 'mesh_vpn', 'enabled') then
    -- Restart if the first tunneldigger PID found does not match the recorded pidfile.
    if io.popen('pgrep -x /usr/bin/tunneldigger'):read('*l') ~= read_pid_file() then
        os.execute('logger -t tunneldigger-watchdog "Process-Pid does not match with pid-File."')
        restart_tunneldigger()
        return
    end
    -- Restart if the tunnel carries no batman-adv neighbours.
    if not has_mesh_vpn_neighbours() then
        os.execute('logger -t tunneldigger-watchdog "No vpn-mesh neighbours found."')
        restart_tunneldigger()
        return
    end
end
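For completeness, a hedged usage sketch (all file names here are assumptions, not taken from the actual patch): a watchdog script like this would typically be run periodically, e.g. through the micrond instance visible in the ps output above, by dropping a crontab file under /usr/lib/micron.d:
# Hypothetical installation of the watchdog script shown above.
cp tunneldigger-watchdog.lua /usr/sbin/tunneldigger-watchdog
chmod +x /usr/sbin/tunneldigger-watchdog
# Run it once per minute via micrond (the file name is an assumption).
echo '* * * * * /usr/sbin/tunneldigger-watchdog' > /usr/lib/micron.d/tunneldigger-watchdog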
@valcryst have you upstreamed this change into Gluon itself, or do you plan to?
In Saarland we have patched our own secondary tunneldigger watchdog to count the number of tunneldigger processes and issue a reboot if there are too many of them. That also seems to help.
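A minimal sketch of what that check could look like (paraphrased, not the exact FreifunkSaar patch; the threshold of 3 reflects the "normal" entry count reported earlier in this thread):
# Count tunneldigger entries in the process list (the [/] trick keeps grep
# from matching its own command line) and reboot when there are too many.
count="$(ps w | grep -c '[/]usr/bin/tunneldigger')"
if [ "$count" -gt 3 ]; then
    logger -t tunneldigger-watchdog "Found $count tunneldigger instances, rebooting."
    reboot
fi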
@RalfJung I've sent my feedback, but I'm no coder; maybe the experts on this can find out the real cause that triggers this unwanted behaviour of multiple running tunneldiggers.