[Bug]: Some clients do not check in properly
Contact Details
What happened?
I built a small mesh network of about 8 nodes. Some of the nodes (in China, perhaps behind the GFW) join the network without any problem, and all nodes can ping each other. But after a while, the status of every Chinese node changes first to warning and then to error. I looked at the netclient.service logs on an error node, and they differ from those on a normal node. If I manually run netclient pull, the error node becomes healthy again, but after a while it goes back to warning and then error. I don't know what the problem is. For now I have written a shell loop that runs netclient pull, but that is not a good solution. Could someone help me solve this problem?
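A stopgap loop of the kind described above might look roughly like the sketch below. This is only an illustration under assumptions: the names `PULL_CMD`, `INTERVAL`, `MAX_RUNS` and `pull_loop` are made up for this example and are not netclient options, and the 60-second interval is an arbitrary choice.

```shell
#!/bin/sh
# Stopgap sketch: re-run `netclient pull` at a fixed interval.
# PULL_CMD, INTERVAL and MAX_RUNS are illustrative names (assumptions),
# not settings that netclient itself reads.
PULL_CMD="${PULL_CMD:-netclient pull}"
INTERVAL="${INTERVAL:-60}"   # seconds between pulls (arbitrary default)
MAX_RUNS="${MAX_RUNS:-0}"    # 0 means loop forever

pull_loop() {
    runs=0
    while :; do
        # A failed pull is reported on stderr but does not stop the loop.
        $PULL_CMD || echo "pull failed with exit code $?" >&2
        runs=$((runs + 1))
        if [ "$MAX_RUNS" -gt 0 ] && [ "$runs" -ge "$MAX_RUNS" ]; then
            break
        fi
        sleep "$INTERVAL"
    done
}
```

To use it, append a call to `pull_loop` as the last line of the script and run the script as root. Keeping the loop in a function with a run limit makes it easy to try with a harmless command first, for example `PULL_CMD="echo pulled" INTERVAL=1 MAX_RUNS=3`.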
logs of normal working nodes:
● netclient.service - Netclient Daemon
Loaded: loaded (/etc/systemd/system/netclient.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2022-03-30 01:46:40 CEST; 12h ago
Docs: https://docs.netmaker.org
https://k8s.netmaker.org
Main PID: 1007729 (netclient)
Tasks: 10 (limit: 9509)
Memory: 18.1M
CPU: 11.231s
CGroup: /system.slice/netclient.service
└─1007729 /sbin/netclient daemon
Mar 30 14:04:42 debian11 netclient[1007729]: [netclient] 2022-03-30 14:04:42 received peer update for node de-pve-debian11 wg-mesh
Mar 30 14:04:43 debian11 netclient[1007729]: [netclient] 2022-03-30 14:04:43 received peer update for node hard-zombie E3UAQeqA
Mar 30 14:08:45 debian11 netclient[1007729]: [netclient] 2022-03-30 14:08:45 received peer update for node de-pve-debian11 wg-mesh
Mar 30 14:09:48 debian11 netclient[1007729]: [netclient] 2022-03-30 14:09:48 received peer update for node de-pve-debian11 wg-mesh
Mar 30 14:11:50 debian11 netclient[1007729]: [netclient] 2022-03-30 14:11:50 received peer update for node de-pve-debian11 wg-mesh
Mar 30 14:14:51 debian11 netclient[1007729]: [netclient] 2022-03-30 14:14:51 received peer update for node hard-zombie E3UAQeqA
Mar 30 14:14:53 debian11 netclient[1007729]: [netclient] 2022-03-30 14:14:53 received peer update for node de-pve-debian11 wg-mesh
Mar 30 14:15:54 debian11 netclient[1007729]: [netclient] 2022-03-30 14:15:54 received peer update for node de-pve-debian11 wg-mesh
Mar 30 14:16:57 debian11 netclient[1007729]: [netclient] 2022-03-30 14:16:57 received peer update for node de-pve-debian11 wg-mesh
Mar 30 14:17:58 debian11 netclient[1007729]: [netclient] 2022-03-30 14:17:58 received peer update for node de-pve-debian11 wg-mesh
Version
v0.12.2
What OS are you using?
Linux
Relevant log output
logs of error nodes:
● netclient.service - Netclient Daemon
Loaded: loaded (/etc/systemd/system/netclient.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2022-03-30 19:29:40 CST; 48min ago
Docs: https://docs.netmaker.org
https://k8s.netmaker.org
Main PID: 78184 (netclient)
Tasks: 9 (limit: 9510)
Memory: 18.1M
CPU: 1.569s
CGroup: /system.slice/netclient.service
└─78184 /sbin/netclient daemon
Mar 30 19:29:40 debian11 systemd[1]: Started Netclient Daemon.
Mar 30 19:29:40 debian11 netclient[78184]: [netclient] 2022-03-30 19:29:40 pulling latest config for E3UAQeqA
Mar 30 19:29:45 debian11 netclient[78184]: [netclient] 2022-03-30 19:29:45 waiting for interface...
Mar 30 19:29:45 debian11 netclient[78184]: [netclient] 2022-03-30 19:29:45 interface ready - netclient.. ENGAGE
Mar 30 19:29:47 debian11 netclient[78184]: [netclient] 2022-03-30 19:29:47 pulling latest config for wg-mesh
Mar 30 19:29:53 debian11 netclient[78184]: [netclient] 2022-03-30 19:29:53 waiting for interface...
Mar 30 19:29:53 debian11 netclient[78184]: [netclient] 2022-03-30 19:29:53 interface ready - netclient.. ENGAGE
Mar 30 19:29:55 debian11 netclient[78184]: [netclient] 2022-03-30 19:29:55 started comms network daemon, E3UAQeqA
Mar 30 19:29:55 debian11 netclient[78184]: [netclient] 2022-03-30 19:29:55 netclient daemon started for network: E3UAQeqA
Contributing guidelines
- [X] Yes, I did.
I have the same fault
Same problem: netclient does not pull the config automatically in v0.12.2.
I've discovered a similar issue on one of our Server 2012 R2 machines. What I found is that whenever the node loses internet access and disconnects from MQTT, it does not reconnect properly once the internet returns, so the node shows as offline even though you can ping it without any problem. Simply restarting the netclient service returns it to normal.
> Some of the nodes (in China, perhaps behind the GFW) join the network without any problem, and all nodes can ping each other. But after a while, the status of every Chinese node changes first to warning and then to error. I looked at the netclient.service logs on an error node, and they differ from those on a normal node.
Can confirm. I met the same problem.
I ended up just adding a systemd timer, similar to how it was done in v0.9.x; for some reason it is no longer present. The commit that removed it as part of #645: https://github.com/gravitl/netmaker/commit/443ed80e4d27d208134795e603aa8f166f7af017
Fix:
sudo nano /etc/systemd/system/netclient-pull.service
[Unit]
Description=Network Check
Wants=netclient.timer
[Service]
Type=simple
ExecStart=/usr/sbin/netclient pull -n all
[Install]
WantedBy=multi-user.target
sudo nano /etc/systemd/system/netclient.timer
[Unit]
Description=Calls the Netmaker Mesh Client Service
Requires=netclient.service
[Timer]
Unit=netclient-pull.service
OnCalendar=*:*:0/15
[Install]
WantedBy=timers.target
sudo systemctl enable netclient.timer
sudo systemctl start netclient.timer
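A note on the schedule: in systemd calendar syntax, the seconds field `0/15` means "start at second 0, then every 15 seconds", so the timer above runs the pull four times a minute. The sketch below spells that out. `expand_seconds` is a hypothetical helper written only for this illustration; it handles just the simple `start/step` form and is not part of systemd or netclient.

```shell
#!/bin/sh
# Illustration only: expand a "start/step" seconds field (as used in
# OnCalendar=*:*:0/15) into the seconds of each minute it matches.
# expand_seconds is a made-up helper, not a systemd or netclient tool.
expand_seconds() {
    start="${1%%/*}"   # text before the slash
    step="${1##*/}"    # text after the slash
    if [ "$step" -le 0 ]; then
        echo "step must be positive" >&2
        return 1
    fi
    s="$start"
    out=""
    while [ "$s" -lt 60 ]; do
        out="${out}${out:+ }${s}"
        s=$((s + step))
    done
    echo "$out"
}

expand_seconds "0/15"   # prints: 0 15 30 45
```

On a machine with systemd, `systemd-analyze calendar '*:*:0/15'` should report the normalized expression and the next time it elapses, and `systemctl list-timers netclient.timer` shows whether the timer is actually scheduled.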
#841 might be related, but I didn't have the mentioned logs with "invalid message from broker".
Same with Netmaker server 0.14.1 running on Docker. It worked perfectly after adding 4 nodes. Issues began when I added a Windows 10 node (it caused a severe network slowdown on that machine and had to be removed). Since then, almost every node I add brings this issue. Restarting and reinstalling the client does not help. I will try reinstalling the server if the issues persist, worsen, or inhibit my use case.