[Bug]: some clients do not check in properly

Open FaintGhost opened this issue 3 years ago • 7 comments

Contact Details

[email protected]

What happened?

I built a small mesh network with about 8 nodes. Some of the nodes are in China, perhaps behind the GFW. They join the network without problems, and all nodes can ping each other. After a while, however, the status of every Chinese node changes first to warning and then to error. The netclient.service logs on an error node differ from those on a normal node (both are shown below). If I manually run netclient pull, the error node becomes healthy again, but after a while it goes back to warning and then error. I don't know what the problem is. For now I have written a shell loop that runs netclient pull repeatedly, but that is not a good solution. Could someone help me solve this problem?
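The loop mentioned above is not shown in the report. A minimal sketch of what such a workaround might look like follows; it is an illustration under stated assumptions, not the reporter's actual script. The names PULL_CMD, INTERVAL and MAX_RUNS, and the 60-second default, are hypothetical knobs introduced only for this sketch. The only netclient invocation used is the "netclient pull" command named in the report.

```shell
#!/bin/sh
# Sketch of a periodic "netclient pull" workaround loop.
# PULL_CMD, INTERVAL and MAX_RUNS are hypothetical knobs for this sketch,
# not netclient options.
PULL_CMD="${PULL_CMD:-netclient pull}"
INTERVAL="${INTERVAL:-60}"   # seconds between pulls (assumed value)
MAX_RUNS="${MAX_RUNS:-0}"    # 0 = loop until interrupted

pull_loop() {
    runs=0
    while [ "$MAX_RUNS" -eq 0 ] || [ "$runs" -lt "$MAX_RUNS" ]; do
        # Run the pull; report a failure on stderr but keep looping.
        $PULL_CMD || echo "pull failed (exit $?)" >&2
        runs=$((runs + 1))
        sleep "$INTERVAL"
    done
}
```

Calling pull_loop at the end of the script starts the loop. A loop like this only masks the missed check-ins and has to be kept alive separately, which is probably why it does not feel like a good solution.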

logs of normal working nodes:

● netclient.service - Netclient Daemon
     Loaded: loaded (/etc/systemd/system/netclient.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2022-03-30 01:46:40 CEST; 12h ago
       Docs: https://docs.netmaker.org
             https://k8s.netmaker.org
   Main PID: 1007729 (netclient)
      Tasks: 10 (limit: 9509)
     Memory: 18.1M
        CPU: 11.231s
     CGroup: /system.slice/netclient.service
             └─1007729 /sbin/netclient daemon

Mar 30 14:04:42 debian11 netclient[1007729]: [netclient] 2022-03-30 14:04:42 received peer update for node de-pve-debian11 wg-mesh
Mar 30 14:04:43 debian11 netclient[1007729]: [netclient] 2022-03-30 14:04:43 received peer update for node hard-zombie E3UAQeqA
Mar 30 14:08:45 debian11 netclient[1007729]: [netclient] 2022-03-30 14:08:45 received peer update for node de-pve-debian11 wg-mesh
Mar 30 14:09:48 debian11 netclient[1007729]: [netclient] 2022-03-30 14:09:48 received peer update for node de-pve-debian11 wg-mesh
Mar 30 14:11:50 debian11 netclient[1007729]: [netclient] 2022-03-30 14:11:50 received peer update for node de-pve-debian11 wg-mesh
Mar 30 14:14:51 debian11 netclient[1007729]: [netclient] 2022-03-30 14:14:51 received peer update for node hard-zombie E3UAQeqA
Mar 30 14:14:53 debian11 netclient[1007729]: [netclient] 2022-03-30 14:14:53 received peer update for node de-pve-debian11 wg-mesh
Mar 30 14:15:54 debian11 netclient[1007729]: [netclient] 2022-03-30 14:15:54 received peer update for node de-pve-debian11 wg-mesh
Mar 30 14:16:57 debian11 netclient[1007729]: [netclient] 2022-03-30 14:16:57 received peer update for node de-pve-debian11 wg-mesh
Mar 30 14:17:58 debian11 netclient[1007729]: [netclient] 2022-03-30 14:17:58 received peer update for node de-pve-debian11 wg-mesh

Version

v0.12.2

What OS are you using?

Linux

Relevant log output

logs of error nodes:

● netclient.service - Netclient Daemon
     Loaded: loaded (/etc/systemd/system/netclient.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2022-03-30 19:29:40 CST; 48min ago
       Docs: https://docs.netmaker.org
             https://k8s.netmaker.org
   Main PID: 78184 (netclient)
      Tasks: 9 (limit: 9510)
     Memory: 18.1M
        CPU: 1.569s
     CGroup: /system.slice/netclient.service
             └─78184 /sbin/netclient daemon

Mar 30 19:29:40 debian11 systemd[1]: Started Netclient Daemon.
Mar 30 19:29:40 debian11 netclient[78184]: [netclient] 2022-03-30 19:29:40 pulling latest config for  E3UAQeqA
Mar 30 19:29:45 debian11 netclient[78184]: [netclient] 2022-03-30 19:29:45 waiting for interface...
Mar 30 19:29:45 debian11 netclient[78184]: [netclient] 2022-03-30 19:29:45 interface ready - netclient.. ENGAGE
Mar 30 19:29:47 debian11 netclient[78184]: [netclient] 2022-03-30 19:29:47 pulling latest config for  wg-mesh
Mar 30 19:29:53 debian11 netclient[78184]: [netclient] 2022-03-30 19:29:53 waiting for interface...
Mar 30 19:29:53 debian11 netclient[78184]: [netclient] 2022-03-30 19:29:53 interface ready - netclient.. ENGAGE
Mar 30 19:29:55 debian11 netclient[78184]: [netclient] 2022-03-30 19:29:55 started comms network daemon,  E3UAQeqA
Mar 30 19:29:55 debian11 netclient[78184]: [netclient] 2022-03-30 19:29:55 netclient daemon started for network:  E3UAQeqA

Contributing guidelines

  • [X] Yes, I did.

FaintGhost avatar Mar 30 '22 12:03 FaintGhost

I have the same fault.

goldsoft8888 avatar Mar 31 '22 07:03 goldsoft8888

Same problem: netclient does not pull the config automatically in v0.12.2.

cx9208 avatar Apr 08 '22 02:04 cx9208

I've discovered a similar issue with one of our 'Server 2012 R2' machines. Whenever the node loses internet access and disconnects from MQTT, it does not reconnect properly when the connection returns, so the node shows as offline even though it can still be pinged without problems. Simply restarting the netclient service returns it to normal.

si458 avatar Apr 25 '22 13:04 si458

Some of the nodes (in China, perhaps behind the GFW) can join the network with no problem, and all nodes can ping each other. But after a while, the status of all Chinese nodes first becomes warning and then error. I looked at the netclient.service logs on an error node, and they are different from those on a normal node.

Can confirm. I met the same problem.

ElectronicElephant avatar Apr 26 '22 11:04 ElectronicElephant

I ended up just adding a systemd timer, similar to how it was done in v0.9.x; for some reason it is no longer present. The commit that removed it as part of #645: https://github.com/gravitl/netmaker/commit/443ed80e4d27d208134795e603aa8f166f7af017

Fix:

sudo nano /etc/systemd/system/netclient-pull.service

[Unit]
Description=Network Check
Wants=netclient.timer
[Service]
Type=simple
ExecStart=/usr/sbin/netclient pull -n all
[Install]
WantedBy=multi-user.target

sudo nano /etc/systemd/system/netclient.timer

[Unit]
Description=Calls the Netmaker Mesh Client Service
Requires=netclient.service
[Timer]
Unit=netclient-pull.service
OnCalendar=*:*:0/15
[Install]
WantedBy=timers.target

sudo systemctl enable netclient.timer

sudo systemctl start netclient.timer
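To check that the timer above is actually firing, standard systemd tooling can be used. This is a hedged sketch rather than part of the original fix; it assumes the two unit files were saved under exactly the names shown above. The OnCalendar value *:*:0/15 should resolve to every 15 seconds, which the first command can confirm.

```shell
# Show how systemd interprets the calendar expression and when it next elapses
systemd-analyze calendar '*:*:0/15'

# Confirm the timer is loaded and see its last and next trigger times
systemctl list-timers netclient.timer --no-pager

# Inspect recent runs of the pull service started by the timer
sudo journalctl -u netclient-pull.service -n 20 --no-pager
```

If the timer does not appear in the list, running sudo systemctl daemon-reload and starting the timer again is a reasonable first step.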

jacobped avatar May 01 '22 10:05 jacobped

#841 might be related, but I didn't have the mentioned logs with "invalid message from broker".

jacobped avatar May 01 '22 10:05 jacobped

Same with Netmaker server 0.14.1 running on Docker. It worked perfectly after adding 4 nodes. Issues began when I added a Windows 10 node (it caused a severe network slowdown on that machine, so it had to be removed). Since then, almost every node I add brings this issue. Restarting and reinstalling the client does not help. I will try reinstalling the server if the issues persist, worsen, or inhibit my use case.

Nexxus-LMT avatar Jun 03 '22 11:06 Nexxus-LMT