long route recovery on sleep device

Open iwanbk opened this issue 1 year ago • 1 comments

I tested the Flutter mycelium macOS app and found that it has an issue when the laptop is put to sleep (in my case, after more than 2 minutes of inactivity).

My test setup:

macOS mycelium app on a MacBook Pro
mycelium on Linux, on which I ran ping

I expected the ping to recover soon after the MacBook woke up. But many times it had still not recovered even after 10 minutes.

I thought the problem was on the Flutter side, so I implemented a restart of mycelium on the wake-up event.

Then I also tested it using my own public node, and the recovery times were significantly faster, often only around 5 seconds. (After a few tests, the recovery time got longer. Restarting mycelium on my public node solved the issue.)

I copied the logs from https://github.com/threefoldtech/myceliumflut/issues/69#issuecomment-2387808820 here:

macOS app side, at minute 28:26:

"Frame error from TCP 192.168.1.6:55080 <-> 68.183.228.64:9651: Connection reset by peer (os error 54)

At minute 28:31:

Connected to new peer

On my own public node, the app had already connected:

2024-10-02T07:28:31.439602Z  INFO mycelium::peer_manager: Accepted new inbound peer
2024-10-02T07:28:31.439650Z  INFO add_peer: mycelium::peer_manager: Added new peer peer.endpoint=Tcp [::ffff:36.80.99.115]:55229
2024-10-02T07:28:31.464380Z  INFO mycelium::router: Acquired route subnet=525:c933:ef2e:bfe7::/64 peer="TCP [::ffff:68.183.228.64]:9651 <-> [::ffff:36.80.99.115]:55229"

Linux

But the route on Linux did not recover immediately:

2024-10-02T07:30:38.961313Z  INFO mycelium::router: Acquired route subnet=525:c933:ef2e:bfe7::/64 peer="TCP 192.168.0.108:53200 <-> 68.183.228.64:9651"

Oct 02 '24 08:10 iwanbk

I'm not entirely sure what happens when the device goes to sleep. But basically, mycelium sends a "HELLO" message on every connection every 20 seconds, to which the receiver replies with an "IHU". If the receiver fails to reply twice (that is, the last IHU was received more than 43 seconds ago, IIRC), the connection is assumed to be dead, and either the peer is cleaned up (if we did not initiate the connection) or we try to reconnect. On top of this, for TCP there is also the TCP keepalive, which should close the connection automatically once keepalives stop going through. This should also be detected, leading to a similar scenario as described above.
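To make the timeout rule above concrete, here is a minimal sketch of the liveness check. It is not mycelium's actual code: `PeerLiveness`, `DeadPeerAction`, and `IHU_TIMEOUT` are hypothetical names, and the 43 second cutoff is taken from the recollection above rather than from the source tree.

```rust
use std::time::{Duration, Instant};

/// Assumed cutoff: HELLOs go out every 20 seconds, and the comment above
/// recalls a limit of roughly 43 seconds since the last IHU.
const IHU_TIMEOUT: Duration = Duration::from_secs(43);

/// What to do with a connection that is considered dead (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum DeadPeerAction {
    /// We initiated the connection, so we try to reconnect.
    Reconnect,
    /// The peer connected to us, so we just clean it up.
    Remove,
}

/// Per-connection liveness state (hypothetical type).
struct PeerLiveness {
    last_ihu: Instant,
    we_initiated: bool,
}

impl PeerLiveness {
    fn new(now: Instant, we_initiated: bool) -> Self {
        PeerLiveness { last_ihu: now, we_initiated }
    }

    /// Record that an IHU reply arrived.
    fn on_ihu(&mut self, now: Instant) {
        self.last_ihu = now;
    }

    /// Returns None while the connection is alive, or the action to take
    /// once the last IHU is older than the timeout.
    fn check(&self, now: Instant) -> Option<DeadPeerAction> {
        if now.saturating_duration_since(self.last_ihu) <= IHU_TIMEOUT {
            return None;
        }
        Some(if self.we_initiated {
            DeadPeerAction::Reconnect
        } else {
            DeadPeerAction::Remove
        })
    }
}
```

One thing this sketch suggests for the sleep case: if the process is suspended, a check like this can only fire after wake-up, and how a monotonic clock accounts for time spent asleep is platform dependent, so detection may lag behind the actual disconnect.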

Once the node reconnects, if the process did not exit, it will likely have the exact same router ID. Depending on how long the node was disconnected, there might still be a source key for this router ID alive in the peer's source key table. This will prevent the node's announcements from being accepted if the metric is higher. In that case, a seqno request should be sent, which would cause the node to bump its local seqno and resend its routes.
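The source key behaviour described above resembles the feasibility condition of the Babel routing protocol (RFC 8966). The sketch below assumes mycelium follows that rule. `SourceKey`, `Feasibility`, `Verdict`, and `SourceTable` are hypothetical names for illustration, not mycelium's types.

```rust
use std::collections::HashMap;

/// Hypothetical key: one entry per (subnet, router ID) pair.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct SourceKey {
    subnet: String,
    router_id: u64,
}

/// Best (seqno, metric) pair seen so far for a source key.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Feasibility {
    seqno: u16,
    metric: u16,
}

/// Outcome of checking an announcement against the table.
#[derive(Debug, PartialEq, Eq)]
enum Verdict {
    /// The announcement is feasible and can be used.
    Accept,
    /// Not feasible: ask the origin to bump its seqno to at least this value.
    RequestSeqno(u16),
}

/// Babel-style seqno comparison modulo 2^16: true if `a` is newer than `b`.
fn seqno_newer(a: u16, b: u16) -> bool {
    let diff = a.wrapping_sub(b);
    diff != 0 && diff < 0x8000
}

struct SourceTable {
    entries: HashMap<SourceKey, Feasibility>,
}

impl SourceTable {
    fn check(&self, key: &SourceKey, update: Feasibility) -> Verdict {
        match self.entries.get(key) {
            // No live entry (for example, it expired): anything is feasible.
            None => Verdict::Accept,
            Some(known) => {
                let feasible = seqno_newer(update.seqno, known.seqno)
                    || (update.seqno == known.seqno && update.metric < known.metric);
                if feasible {
                    Verdict::Accept
                } else {
                    // Same seqno with an equal or higher metric: rejected
                    // until the origin announces with a newer seqno.
                    Verdict::RequestSeqno(known.seqno.wrapping_add(1))
                }
            }
        }
    }
}
```

Under this assumption, a node that comes back with the same router ID and seqno but a worse metric stays filtered out until the seqno request round trip completes, which is one possible place for extra recovery delay.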

In general, reconnection time is expected to be around 5 seconds, as you see with your own public node. If you run that in debug mode, you might find some clues as to why it sometimes takes longer than 5 seconds. Checking the metrics could also help.

Oct 18 '24 10:10 LeeSmet