Long route recovery after device sleep
I tested the Flutter mycelium macOS app and found that it has an issue when the laptop is put to sleep (in my case: after >2 minutes of inactivity).
My test setup:
- macOS mycelium app on a MacBook Pro
- mycelium on Linux, on which I ran `ping`
I expected the ping to recover soon after the MacBook woke up.
But many times it had still not recovered even after 10 minutes.
I thought the problem was on the Flutter side, so I implemented a restart of mycelium on the wake-up event.
Then I also tested it using my own public node, and the recovery times were significantly faster, often only around 5 seconds.
(After a few tests, the recovery time got longer. Restarting mycelium on my public node solved the issue.)
I copied the logs from https://github.com/threefoldtech/myceliumflut/issues/69#issuecomment-2387808820 here:
macOS app side, at minute 28:26:
Frame error from TCP 192.168.1.6:55080 <-> 68.183.228.64:9651: Connection reset by peer (os error 54)
at minute 28:31:
Connected to new peer
My own public node (the app is already connected to it again):
2024-10-02T07:28:31.439602Z INFO mycelium::peer_manager: Accepted new inbound peer
2024-10-02T07:28:31.439650Z INFO add_peer: mycelium::peer_manager: Added new peer peer.endpoint=Tcp [::ffff:36.80.99.115]:55229
2024-10-02T07:28:31.464380Z INFO mycelium::router: Acquired route subnet=525:c933:ef2e:bfe7::/64 peer="TCP [::ffff:68.183.228.64]:9651 <-> [::ffff:36.80.99.115]:55229"
Linux
- But the route on Linux did not recover instantly:
2024-10-02T07:30:38.961313Z INFO mycelium::router: Acquired route subnet=525:c933:ef2e:bfe7::/64 peer="TCP 192.168.0.108:53200 <-> 68.183.228.64:9651"
I'm not entirely sure what happens when the device goes to sleep. But basically mycelium sends a "HELLO" message on every connection every 20 seconds, to which the receiver replies with "IHU". If the receiver fails to reply 2 times (basically, the last IHU was received more than 43 seconds ago, iirc), the connection is assumed to be dead, and either the peer is cleaned up (if we did not initiate the connection) or we try to reconnect. On top of this, for TCP there is also the TCP keepalive, which should close the connection automatically once keepalive probes go unanswered. This should also be detected, leading to a similar scenario as described above.
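To make the liveness rule above concrete, here is a minimal sketch in Rust. It is not the actual mycelium code: the type and function names (`PeerState`, `check_peer`, `Action`) are made up for illustration, and the 43 second timeout is the "iirc" value from above, so treat it as an assumption.

```rust
use std::time::{Duration, Instant};

/// Assumed timeout: no IHU for longer than this means the connection is
/// considered dead (roughly two missed replies to the 20 second HELLO).
const DEAD_PEER_TIMEOUT: Duration = Duration::from_secs(43);

/// Who initiated the connection (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Direction {
    Inbound,
    Outbound,
}

/// What to do with a peer after a liveness check (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum Action {
    Keep,
    Remove,
    Reconnect,
}

/// Minimal per-peer state needed for the check (hypothetical type).
struct PeerState {
    direction: Direction,
    last_ihu: Instant,
}

/// Decide what to do with a peer at time `now`.
fn check_peer(peer: &PeerState, now: Instant) -> Action {
    // Still within the timeout: the connection is considered alive.
    if now.saturating_duration_since(peer.last_ihu) <= DEAD_PEER_TIMEOUT {
        return Action::Keep;
    }
    match peer.direction {
        // We did not initiate the connection, so just clean the peer up.
        Direction::Inbound => Action::Remove,
        // We initiated the connection, so try to reconnect.
        Direction::Outbound => Action::Reconnect,
    }
}
```

Passing `now` in explicitly keeps the check easy to test without waiting for real time to pass.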
Once the node reconnects, if the process did not exit, it will likely have the exact same router ID. Depending on how long the node was disconnected for, there might still be a source key for this router ID alive in the peer's source key table. This will prevent the node's announcements from being accepted if the metric is higher. In that case, a seqno request should be sent, which would cause the node to bump its local seqno and resend its routes.
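As a rough illustration of the source key check described above, here is a sketch assuming a Babel-style feasibility condition (RFC 8966). The names (`SourceTable`, `FeasibilityDistance`, `Decision`) and the simplified key of subnet string plus numeric router ID are hypothetical and do not mirror the real mycelium types.

```rust
use std::collections::HashMap;

/// Best (seqno, metric) pair seen so far for a source key (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct FeasibilityDistance {
    seqno: u16,
    metric: u16,
}

/// Outcome of checking an incoming route announcement (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum Decision {
    /// The announcement is feasible and can be used.
    Accept,
    /// The announcement is not feasible; ask the origin for this seqno.
    RequestSeqno(u16),
}

/// Returns true if `a` is newer than `b`, comparing modulo 2^16.
fn seqno_newer(a: u16, b: u16) -> bool {
    a != b && a.wrapping_sub(b) < 0x8000
}

/// Source key table, keyed by (subnet, router ID). Simplified key types.
struct SourceTable {
    entries: HashMap<(String, u64), FeasibilityDistance>,
}

impl SourceTable {
    /// Check an announcement against the stored feasibility distance.
    fn on_update(&self, subnet: &str, router_id: u64, seqno: u16, metric: u16) -> Decision {
        match self.entries.get(&(subnet.to_string(), router_id)) {
            // No source key left: nothing blocks the announcement.
            None => Decision::Accept,
            // A newer seqno is always feasible.
            Some(fd) if seqno_newer(seqno, fd.seqno) => Decision::Accept,
            // Same seqno is feasible only with a strictly lower metric.
            Some(fd) if seqno == fd.seqno && metric < fd.metric => Decision::Accept,
            // Otherwise the stale source key blocks the route until the
            // origin bumps its seqno, so request the next one.
            Some(fd) => Decision::RequestSeqno(fd.seqno.wrapping_add(1)),
        }
    }
}
```

In this model, a node that reconnects with the same router ID and seqno but a higher metric stays blocked until the seqno request has travelled to it and the re-announced routes have come back, which is one place where extra recovery time could come from.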
In general, reconnection time is expected to be around 5 seconds, as you see with your own public node. If you run that node in debug mode, you might find some clue as to why it sometimes takes longer than 5 seconds. Checking the metrics could also help.