Server absent from health.petals.ml

Open Vahe1994 opened this issue 2 years ago • 2 comments

A server behind NAT stopped being detected by other peers after several days of working fine. It no longer showed up on health.petals.ml, and requests stopped coming to it. There are no errors or log messages indicating that something went wrong. The last logs from the server are attached below (image).

Vahe1994 avatar Feb 07 '23 13:02 Vahe1994

Looks like some issue with long-term circuit relays, or it could be caused by a temporarily severed internet connection on your side.

If the latter is the case, a typical failure scenario looks like this:

  • your server connects to the swarm, finds a short-list of several peers
    • it periodically pings these peers to check if they are still there
    • if a peer does not respond to a ping, it is likely dead and therefore it should be removed from the list of peers
  • your server gets disconnected from the internet
    • after some time, it begins pinging others to check if they are available (they are not b/c no internet)
    • since none of them respond, you treat them as dead and ban them, i.e. remove them from your list of peers
    • once you've banned all other peers, you are now a swarm of 1, independent of everyone else
    • therefore, you are not reachable from the public swarm and client requests never reach you
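
To make the scenario above concrete, here is a toy sketch of a peer short-list that is pruned by failed pings. `ToyPeerList`, `ping_round` and `simulate_outage` are hypothetical names made up for illustration; this is not how the actual DHT code is structured, only a model of the logic described in the list.

```python
class ToyPeerList:
    """Toy model of a peer short-list pruned by failed pings (hypothetical, not a real API)."""

    def __init__(self, peers):
        self.peers = set(peers)

    def ping_round(self, is_reachable):
        # Keep only the peers that answered the ping; everyone else is dropped.
        self.peers = {peer for peer in self.peers if is_reachable(peer)}
        return len(self.peers)


def simulate_outage():
    short_list = ToyPeerList(["peer_a", "peer_b", "peer_c"])
    online = short_list.ping_round(lambda peer: True)     # internet is up: everyone answers
    offline = short_list.ping_round(lambda peer: False)   # outage: nobody answers, all are dropped
    recovered = short_list.ping_round(lambda peer: True)  # internet is back, but nobody is left to ping
    return online, offline, recovered
```

In this model the peer counts go from 3 to 0 and stay at 0 after the connection returns, because nothing ever re-adds a peer to an empty list.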

justheuristic avatar Feb 09 '23 14:02 justheuristic

[based on a quick chat with @Vahe1994 ]

  • there was indeed a prolonged internet outage at the time
  • we can, in principle, fix this by introducing a background task that checks whether you have any peers and, if not, tries to connect to INITIAL_PEERS or some previously connected peers until you have at least 1 peer. This does not fix a local network outage.
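
A minimal sketch of what such a background check could look like, assuming the server exposes its current peer set and some way to attempt a connection. `ensure_has_peers`, `watch_peers` and the `try_connect` callback are hypothetical names used for illustration, not existing Petals or hivemind functions.

```python
import time


def ensure_has_peers(current_peers, candidates, try_connect):
    """One iteration of the check: if there are no peers, try candidates until one connects.

    `candidates` would be INITIAL_PEERS plus previously connected peers;
    `try_connect(peer)` is a hypothetical callback that returns True on success.
    """
    if current_peers:
        return current_peers  # still part of the swarm, nothing to do
    for peer in candidates:
        if try_connect(peer):
            current_peers.add(peer)
            break  # at least 1 peer is enough to rejoin the swarm
    return current_peers


def watch_peers(current_peers, candidates, try_connect, interval=30.0, rounds=None):
    """Repeat the check every `interval` seconds; `rounds=None` means run forever."""
    done = 0
    while rounds is None or done < rounds:
        ensure_has_peers(current_peers, candidates, try_connect)
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
    return current_peers
```

While the local network is down, every `try_connect` call fails and the peer set stays empty; the loop only helps once connectivity returns.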

Suggested solution: during the next sprint (T + 3-4 weeks), when we work on a way to run the Petals server in the background, we can add a check that the server is present on the health.petals.ml dashboard. If it isn't (but the dashboard is up), restart the server.
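
A rough sketch of the decision logic for such a check. How the dashboard is queried is left as an assumption: `fetch_listed_peer_ids` is a hypothetical callback that returns the peer IDs currently shown on the dashboard and raises an exception if the dashboard cannot be reached, and `restart_server` is likewise a placeholder.

```python
def should_restart(dashboard_up, listed_peer_ids, my_peer_id):
    """Restart only if the dashboard is up and this server is missing from it."""
    if not dashboard_up:
        return False  # no evidence either way, so leave the server alone
    return my_peer_id not in listed_peer_ids


def watchdog_step(fetch_listed_peer_ids, my_peer_id, restart_server):
    """Run one check; returns True if a restart was triggered.

    All three arguments are hypothetical hooks supplied by the caller.
    """
    try:
        listed = fetch_listed_peer_ids()
    except Exception:
        # Dashboard down or unreachable (this includes a local outage): do nothing.
        return False
    if should_restart(True, listed, my_peer_id):
        restart_server()
        return True
    return False
```

Treating a failed dashboard request as "dashboard is down" means a local internet outage never triggers a restart by itself; the restart happens on the first check after connectivity returns, if the server is still missing from the dashboard.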

justheuristic avatar Feb 09 '23 14:02 justheuristic