bones icon indicating copy to clipboard operation
bones copied to clipboard

Matchmaker eventually stops accepting connections after being deployed

Open MaxCWhitehead opened this issue 1 year ago • 3 comments

I've had test matchmaker deployed for ~4 days, Jumpy now times out when trying to connect. I think I've hit this before and typically restarting the container seems to resolve, going to try to debug before restarting.

2024-08-02T03:36:26.643321Z  INFO bones_framework::networking::online: Starting search for online game with player count 2
2024-08-02T03:36:26.643897Z  INFO bones_framework::networking::online: Connecting to online matchmaker
2024-08-02T03:36:26.974294Z  INFO magic_ep{me=lgui4qsh7nmonau5}:magicsock{me=lgui4qsh7nmonau5}:actor: iroh_net::magicsock: home is now relay https://use1-1.relay.iroh.network./, was None
2024-08-02T03:36:26.974384Z  INFO pkarr_publish{me=lgui4qsh7nmonau5}: iroh_net::discovery::pkarr: Publish node info to pkarr relay_url=Some("https://use1-1.relay.iroh.network./")
2024-08-02T03:36:26.974468Z  INFO magic_ep{me=lgui4qsh7nmonau5}:magicsock{me=lgui4qsh7nmonau5}:relay-actor: iroh_net::magicsock::relay_actor: adding connection to relay: https://use1-1.relay.iroh.network./ for home-keep-alive
2024-08-02T03:36:26.998492Z  INFO discovery{me=lgui4qsh7nmonau5 node=e7ubfz7cuqcx4xpr}:add_node_addr{me=lgui4qsh7nmonau5}:add_node_addr{node=e7ubfz7cuqcx4xpr}: iroh_net::magicsock::node_map: inserting new node in NodeMap node=e7ubfz7cuqcx4xpr relay_url=Some(RelayUrl("https://use1-1.relay.iroh.network./")) source=discovery
2024-08-02T03:36:26.999192Z  INFO poll_send{me=lgui4qsh7nmonau5}:get_send_addrs{node=e7ubfz7cuqcx4xpr}: iroh_net::magicsock::node_map::node_state: new connection type typ=relay
2024-08-02T03:36:37.000977Z  WARN bones_framework::networking::online: Online Game Search failed: failed connecting to provider

Caused by:
    timed out

MaxCWhitehead avatar Aug 02 '24 03:08 MaxCWhitehead

Haven't identified the issue yet here with additional logging - setup a client that attempts connection to mathcmaker every 5 seconds, and monitoring with tokio-console. Hoping to pin down exact timestamp that it breaks to see if anything more subtle happens in logs. Otherwise maybe will see a change in the async state that points towards an issue.

MaxCWhitehead avatar Aug 18 '24 18:08 MaxCWhitehead

Thanks for your patience on this! And many thanks for the 1G of logfile you sent us. It allowed me to figure out what was going on and if I have it right this should be the fix: https://github.com/n0-computer/iroh/pull/2782

It makes sense that this happens only with your dev matchmaker: because it depends on not much traffic happening. A matchmaker that constantly has remote nodes connecting and disconnecting would be much less likely to run into this.

flub avatar Oct 03 '24 17:10 flub

@flub

Yeah that checks out - I was probing it repeatedly for a while to see if could pin point a timestamp of exact failure, and during this test it never failed. Definitely points to reproducing when idle.

Thank you very much for digging into this and doing autopsy on the logs. Nice detective work!

Will circle back in the future once this is merged + we deploy it and can close the loop on confirming the fix.

MaxCWhitehead avatar Oct 04 '24 01:10 MaxCWhitehead

Updated bones version to 0.27 - thanks again @flub!

MaxCWhitehead avatar Oct 26 '24 19:10 MaxCWhitehead