js-libp2p Possible relay discovery defect

Version: 1.9.2

Platform: Darwin / Linux

Subsystem: Circuit relay transport

Severity: High

Description:

I have a p2p network of some relay and non-relay nodes. The non-relay nodes are configured to discover 3 relays, configured both with and without reservation concurrency. For some reason (maybe connection issues, or anything else - it's irrelevant for the main issue), the connection between nodes and relays breaks, and then after a time, the node is disconnected from all of the relays.

After checking the logs and digging into the implementation of the circuit relay transport, I found that there is an already-implemented relay discovery mechanism, and it works this way in simple terms:

When a relay disconnects, check the number of connected relays against discoverRelays
If not enough relays are connected, start the discovery process, and grab discovery lock, by setting a running flag in RelayDiscovery instance
After discovering relays, try to connect them until we reach discoverRelays
If enough relays are connected, release the lock, letting the discovery to be run again on relays disconnection

I think there is a critical issue here. If for any reason, the discovery doesn't discover enough relays, or we cannot connect the discovered ones, the relays count keeps under the discoverRelays, while the discovery is locked and cannot be run again. Over time, relays disconnect one by one, and we don't reconnect to them.

As an example:

Suppose a discoverRelays of 3
We connect to 3 relays
We disconnect from 2 for some reason
The discovery starts, and finds 4 new relays
For some reason, we only connect to 1 of 4, resulting in 1 new + 1 old connected relays, which is below 3 threshold, and so the discovery won't be run again
The process continues, until no relays are connected

I wonder why other possible solutions are not implemented, for example, changing the condition with which the lock is released, or running the discovery periodically.

Steps to reproduce the error:

It is clarified in description. There are some other configurations that may (or may not) affect the scenario, though. As an example, the auto dial should be disabled.

Aug 31 '24 09:08 mkermani144

I found out that the random walk part of the discovery is essentially an infinite async iterable (until the relevant signal is aborted), so it's not exactly the case that I described in my previous comment. By the way, shouldn't we add peer-routing to the list of the transport dependencies if discoverRelay > 0, as of identify? In addition, the peer store search part of the discovery is still run once, which I think, may cause problems. It's possibility is probably quite rare, though.

Sep 01 '24 10:09 mkermani144

Over time, relays disconnect one by one, and we don't reconnect to them. the random walk part of the discovery is essentially an infinite async iterable

Yes, so it should continue to discover peers until it eventually finds enough relays to bring the number with reservations up to discoverRelays.

It sounds like you're seeing something different?

By the way, shouldn't we add peer-routing to the list of the transport dependencies if discoverRelay > 0, as of identify?

I think that would make it easier for users to discover misconfigurations, yes.

In addition, the peer store search part of the discovery is still run once, which I think, may cause problems

Could you expand on this a little?

Sep 03 '24 14:09 achingbrain

It sounds like you're seeing something different?

What I'm saying here is that, in my opinion, it's wrong to suppose random walk always succeeds. If any error is thrown inside of the random walk, peer routing, or any related code, we get a failed when finding relays on the network log and the whole relay discovery process ends forever, and there is no mechanism to restart it automatically. Or maybe I'm wrong about it?

I think that would make it easier for users to discover misconfigurations, yes.

If it's as simple as adding it to the list of transport serviceDependences symbol , I can open a PR.

Could you expand on this a little?

Based on what I understood from the relay discovery code, it has two parts: It first tries to find some relays in the peer store, and then starts the random walk. Putting my answer to the first question aside and supposing random walk always succeeds, the peer store search is still only run once. It's neither an infinite iterable like random walk, nor is re-run through some other mechanism. We may not be able to connect to peer store relays at the moment of disconnection for a reason (say a simple network partition), but we may connect them a moment after if we try. In other word, there may be no need to run random walk for a long time at all: we simply need to give the peer store another chance.

Sep 03 '24 15:09 mkermani144

Oops, seems like we needed more information for this issue, please comment with more details or this issue will be closed in 7 days.

Sep 10 '24 00:09 github-actions[bot]

My previous comment contains the answers to @achingbrain questions.

Sep 10 '24 07:09 mkermani144