celestia-node icon indicating copy to clipboard operation
celestia-node copied to clipboard

feat(discovery): discover triggered too frequently

Open guillaumemichel opened this issue 1 year ago • 4 comments

In the case a node doesn't have its quota of peers, it will send a new discover request every second. There are no guarantees that the discover request can complete within 1 second.

https://github.com/celestiaorg/celestia-node/blob/accb0589d2de42a44de43e0974610d3a0186976d/share/p2p/discovery/discovery.go#L225-L229

This interval seems too aggressive and should probably be increased (e.g to 1 or even 5 minutes?)

guillaumemichel avatar Jul 03 '24 13:07 guillaumemichel

@guillaumemichel i agree that the interval for retrying is a bit too aggressive and could be increased, but the actual deadline to FindPeers is a minute.

https://github.com/celestiaorg/celestia-node/blob/accb0589d2de42a44de43e0974610d3a0186976d/share/p2p/discovery/discovery.go#L286-L293

renaynay avatar Jul 04 '24 09:07 renaynay

So basically there will always be a lookup running until enough peers are discovered.

guillaumemichel avatar Jul 04 '24 09:07 guillaumemichel

So basically there will always be a lookup running until enough peers are discovered.

IIRC, that was the intention, but as you mentioned that might be too aggressive

Wondertan avatar Jul 04 '24 12:07 Wondertan

Sometimes, a node fails to discover any peers on startup. Without discovered peers, the node is unable to perform most of its P2P logic. The downside of a longer cooldown is that it will cause the node application to halt for the cooldown duration in such cases. If some peers are discovered, the node should still aim to find at least 5 (our default) to rely less on the performance and availability of a single peer.

The idea behind aggressive retries is to bootstrap node into a stable network condition as soon as possible, perhaps at the cost of more resources spent on aggressive discovery. So I think the defaults should stay low.

I think it might be valuable for some users to have the ability to increase retry/timeout values if they are less concerned about the node being connected to the FN network.

walldiss avatar Jul 17 '24 19:07 walldiss