go-ethereum
go-ethereum copied to clipboard
p2p: make dial faster by streamlined discovery process
This PR improves the speed of Disc/v4 and Disc/v5 based discovery. It builds on https://github.com/ethereum/go-ethereum/pull/31592. To be rebased after merging that PR.
Our dial process is rate-limited, but until now, during startup, our discovery was too slow to serve dial. So the bottleneck was discovery, not the rate limit in dial, resulting in slow ramp-up of outgoing peer count.
This PR making discovery fast enough to serve dial's needs. The rate-limit of dial is still limiting the discovery process, so we are not risking being too aggressive in our discovery.
The PR adds:
- a pre-fetch buffer to discovery sources, eliminating slowdowns due to timeouts and rate mismatch between the two precesses
- multiple parallel lookups to make sure enough dial candidates are available all the time
First hour of dial progress before this PR: number of outgoing peers
First hour of dial progress after this PR (and https://github.com/ethereum/go-ethereum/pull/31592): number of outgoing peers
Discovery traffic before:
And after:
This PR increases the initial UDP traffic, mostly caused by the pre-fetch buffer. We might dial that down a bit, although I don't think the amount of traffic generated (~70 KB/s) is too much. On the long term, average discovery traffic should be the same.
@fjl I was thinking about the increased UDP traffic. There is the case of a very small network, smaller than maxpeers. For example a small test network with only a few nodes. In that case discovery might go full-steam continuously, looking for new dial candidates. In this specific case this PR would increase the background UDP traffic by roughly 3x.
If we think this is a problem, we might use discoveryParallelLookups=2 to be more conservative. Or should we maybe introduce an explicit "smallnet" flag and tune parameters accordingly? Or can we just expect people to set maxpeers wisely for a small network? I would say the last option would be fine, but we need some mods to make that work. See below.
A short description of what (supposed to) happen in detail. There are two cases:
- a small net using the large public disc-v4/v5 DHT.
- a small net using a private disc-v4/v5 DHT. Before https://github.com/ethereum/go-ethereum/pull/31592 these would behave differently, but after adding the filter on discv4, both should behave the same, except for the number of discarded peers during discovery.
In discovery, we only have duplicate filtering in lookup, by the seen cache mechanism. However, the lookupIterator will start new lookups, and these can return the same nodes again. There is no duplicate filtering between sequential lookups, or between parallel lookups as introduced here by discoveryParallelLookups. So discovery can go on full-steam, feeding "new" (which are the same as the old) dial candidates.
Dial pulls from this, and has its own duplicate detection through dialHistoryExpiration. When it finds a duplicate within the expiration time, it throws it away and pulls more candidates from discovery. It only throttles down when it starts to get closer to maxpeers, but that will not happen if maxpeers is significantly larger than the network size.
If instead maxpeers is set correctly, dial will try to slow down. It will reduce slots. But even with a single slot, it will pull and discard in a loop, making discovery go full steam. Should we introduce some time-based mechanism here? E.g. throttle if checkDial fails too many times in a row?
Testing on a small network (Sepolia), this is the first hour of operation before this PR:
And after this PR:
Separated from #31944 . Now this PR only contains the parallelized lookup part.
I think we have all of this in already.
Actually the parallelised lookup part is not yet merged. I'm reopening this to rebase what's still missing.