swarm: dial priorities
So, it turns out we really need dial priorities, given the dial limits we have in place.
The behavior I'm noticing is:
1. We fire off a FindProviders.
2. This launches a ton of dials.
3. We get back a stream of providers.
4. We try to connect to one of these providers.
5. We get stuck behind the dials from 2. This finally succeeds almost exactly a minute later because that's when all the dials from 2 time out.
One part of this is reducing the number of dials in 2. However, that won't help systems like our gateways, which handle a bunch of parallel requests.
Another way to improve this is to introduce priorities. That is, prioritize dialing some peers over others. That way, DHT requests don't block bitswap requests.
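To make the idea concrete, here's a rough sketch of what a priority-ordered dial queue could look like. This is not the swarm's actual code: `dialJob`, `dialQueue`, `drainOrder`, and the priority values are all hypothetical names made up for illustration, and a real limiter would also have to track in-flight dials and per-peer state. It only shows the ordering: higher-priority dials leave the queue first, and dials of equal priority stay FIFO.

```go
package main

import (
	"container/heap"
	"fmt"
)

// dialJob is a hypothetical pending dial request (not a libp2p type).
type dialJob struct {
	peer     string
	priority int // higher means more urgent (assumed convention)
	seq      int // insertion order, used as a FIFO tie-breaker
}

// dialQueue implements heap.Interface over pending dials.
type dialQueue []*dialJob

func (q dialQueue) Len() int { return len(q) }

// Less puts higher priority first; equal priorities keep insertion order.
func (q dialQueue) Less(i, j int) bool {
	if q[i].priority != q[j].priority {
		return q[i].priority > q[j].priority
	}
	return q[i].seq < q[j].seq
}

func (q dialQueue) Swap(i, j int) { q[i], q[j] = q[j], q[i] }

func (q *dialQueue) Push(x any) { *q = append(*q, x.(*dialJob)) }

func (q *dialQueue) Pop() any {
	old := *q
	n := len(old)
	job := old[n-1]
	*q = old[:n-1]
	return job
}

// drainOrder returns the peers in the order a limited dialer would start
// them if it always took the most urgent waiting job next.
func drainOrder(jobs []dialJob) []string {
	q := &dialQueue{}
	for i := range jobs {
		job := jobs[i] // copy so the caller's slice is not modified
		job.seq = i
		heap.Push(q, &job)
	}
	var order []string
	for q.Len() > 0 {
		order = append(order, heap.Pop(q).(*dialJob).peer)
	}
	return order
}

func main() {
	// Two low-priority DHT dials are queued before a bitswap dial arrives.
	jobs := []dialJob{
		{peer: "dht-1", priority: 0},
		{peer: "dht-2", priority: 0},
		{peer: "bitswap-1", priority: 1},
	}
	fmt.Println(drainOrder(jobs))
}
```

With something like this sitting in front of the dial limiter, a bitswap dial queued after a burst of DHT dials would take the next free slot instead of waiting for all of them to time out.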
The ideal solution is not having to limit the number of simultaneous dials, but we're always going to be somewhat restricted here, even with UDP-based protocols, if we want to stop killing routers.
Note: I've reproduced this locally.
Where does that one minute dial timeout come from? In QUIC, I’m using a handshake timeout of 10 seconds, and I’m even considering reducing that in case the peer is offline (and I don’t receive any packet). One minute seems way too long to be useful as a dial timeout.
The default timeout is in libp2p-transport: https://github.com/libp2p/go-libp2p-transport/blob/c45cca89b916558b5aff9cb46d2d903b3cc78e3a/transport.go#L18
@marten-seemann it's the timeout for dialing a peer, including all transports, waiting, etc.
(we have a 5 second TCP connect timeout)
Do we have any data that suggests that we actually need that much time? Running a cryptographic handshake and negotiating a stream muxer should not take remotely close to 55 seconds. And if it actually does, I don't think this is a peer we want to be connected to.
Hm. Actually, I think we're talking about multiple timeouts here. I agree the one @magik6k referenced should probably be shorter in many cases. However, that was really intended to be an absolute maximum where individual transports would try to estimate a better bound based on the expected number of round trips.
The timeout I'm talking about is https://github.com/libp2p/go-libp2p-net/blob/11b9dd9287bf6b9944c4e77d941b4771a6179678/timeouts.go#L11. Really, it's an upper bound on how long a service/user might care to wait when dialing a peer.
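To illustrate the distinction between the two kinds of timeouts, here's a minimal sketch using nested contexts. Nothing here is libp2p API: `dialWithTimeouts`, `runDemo`, and the durations are hypothetical and chosen only for the example. The outer deadline stands in for the upper bound a caller is willing to wait on the whole dial, and the inner one for a tighter per-attempt bound like a transport's connect timeout.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// dialWithTimeouts tries each attempt in turn. The whole call is bounded by
// overall, and each individual attempt by perAttempt (hypothetical helper).
func dialWithTimeouts(
	ctx context.Context,
	overall, perAttempt time.Duration,
	attempts []func(context.Context) error,
) error {
	ctx, cancel := context.WithTimeout(ctx, overall)
	defer cancel()

	var lastErr error = errors.New("no addresses to dial")
	for _, attempt := range attempts {
		if err := ctx.Err(); err != nil {
			return err // overall budget exhausted
		}
		// The child context expires at whichever deadline comes first.
		actx, acancel := context.WithTimeout(ctx, perAttempt)
		err := attempt(actx)
		acancel()
		if err == nil {
			return nil
		}
		lastErr = err
	}
	return lastErr
}

// runDemo simulates one unresponsive address followed by a reachable one.
func runDemo() error {
	unresponsive := func(ctx context.Context) error {
		<-ctx.Done() // never answers; gives up when its context expires
		return ctx.Err()
	}
	reachable := func(ctx context.Context) error { return nil }

	// Durations are arbitrary and kept short so the demo finishes quickly.
	return dialWithTimeouts(
		context.Background(),
		time.Second,
		50*time.Millisecond,
		[]func(context.Context) error{unresponsive, reachable},
	)
}

func main() {
	fmt.Println("dial error:", runDemo())
}
```

The point is that an unresponsive address only costs its own per-attempt budget, so the overall bound can stay generous without every failed attempt holding a dial slot for that long.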
Check out libp2p/go-libp2p#1547.