go-libp2p swarm: dial priorities

So, it turns out we really need dial priorities, given the dial limits we have in place.

The behavior I'm noticing is:

1. We fire off a FindProviders.
2. This launches a ton of dials.
3. We get back a stream of providers.
4. We try to connect to one of these providers.
5. We get stuck behind the dials from step 2. The connection finally succeeds almost exactly a minute later, because that's when all the dials from step 2 time out.

One part of this is reducing the number of dials in step 2. However, that won't help systems like our gateways, which handle a bunch of parallel requests.

Another way to improve this is to introduce priorities. That is, prioritize dialing some peers over others. That way, DHT requests don't block bitswap requests.
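To make the idea concrete, here is a rough sketch of a dial queue that respects a concurrency limit but always hands out the highest-priority waiting dial first. Everything in it (`Priority`, `PriorityDHT`, `PriorityBitswap`, `DialQueue`, and its methods) is a hypothetical name for illustration, not an existing go-libp2p-swarm API. It is also not goroutine-safe; a real version would need locking.

```go
package main

import "container/heap"

// Priority is a hypothetical dial priority; higher values are dialed first.
type Priority int

const (
	PriorityDHT     Priority = iota // background routing dials
	PriorityBitswap                 // dials a user request is waiting on
)

// dialJob is a hypothetical pending dial request.
type dialJob struct {
	peer     string
	priority Priority
	seq      uint64 // insertion order, keeps FIFO among equal priorities
}

// jobHeap implements heap.Interface, ordering by priority, then by age.
type jobHeap []*dialJob

func (h jobHeap) Len() int { return len(h) }
func (h jobHeap) Less(i, j int) bool {
	if h[i].priority != h[j].priority {
		return h[i].priority > h[j].priority
	}
	return h[i].seq < h[j].seq
}
func (h jobHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *jobHeap) Push(x interface{}) { *h = append(*h, x.(*dialJob)) }
func (h *jobHeap) Pop() interface{} {
	old := *h
	n := len(old)
	job := old[n-1]
	old[n-1] = nil
	*h = old[:n-1]
	return job
}

// DialQueue allows at most `limit` dials in flight and releases
// waiting dials in priority order.
type DialQueue struct {
	limit    int
	inFlight int
	nextSeq  uint64
	waiting  jobHeap
}

func NewDialQueue(limit int) *DialQueue {
	return &DialQueue{limit: limit}
}

// Enqueue registers a dial request for the given peer.
func (q *DialQueue) Enqueue(peer string, p Priority) {
	heap.Push(&q.waiting, &dialJob{peer: peer, priority: p, seq: q.nextSeq})
	q.nextSeq++
}

// Next returns the next peer to dial and takes a slot. It returns false
// when the limit is reached or nothing is waiting.
func (q *DialQueue) Next() (string, bool) {
	if q.inFlight >= q.limit || q.waiting.Len() == 0 {
		return "", false
	}
	q.inFlight++
	return heap.Pop(&q.waiting).(*dialJob).peer, true
}

// Done releases the slot held by a finished (or timed-out) dial.
func (q *DialQueue) Done() {
	if q.inFlight > 0 {
		q.inFlight--
	}
}
```

With this shape, a provider dial enqueued after a pile of DHT dials still gets the next free slot. It does not preempt dials that are already in flight, so on its own it would not fully fix the case where every slot is already occupied by slow DHT dials.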

The ideal solution is not having to limit the number of simultaneous dials, but we're always going to be somewhat restricted here, even with UDP-based protocols, if we want to stop killing routers.

Jan 26 '19 00:01 Stebalien

Note: I've reproduced this locally.

Jan 26 '19 00:01 Stebalien

Where does that one minute dial timeout come from? In QUIC, I’m using a handshake timeout of 10 seconds, and I’m even considering reducing that in case the peer is offline (and I don’t receive any packet). One minute seems way too long to be useful as a dial timeout.

Jan 26 '19 01:01 marten-seemann

The default timeout is in libp2p-transport: https://github.com/libp2p/go-libp2p-transport/blob/c45cca89b916558b5aff9cb46d2d903b3cc78e3a/transport.go#L18

Jan 26 '19 01:01 magik6k

@marten-seemann it's the timeout for dialing a peer, including all transports, waiting, etc.

Jan 26 '19 19:01 Stebalien

(we have a 5 second TCP connect timeout)

Jan 26 '19 19:01 Stebalien

Do we have any data that suggests that we actually need that much time? Running a cryptographic handshake and negotiating a stream muxer should not take remotely close to 55 seconds. And if it actually does, I don't think this is a peer we want to be connected to.

Jan 28 '19 02:01 marten-seemann

Hm. Actually, I think we're talking about multiple timeouts here. I agree the one @magik6k referenced should probably be shorter in many cases. However, that was really intended to be an absolute maximum where individual transports would try to estimate a better bound based on the expected number of round trips.

The timeout I'm talking about is https://github.com/libp2p/go-libp2p-net/blob/11b9dd9287bf6b9944c4e77d941b4771a6179678/timeouts.go#L11. Really, it's an upper bound on how long a service/user might care to wait when dialing a peer.

Check out libp2p/go-libp2p#1547.

Jan 28 '19 16:01 Stebalien