swarm: better backoff logic
- We should try to distinguish between local failures and remote failures. At the very least, we should be resetting our backoffs when new links/routes come online.
- We should probably be backing off on a per-multiaddr basis, not a per-peer basis (unless we establish a connection to the peer and it tells us to go away; that would need a new protocol, related to https://github.com/libp2p/go-libp2p/issues/238).
Came up in: https://github.com/libp2p/go-libp2p-kad-dht/issues/96
Can we expose baseBackoffTime and maxBackoffTime? The default values are arbitrary, and different applications may want different settings.
Fair enough. Also, it looks like our backoffs aren't actually exponential...
This will be fixed in a large refactor/simplification that's coming down the pipe.
Note to self: Refund backoff "tries" after a period of time. Currently, if we go to max-backoff, wait an hour, and then fail a single dial, we'll wait the max backoff again. We should, instead, notice that an hour has passed and forget all the previous failures.
Code:
now := time.Now()
if sinceLast := now.Sub(bp.until); sinceLast > 0 {
	// Refund backoff time at the same rate.
	refund := int(math.Sqrt(float64((sinceLast - BackoffBase) / BackoffCoef)))
	if refund < bp.tries {
		bp.tries -= refund
	} else {
		bp.tries = 0
	}
}
Not going to do this now because we have so many other changes in the pipeline and we may want to discuss this.
Sounds good, thanks.
Working through all the different backoff cases:
- Backoff trying to find a peer.
  - This definitely belongs down in the DHT, or as a wrapper around the DHT.
- Backoff a port/ip because a TCP dial failed.
  - This could happen inside the transport or inside the swarm itself.
    - If it happens inside the transport, we'd need a shared backoff module for backing off dialing multiaddrs with certain prefixes.
    - If it happens inside the swarm, we'd need some way to report the backoff to the swarm. We'd probably do this by returning a special error.
- Backoff an IP when we get a "no route to IP" error.
  - Same as above.
- Backoff a port/ip/peer triple when we end up dialing the wrong peer.
  - Same as above.
- Backoff a peer/transport when we fail to negotiate a muxer/security transport.
  - This is an interesting case. Really, we want to back off the entire peer for all transports using the upgrader. This is a case where applying the backoff from within the transport is really the only solution that makes sense (as the transport knows what sub-transports it uses).
Status: While @petar's patches are likely the right way to go in the future, they introduce quite a few new interfaces that'll need to be discussed. In the interest of getting a fast fix in, @willscott is implementing (#191) a dumb version that just backs off full addresses inside the swarm itself without changing core libp2p interfaces.
That gives us some breathing room.