go-libp2p swarm: better backoff logic

We should try to distinguish between local failures and remote failures. At the very least, we should be resetting our backoffs when new links/routes come online.
We should probably be backing off on a per multiaddr basis, not a per peer basis (unless we establish a connection to the peer and it tells us to go away; that needs a new protocol, related to https://github.com/libp2p/go-libp2p/issues/238).
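
A rough sketch of what per-address backoff with a reset hook could look like. Every name here (`addrBackoff`, `Fail`, `Blocked`, `Reset`) is hypothetical and not part of the swarm; addresses are plain strings to keep the example free of dependencies, and the delay is a fixed value rather than a growing one.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// addrBackoff tracks dial failures per address string rather than per peer.
// Hypothetical type for illustration only.
type addrBackoff struct {
	mu      sync.Mutex
	delay   time.Duration
	entries map[string]time.Time // address -> time until which dials are skipped
}

func newAddrBackoff(delay time.Duration) *addrBackoff {
	return &addrBackoff{delay: delay, entries: make(map[string]time.Time)}
}

// Fail records a failed dial to addr.
func (b *addrBackoff) Fail(addr string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.entries[addr] = time.Now().Add(b.delay)
}

// Blocked reports whether addr is still backing off.
func (b *addrBackoff) Blocked(addr string) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	until, ok := b.entries[addr]
	return ok && time.Now().Before(until)
}

// Reset forgets every backoff, e.g. when a new link or route comes online
// and the earlier failures may have been local rather than remote.
func (b *addrBackoff) Reset() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.entries = make(map[string]time.Time)
}

func main() {
	b := newAddrBackoff(5 * time.Second)
	b.Fail("/ip4/10.0.0.1/tcp/4001")
	fmt.Println(b.Blocked("/ip4/10.0.0.1/tcp/4001")) // the failed address is skipped
	fmt.Println(b.Blocked("/ip4/10.0.0.2/tcp/4001")) // other addresses of the same peer are not
	b.Reset()
	fmt.Println(b.Blocked("/ip4/10.0.0.1/tcp/4001")) // cleared after a network change
}
```

Keying on the address means one bad address does not block the peer's other addresses, and `Reset` is the hook a "new route came online" notification would call.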

Came up in: https://github.com/libp2p/go-libp2p-kad-dht/issues/96

Oct 18 '17 20:10 Stebalien

Can we expose baseBackoffTime and maxBackoffTime? The default values are arbitrary, and different applications may want different settings.

Jan 24 '18 23:01 mishto

Fair enough. Also, it looks like our backoffs aren't actually exponential...
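
For comparison, a minimal sketch of a truly exponential schedule with both knobs exposed. `BackoffConfig`, `Base`, `Max`, and `Delay` are made-up names standing in for baseBackoffTime and maxBackoffTime, and the 5s/5m values are placeholders, not the current defaults.

```go
package main

import (
	"fmt"
	"time"
)

// BackoffConfig holds the tunables an application might want to override.
// Hypothetical type; the field names are not from the swarm.
type BackoffConfig struct {
	Base time.Duration // delay after the first failure
	Max  time.Duration // upper bound on any delay
}

// Delay returns the wait after `tries` prior failures:
// Base * 2^tries, capped at Max.
func (c BackoffConfig) Delay(tries int) time.Duration {
	d := c.Base
	for i := 0; i < tries; i++ {
		// Stop before doubling would exceed Max; this also avoids overflow.
		if d > c.Max/2 {
			return c.Max
		}
		d *= 2
	}
	if d > c.Max {
		return c.Max
	}
	return d
}

func main() {
	cfg := BackoffConfig{Base: 5 * time.Second, Max: 5 * time.Minute}
	for tries := 0; tries <= 7; tries++ {
		fmt.Println(tries, cfg.Delay(tries))
	}
}
```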

Jan 26 '18 04:01 Stebalien

This will be fixed in a large refactor/simplification that's coming down the pipe.

Jan 26 '18 04:01 Stebalien

Note to self: Refund backoff "tries" after a period of time. Currently, if we go to max-backoff, wait an hour, and then fail a single dial, we'll wait the max backoff again. We should, instead, notice that an hour has passed and forget all the previous failures.

Code:

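	now := time.Now()
	// Only refund once more than BackoffBase has elapsed since the backoff
	// expired; otherwise the Sqrt argument is negative and yields NaN.
	if sinceLast := now.Sub(bp.until); sinceLast > BackoffBase {
		// Refund backoff time at the same rate.
		refund := int(math.Sqrt(float64((sinceLast - BackoffBase) / BackoffCoef)))
		if refund < bp.tries {
			bp.tries -= refund
		} else {
			bp.tries = 0
		}
	}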

Not going to do this now because we have so many other changes in the pipeline and we may want to discuss this.
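
To make the idea easier to discuss, here is the same refund logic as a standalone function. It assumes the delay grows as base + coef * tries^2, which is what the square root above would invert; `refundTries`, the lowercase constants, and their 5s/1s values are placeholders for illustration.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// Assumed schedule: delay = backoffBase + backoffCoef*tries^2.
// The values are placeholders, not necessarily the swarm's defaults.
const (
	backoffBase = 5 * time.Second
	backoffCoef = time.Second
)

// refundTries returns how many tries remain after `idle` time has passed
// since the last backoff expired. Tries are refunded at the rate they were
// charged, so a long quiet period forgets all earlier failures.
func refundTries(tries int, idle time.Duration) int {
	if idle <= backoffBase {
		return tries // too soon to refund anything
	}
	refund := int(math.Sqrt(float64((idle - backoffBase) / backoffCoef)))
	if refund >= tries {
		return 0
	}
	return tries - refund
}

func main() {
	fmt.Println(refundTries(17, time.Hour))       // long idle period
	fmt.Println(refundTries(10, 30*time.Second))  // partial refund
	fmt.Println(refundTries(10, 3*time.Second))   // shorter than the base delay
}
```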

Jan 26 '18 04:01 Stebalien

Sounds good, thanks.

Jan 29 '18 16:01 mishto

Working through all the different backoff cases:

- Backoff trying to find a peer.
  - This definitely belongs down in the DHT, or as a wrapper around the DHT.
- Backoff a port/ip because a TCP dial failed.
  - This could happen inside the transport or inside the swarm itself.
    - If it happens inside the transport, we'd need a shared backoff module for backing off dialing multiaddrs with certain prefixes.
    - If it happens inside the swarm, we'd need some way to report the backoff to the swarm. We'd probably do this by returning a special error.
- Backoff an IP when we get a "no route to IP" error.
  - Same as above.
- Backoff a port/ip/peer triple when we end up dialing the wrong peer.
  - Same as above.
- Backoff a peer/transport when we fail to negotiate a muxer/security transport.
  - This is an interesting case. Really, we want to back off the entire peer for all transports using the upgrader. This is a case where applying the backoff from within the transport is really the only solution that makes sense (as the transport knows what sub-transports it uses).
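
One possible shape for the "special error" option, as a sketch only. `BackoffError`, `BackoffScope`, and `scopeOf` are invented for this example and are not existing libp2p interfaces; the three scopes cover only some of the cases listed above.

```go
package main

import (
	"errors"
	"fmt"
)

// BackoffScope says what a failed dial should cause the swarm to back off.
// Hypothetical type for illustration.
type BackoffScope int

const (
	ScopeAddr BackoffScope = iota // one full address (e.g. a TCP dial failed)
	ScopeIP                       // every address on an IP (e.g. no route to it)
	ScopePeer                     // the whole peer (e.g. muxer/security negotiation failed)
)

// BackoffError wraps a dial error with the scope the transport recommends.
type BackoffError struct {
	Scope BackoffScope
	Err   error
}

func (e *BackoffError) Error() string {
	return fmt.Sprintf("backoff scope %d: %v", e.Scope, e.Err)
}

func (e *BackoffError) Unwrap() error { return e.Err }

// scopeOf extracts the recommended scope from err, falling back to a
// single-address backoff when the transport gave no hint.
func scopeOf(err error) BackoffScope {
	var be *BackoffError
	if errors.As(err, &be) {
		return be.Scope
	}
	return ScopeAddr
}

func main() {
	inner := &BackoffError{Scope: ScopePeer, Err: errors.New("no common muxer")}
	err := fmt.Errorf("dial failed: %w", inner)
	fmt.Println(scopeOf(err) == ScopePeer)
	fmt.Println(scopeOf(errors.New("connection refused")) == ScopeAddr)
}
```

Because the scope travels inside the error chain, the swarm could apply it with `errors.As` without the transport interface itself changing.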

Mar 03 '20 06:03 Stebalien

Status: While @petar's patches are likely the right way to go in the future, they introduce quite a few new interfaces that'll need to be discussed. In the interest of getting a fast fix in, @willscott is implementing (#191) a dumb version that just backs off full addresses inside the swarm itself without changing core libp2p interfaces.

That gives us some breathing room.

Apr 01 '20 22:04 Stebalien