go-libp2p

swarm: better backoff logic

Open Stebalien opened this issue 8 years ago • 7 comments

  1. We should try to distinguish between local failures and remote failures. At the very least, we should be resetting our backoffs when new links/routes come online.
  2. We should probably be backing off on a per-multiaddr basis, not a per-peer basis, unless we establish a connection to the peer and it tells us to go away (this needs a new protocol; related to https://github.com/libp2p/go-libp2p/issues/238).

Came up in: https://github.com/libp2p/go-libp2p-kad-dht/issues/96

Stebalien avatar Oct 18 '17 20:10 Stebalien

Can we expose baseBackoffTime and maxBackoffTime? the default values are arbitrary and different applications may want different settings.

mishto avatar Jan 24 '18 23:01 mishto

Fair enough. Also, it looks like our backoffs aren't actually exponential...

Stebalien avatar Jan 26 '18 04:01 Stebalien

This will be fixed in a large refactor/simplification that's coming down the pipe.

Stebalien avatar Jan 26 '18 04:01 Stebalien

Note to self: Refund backoff "tries" after a period of time. Currently, if we go to max-backoff, wait an hour, and then fail a single dial, we'll wait the max backoff again. We should, instead, notice that an hour has passed and forget all the previous failures.

Code:

	now := time.Now()
	if sinceLast := now.Sub(bp.until); sinceLast > 0 {
		// Refund backoff time at the same rate.
		refund := int(math.Sqrt(float64((sinceLast - BackoffBase) / BackoffCoef)))
		if refund < bp.tries {
			bp.tries -= refund
		} else {
			bp.tries = 0
		}
	}

Not going to do this now because we have so many other changes in the pipeline and we may want to discuss this.

Stebalien avatar Jan 26 '18 04:01 Stebalien

Sounds good, thanks.

mishto avatar Jan 29 '18 16:01 mishto

Working through all the different backoff cases:

  • Backoff trying to find a peer.
    • This definitely belongs down in the DHT, or as a wrapper around the DHT.
  • Backoff a port/ip because a TCP dial failed.
    • This could happen inside the transport or inside the swarm itself.
      • If it happens inside the transport, we'd need a shared backoff module for backing off dialing multiaddrs with certain prefixes.
      • If it happens inside the swarm, we'd need some way to report the backoff to the swarm. We'd probably do this by returning a special error.
  • Backoff an IP when we get a "no route to IP" error.
    • Same as above.
  • Backoff a port/ip/peer triple when we end up dialing the wrong peer.
    • Same as above.
  • Backoff a peer/transport when we fail to negotiate a muxer/security transport.
    • This is an interesting case. Really, we want to backoff the entire peer for all transports using the upgrader. This is a case where applying the backoff from within the transport is really the only solution that makes sense (as the transport knows what sub-transports it uses).

Stebalien avatar Mar 03 '20 06:03 Stebalien

Status: While @petar's patches are likely the right way to go in the future, they introduce quite a few new interfaces that'll need to be discussed. In the interest of getting a fast fix in, @willscott is implementing (#191) a dumb version that just backs off full addresses inside the swarm itself without changing core libp2p interfaces.

That gives us some breathing room.

Stebalien avatar Apr 01 '20 22:04 Stebalien