go-bitswap icon indicating copy to clipboard operation
go-bitswap copied to clipboard

Reevaluate MessageQueue dial logic

Open dirkmc opened this issue 5 years ago • 3 comments

Currently the MessageQueue will wait up to 10 minutes for a dial to a peer to succeed: https://github.com/ipfs/go-bitswap/blob/5030b8dd39738778c2cc621a1d7b35cca010ba28/internal/messagequeue/messagequeue.go#L471-L477

However IIUC we only ever try to send messages to peers that we are already connected to.

Proposal:

  • remove the retry logic in MessageQueue
  • if the dial times out, remove the peer from the session

dirkmc avatar Feb 03 '20 15:02 dirkmc

Do we remove peers from sessions when they disconnect?

Note: We're still probably going to want a retry in the message sender as the first connection we try may actually be dead, even if we have a separate connection. Unfortunately, we can easily get into a state where:

  1. Our connection gets cut (e.g., some nasty middlebox sens a reset or the other peer just decides to kill it).
  2. The other peer reconnects before we notice (1). We now have two connections.
  3. The other peer asks us for a block using the new connection.
  4. We try to send it back on the old connection.

Currently, we'll fail at step 4 and retry. If we removed the retry, we'd just drop the message entirely.


Really, we need to re-evaluate how we handle "connections" and state. (https://github.com/ipfs/specs/pull/201)

Stebalien avatar Feb 05 '20 00:02 Stebalien

Do we remove peers from sessions when they disconnect?

Yes, the PeerManager tells the Session when a peer disconnects

  1. The other peer asks us for a block using the new connection.
  2. We try to send it back on the old connection.

Note that the reconnect code referred to above is on the client side: the 10 minute dial timeout is for when we're sending a want.

MessageQueue will attempt to send the message 10 times. So in the scenario you're describing of our connection failing, and the other side opening a new connection, we would expect:

  • the first send attempt to fail
  • the second attempt to reconnect (using the new connection)
  • successful send

Is that correct?

dirkmc avatar Feb 05 '20 14:02 dirkmc

Note that the reconnect code referred to above is on the client side: the 10 minute dial timeout is for when we're sending a want.

Oh. Never mind.

Is that correct?

Probably? Technically, we could have a couple of failures (multiple semi-dead connections). Really, I'd try several times while we think we're connected.

At the very least, we can remove the connect logic. Really, we can (and probably should?) use https://godoc.org/github.com/libp2p/go-libp2p-core/network#WithNoDial to avoid dialing.

Stebalien avatar Feb 05 '20 18:02 Stebalien