Reevaluate MessageQueue dial logic
Currently the MessageQueue will wait up to 10 minutes for a dial to a peer to succeed: https://github.com/ipfs/go-bitswap/blob/5030b8dd39738778c2cc621a1d7b35cca010ba28/internal/messagequeue/messagequeue.go#L471-L477
However IIUC we only ever try to send messages to peers that we are already connected to.
Proposal:
- remove the retry logic in MessageQueue
- if the dial times out, remove the peer from the session
Do we remove peers from sessions when they disconnect?
Note: We're still probably going to want a retry in the message sender as the first connection we try may actually be dead, even if we have a separate connection. Unfortunately, we can easily get into a state where:
- Our connection gets cut (e.g., some nasty middlebox sens a reset or the other peer just decides to kill it).
- The other peer reconnects before we notice (1). We now have two connections.
- The other peer asks us for a block using the new connection.
- We try to send it back on the old connection.
Currently, we'll fail at step 4 and retry. If we removed the retry, we'd just drop the message entirely.
Really, we need to re-evaluate how we handle "connections" and state. (https://github.com/ipfs/specs/pull/201)
Do we remove peers from sessions when they disconnect?
Yes, the PeerManager tells the Session when a peer disconnects
- The other peer asks us for a block using the new connection.
- We try to send it back on the old connection.
Note that the reconnect code referred to above is on the client side: the 10 minute dial timeout is for when we're sending a want.
MessageQueue will attempt to send the message 10 times. So in the scenario you're describing of our connection failing, and the other side opening a new connection, we would expect:
- the first send attempt to fail
- the second attempt to reconnect (using the new connection)
- successful send
Is that correct?
Note that the reconnect code referred to above is on the client side: the 10 minute dial timeout is for when we're sending a want.
Oh. Never mind.
Is that correct?
Probably? Technically, we could have a couple of failures (multiple semi-dead connections). Really, I'd try several times while we think we're connected.
At the very least, we can remove the connect logic. Really, we can (and probably should?) use https://godoc.org/github.com/libp2p/go-libp2p-core/network#WithNoDial to avoid dialing.