go-libp2p-pubsub icon indicating copy to clipboard operation
go-libp2p-pubsub copied to clipboard

Improve handling of disconnected peers in PubSub

Open Wondertan opened this issue 1 year ago • 6 comments

Context

Right now, the implementation handles dead peers smartly but uncommonly. PubSub determines a dead peer if the write stream returns EOF or other error. This is possible due to the dual-stream model, where peers establish long-lived outbound streams to each other for writes and receive inbound streams for reads.

Problems

  • Isn't idiomatic. The canonical go-libp2p way of handling dead/disconnected peers relies on the Notfiee interface(deprecated) or EvtPeerConnectednessChanged.
  • Undocumented. The exact behavior should be brought up to specs, as the recent dual-stream requirement, and synchronized across implementations.
  • Allocates. To handle dead peers, we allocate a Reader with 4K buffers per peer. Moreover, a malicious peer can send a message and allocate a receive buffer in the Reader of up to 1M. A small yet DOS vector.

Improvement Paths

  • Migrate to the event system, specifically EvtPeerConnectednessChanged, to handle dead peers and spec it out.

Not sure what event system status is in other libp2p implementations.

  • Keep the existing way of handling dead/disconnected peers. Remove allocations and DOS vector by reading the varint only and killing the stream. Spec out the behavior. (by @vyzo )

Other ideas?

Wondertan avatar Nov 19 '22 13:11 Wondertan

The advantage of the current system is that it is very responsive and it also works at the stream level.

Migrating to use the event bus (which appeared in libp2p way after pubsub :) may be worthwhile, but I would suggest that we improve the current peer detection to remove allocation first. This should be straightforward to implement, and it is not mutually exclusive with using the event bus later.

vyzo avatar Nov 19 '22 13:11 vyzo

@vyzo, should we implement this or wait for inputs from other folks:?

Wondertan avatar Nov 19 '22 15:11 Wondertan

I think we can implement it as it is, it will fix the (minor) dos vecror.

vyzo avatar Nov 19 '22 16:11 vyzo

Ok. Follow up question. Why do we need to read specifically varint or is one byte enough? Should we kill a peer or reset a stream if the unexpected message arrives?

Wondertan avatar Nov 19 '22 16:11 Wondertan

I think just a byte is enough. Killing the stream is reasonable i think, it may well be unusable at this point and indicative of a bug on the other side.

vyzo avatar Nov 19 '22 17:11 vyzo

There is also a backoff mechanism for dealing with dead peers, so we might as well declare the peer dead and enter it.

vyzo avatar Nov 19 '22 17:11 vyzo