go-libp2p
swarm: emit PeerConnectedness event from swarm instead of from hosts
Fixes #1571.
This was surprisingly easy. Unfortunately, I can't find a way to test the deduplication (that no event is emitted when a second connection is dialed), since the swarm is too good at preventing dials to hosts we're already connected to.
So, I think the only way to sanely do this is to write something like: https://github.com/ipfs/go-bitswap/pull/565/files. We probably don't want to block some global queue while sending events.
That is:
- On connect/disconnect, mark the peer as connected/disconnected in some tracker. Add the peer to some "todo" queue, and tell the background process that it needs to wake up.
- In a background event loop, fire connect/disconnect events (see the sketch below).
Even if this background process falls behind (someone is blocking events), that's fine. We'll just always fire the "last" event. This does mean we can get unmatched connected/disconnected events, but, IMO, that's fine.
I.e.:
- Connect.
- Disconnect.
- Connect.
- Disconnect.
May be seen as:
- Connect.
- Connect.
- Disconnect.
Or
- Disconnect.
- Disconnect.
Or
- Connect.
- Disconnect.
Or
- Disconnect.
But never:
- Connect.
- Disconnect.
- Connect.
Importantly, the final state is always accurate, which is what users care about. At the moment, it's possible for the last event fired to be "connect", which could cause memory leaks.
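A minimal sketch of this tracker-plus-background-loop pattern (the `connTracker` type and its `emit` callback are hypothetical, not the actual go-libp2p API; only `peer.ID` and `network.Connectedness` are real):

```go
package tracker

import (
	"sync"

	"github.com/libp2p/go-libp2p/core/network"
	"github.com/libp2p/go-libp2p/core/peer"
)

// connTracker records the latest connectedness per peer and emits events
// from a background goroutine, so a slow subscriber can never block the
// swarm. Events for a peer are coalesced: only the latest state is fired.
type connTracker struct {
	mu      sync.Mutex
	state   map[peer.ID]network.Connectedness // current state per peer
	emitted map[peer.ID]network.Connectedness // last state we emitted
	todo    []peer.ID                         // peers with pending events
	wake    chan struct{}                     // wakes the background loop
	emit    func(peer.ID, network.Connectedness)
}

func newConnTracker(emit func(peer.ID, network.Connectedness)) *connTracker {
	t := &connTracker{
		state:   make(map[peer.ID]network.Connectedness),
		emitted: make(map[peer.ID]network.Connectedness),
		wake:    make(chan struct{}, 1),
		emit:    emit,
	}
	go t.loop()
	return t
}

// mark is called from the connect/disconnect notifications: record the
// new state, enqueue the peer, and nudge the background loop.
func (t *connTracker) mark(p peer.ID, s network.Connectedness) {
	t.mu.Lock()
	t.state[p] = s
	t.todo = append(t.todo, p)
	t.mu.Unlock()
	select {
	case t.wake <- struct{}{}:
	default: // loop is already scheduled to wake
	}
}

func (t *connTracker) loop() {
	for range t.wake {
		t.mu.Lock()
		todo := t.todo
		t.todo = nil
		t.mu.Unlock()
		for _, p := range todo {
			t.mu.Lock()
			s := t.state[p]
			last, seen := t.emitted[p]
			t.emitted[p] = s
			t.mu.Unlock()
			if seen && last == s {
				// State unchanged since the last event: this is where
				// intermediate connect/disconnect pairs get coalesced.
				continue
			}
			// Even if we fell behind, s is the state *now*, so the
			// final event a subscriber sees is always accurate.
			t.emit(p, s)
		}
	}
}
```

Note that `emitted` only ever grows in this sketch; a real implementation would have to bound that tracking structure, e.g. by dropping entries for fully disconnected peers.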
Rebased this PR on top of the current master.
I think emitting the event while holding the conns mutex should be fine:
- It's correct, since `swarm.conns.m[peer.ID]` is the source of truth: once we're down to 0 connections in this list, we're officially disconnected from the peer.
- It's ok performance-wise, because emitting events is cheap, as long as the subscribers consume events quickly enough. We could (somewhat) protect against slow subscribers by introducing a worker Go routine and another tracking data structure, as suggested by @Stebalien in https://github.com/libp2p/go-libp2p/pull/1574#issuecomment-1143048078, but that data structure would need to be limited in size somehow, so this is no better than using a larger channel for the event bus subscriber. Using the new event bus dashboard (#2038), it should be easy to detect a misconfiguration there.
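A simplified sketch of this approach, using stand-in types for the swarm internals (the field and method names here are illustrative; `event.EvtPeerConnectednessChanged`, `network.NotConnected`, and the `event.Emitter` interface are the real go-libp2p API):

```go
package swarmsketch

import (
	"sync"

	"github.com/libp2p/go-libp2p/core/event"
	"github.com/libp2p/go-libp2p/core/network"
	"github.com/libp2p/go-libp2p/core/peer"
)

// Stand-ins for the swarm's internals.
type conn struct{ remote peer.ID }

type swarm struct {
	conns struct {
		sync.Mutex
		m map[peer.ID][]*conn // the source of truth for connectedness
	}
	emitter event.Emitter
}

// removeConn illustrates the approach: the connection-count check and the
// event emission happen under the same lock, so no other connect or
// disconnect can interleave between them.
func (s *swarm) removeConn(c *conn) {
	p := c.remote
	s.conns.Lock()
	defer s.conns.Unlock()

	cs := s.conns.m[p]
	for i, other := range cs {
		if other == c {
			cs = append(cs[:i], cs[i+1:]...)
			break
		}
	}
	if len(cs) > 0 {
		s.conns.m[p] = cs
		return
	}
	delete(s.conns.m, p)
	// Down to 0 connections: we're officially disconnected, so emit the
	// event before releasing the lock. This is cheap, as long as
	// subscribers drain their channels quickly enough.
	s.emitter.Emit(event.EvtPeerConnectednessChanged{
		Peer:          p,
		Connectedness: network.NotConnected,
	})
}
```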
The following sequence can show up in two different ways:
- Connect.
- Disconnect.
- Connect.
- Disconnect.
First way:
- Connect.
- Disconnect.
- Connect.
- Disconnect.
Second way (when the two connects are processed first, so the second Connect and the second Disconnect are deduplicated):
- Connect.
- Disconnect.
I introduced the change in the last commit. @MarcoPolo, @Stebalien, does this reasoning make sense to you?
Make a new test transport that wraps an existing transport. Put two multiaddrs for this test transport (they might have to be slightly different if we're deduping these somewhere). In the test transport, have the dials synchronize on a WaitGroup so that both return at the same time.
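A hypothetical sketch of that test transport (the wrapper type and WaitGroup wiring are made up for illustration; `transport.Transport` and its `Dial` signature are the real interface):

```go
package testutil

import (
	"context"
	"sync"

	"github.com/libp2p/go-libp2p/core/peer"
	"github.com/libp2p/go-libp2p/core/transport"

	ma "github.com/multiformats/go-multiaddr"
)

// syncTransport wraps a real transport and holds every completed dial
// until all expected dials have completed, so the swarm receives both
// connections to the same peer at (nearly) the same time.
type syncTransport struct {
	transport.Transport
	ready *sync.WaitGroup // created with Add(2) and shared between both dials
}

func (t *syncTransport) Dial(ctx context.Context, raddr ma.Multiaddr, p peer.ID) (transport.CapableConn, error) {
	conn, err := t.Transport.Dial(ctx, raddr, p)
	// Rendezvous: signal that this dial finished, then wait for the
	// other one before handing the connection back to the swarm.
	t.ready.Done()
	t.ready.Wait()
	return conn, err
}
```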
That doesn't work either. The dial worker loop deduplicates somewhere, and I have no intention of diving into that mess to figure out where.