ssb-server
local peers get stuck in reconnect loops (sometimes)
As identified here with @dominictarr on the LAN:

My theory is that this is a race condition where both peers are attempting to connect to each other at the same time, and then discover that they are already connected, so disconnect. Or something like that.
One idea to fix this would be to retry connecting after a slightly randomized delay.
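Something like this, as a rough sketch (the reconnect helper and the base delay are made up for illustration, not existing APIs):

```js
// Rough sketch: wait a base delay plus random jitter before reconnecting,
// so two peers that drop the connection at the same moment are unlikely
// to race each other on the reconnect as well.
function scheduleReconnect (peer, baseDelay) {
  var jitter = Math.random() * baseDelay // up to 100% extra delay
  setTimeout(function () {
    reconnect(peer) // hypothetical helper that re-dials the peer
  }, baseDelay + jitter)
}
```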
I've seen this before.
The whole gossip schedule could use a good rewrite.
this might also be related to EBT interacting with createHistoryStream replication. /cc @pietgeursen
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@mmckegg Have you noticed this in a recent version of Scuttlebot? I remember that I used to see it all the time, but not anymore.
Hmm, not sure. Will try this out at art~hack tomorrow with a few peers.
Seems to be caused by local peers being misidentified as legacy
and then disconnected from here:
https://github.com/ssbc/ssb-gossip/blob/5452a2625befad1ab7ae5e3e047df2f27f7b02fa/schedule.js#L220-L225
The code should be updated to something like this:
//*---------------- add isLocal exclusion-------------vvvvvv
connect(peers, ts, 'legacy', and(exports.isLegacy, not(exports.isLocal)), {
  quota: 3 - longterm,
  factor: 5*min, max: 3*hour, groupMin: 5*min,
})
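For context, and and not above are the small predicate combinators from schedule.js; paraphrased, they compose peer filters roughly like this (a sketch, not the exact source):

```js
// Paraphrased sketch of the predicate combinators used in schedule.js.
function not (fn) {
  return function (peer) { return !fn(peer) }
}

function and () {
  var predicates = [].slice.call(arguments)
  return function (peer) {
    return predicates.every(function (fn) { return fn(peer) })
  }
}

// So the filter above selects peers that look legacy *and* are not on the local network.
```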
Though not sure why peers are getting identified as legacy, strange!
legacy peers are identified by whether they support gossip.ping
(which is used to calculate the roundtrip time and clock skew)
isLegacy checks if they have a duration set (created when a connection closes) and if they are not a "long term" peer https://github.com/ssbc/ssb-gossip/blob/master/schedule.js#L78
isLongterm means they support gossip.ping.
This calculates clock skew and round trip time,
but also keeps the TCP connection alive.
https://github.com/ssbc/ssb-gossip/blob/master/schedule.js#L71
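Paraphrasing the two predicates being discussed, they look roughly like this (a sketch of the linked schedule.js code, not the exact source):

```js
// A peer is "longterm" if gossip.ping has succeeded, i.e. we have round trip stats for it.
exports.isLongterm = function (peer) {
  return peer.ping && peer.ping.rtt && peer.ping.rtt.mean > 0
}

// A peer is "legacy" if we have connection stats for it (a duration recorded when a
// connection closed) but it never answered gossip.ping.
exports.isLegacy = function (peer) {
  return hasStats(peer) && !exports.isLongterm(peer) // hasStats: hypothetical name for the duration check
}
```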
do these peers have stats? you should be able to see by looking at the output of gossip.peers
I have a new report for this issue.
Background
I've just noticed this issue (or a variant of it) when LAN-connecting Patchwork 3.14.1 with Manyverse (my latest dev branch), but my hypothesis is not one of the ones mentioned above. I'm doing the gossip refactor (though for context, the issue I'll describe also happens in the legacy gossip plugin), and I put debugging lines for the gossip scheduler's connect and disconnect calls, so I know of every scheduled connection and disconnection.
Diagnostics
What I learned is that the sudden disconnections are not orchestrated by the gossip scheduler; instead, they coincide with this error printed in the logs:
Error replicating with @QlCTpvY7p9ty2yOFrv1WU1AE88aoQc4Y7wYal7PFc+w=.ed25519:
22:09:04.924: Error: parent stream is closing
22:09:04.924: at PacketStream.stream (/data/data/se.manyver/files/nodejs-project/index.js:71839:28)
22:09:04.924: at Object.ws.remoteCall (/data/data/se.manyver/files/nodejs-project/index.js:31482:17)
22:09:04.924: at /data/data/se.manyver/files/nodejs-project/index.js:68284:15
22:09:04.924: at remoteCall (/data/data/se.manyver/files/nodejs-project/index.js:59275:19)
22:09:04.924: at EventEmitter.obj.(anonymous function) [as createHistoryStream] (/data/data/se.manyver/files/nodejs-project/index.js:59260:18)
22:09:04.924: at replicate (/data/data/se.manyver/files/nodejs-project/index.js:277:13)
22:09:04.924: at EventEmitter.fallback (/data/data/se.manyver/files/nodejs-project/index.js:337:7)
22:09:04.924: at Object.onceWrapper (events.js:273:13)
22:09:04.924: at EventEmitter.emit (events.js:182:13)
22:09:04.924: at /data/data/se.manyver/files/nodejs-project/index.js:517:15
So my hypothesis is that this error is breaking the RPC connection, and then the gossip plugin (upon receiving a new LAN multicast announcement) connects to the peer again. And repeat.
Sometimes the parent stream is closing error would not show in Manyverse; the logs would just say that the RPC with Patchwork was closed. So I suppose in those cases the parent stream is closing error happened on the Patchwork side, but I don't have those logs to verify this (how does one debug the background process for Patchwork, in production?).
So in summary:
- Alice experiences the parent stream is closing error and is responsible for breaking the RPC connection
- Bob does not experience the error, but notices the RPC stream got closed
- As soon as one of them receives the next broadcast-stream packet, they initiate a connection
And vice-versa, just swap the roles of Alice and Bob.
Additional info
Perhaps ssb-replicate should also handle the parent stream is closing error? I know ssb-replicate already handles some error types, but I currently understand very little of how either ssb-replicate or ssb-ebt works.
@staltz the "parent stream is closing" error is caused by the TCP stream ending, or at least the parent stream being directly closed. It can't happen from the inside. You won't see that error if there aren't any streams open at the time the muxrpc stream ends, though.
this error https://github.com/ssbc/ssb-replicate/blob/81ab9f6d6ad4009af88735d80ff0ace93a6dd54e/legacy.js#L26 sounds like it's the same thing as "parent stream is closing"
other things which can close connections: activity time out. (set timers: inactivity: -1)
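For anyone who wants to rule the inactivity timeout out while testing, the relevant config fragment would look roughly like this (a sketch; where the config lives depends on your setup, e.g. ~/.ssb/config or the options the server is started with):

```js
// Sketch of an ssb config fragment: setting the inactivity timer to -1
// disables the idle-connection timeout mentioned above.
module.exports = {
  timers: {
    inactivity: -1
  }
}
```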
@dominictarr
do these peers have stats? you should be able to see by looking at the output of gossip.peers
Hmm, no I don't think they do have stats (at least initially), because they are added to the peer table manually using gossip.add when the UDP broadcast comes in (rather than being persisted in gossip.json).
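(Roughly this path, as a paraphrased sketch of the LAN-announce handling, not the actual plugin code:)

```js
// Paraphrased sketch: when a UDP/LAN broadcast arrives, the announced peer is
// added to the in-memory gossip table with the 'local' source, with no stats yet.
// onLanAnnouncement is a hypothetical stand-in for the local discovery plugin's handler.
onLanAnnouncement(function (msg) {
  sbot.gossip.add({ host: msg.host, port: msg.port, key: msg.key }, 'local')
})
```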
What adds the stats? Will they get added after some time?
@mmckegg
What adds the stats?
I think the gossip plugin does here https://github.com/ssbc/ssb-gossip/blob/db2892c5b805812a67fb3878ec4d93988b427e86/index.js#L116-L125