ssb-server
local peers get stuck in reconnect loops (sometimes)
As identified here with @dominictarr on the LAN:

My theory is that this is a race condition where both peers are attempting to connect to each other at the same time, and then discover that they are already connected, so disconnect. Or something like that.
One idea to fix this would be to retry connecting after a slightly randomized delay.
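Something like this, as a rough sketch (the reconnect helper and the base delay are made up for illustration, not existing APIs):

```js
// Rough sketch: wait a base delay plus random jitter before reconnecting,
// so two peers that drop the connection at the same moment are unlikely
// to race each other on the reconnect as well.
function scheduleReconnect (peer, baseDelay) {
  var jitter = Math.random() * baseDelay // up to 100% extra delay
  setTimeout(function () {
    reconnect(peer) // hypothetical helper that re-dials the peer
  }, baseDelay + jitter)
}
```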
I've seen this before.
The whole gossip schedule could use a good rewrite.
this might also be related to EBT interacting with createHistoryStream replication. /cc @pietgeursen
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@mmckegg Have you noticed this in a recent version of Scuttlebot? I remember that I used to see it all the time, but not anymore.
Hmm, not sure. Will try this out at art~hack tomorrow with a few peers.
Seems to be caused by local peers being misidentified as legacy
and then disconnected from here:
https://github.com/ssbc/ssb-gossip/blob/5452a2625befad1ab7ae5e3e047df2f27f7b02fa/schedule.js#L220-L225
The code should be updated to something like this:
//*---------------- add isLocal exclusion-------------vvvvvv
connect(peers, ts, 'legacy', and(exports.isLegacy, not(exports.isLocal)), {
  quota: 3 - longterm,
  factor: 5*min, max: 3*hour, groupMin: 5*min,
})
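For context, and and not above are the small predicate combinators from schedule.js; paraphrased, they compose peer filters roughly like this (a sketch, not the exact source):

```js
// Paraphrased sketch of the predicate combinators used in schedule.js.
function not (fn) {
  return function (peer) { return !fn(peer) }
}

function and () {
  var predicates = [].slice.call(arguments)
  return function (peer) {
    return predicates.every(function (fn) { return fn(peer) })
  }
}

// So the filter above selects peers that look legacy *and* are not on the local network.
```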
Though not sure why peers are getting identified as legacy, strange!
legacy peers are identified by whether they support gossip.ping
(which is used to calculate the roundtrip time and clock skew)
isLegacy checks if they have a duration set (created when a connection closes) and if they are not a "long term" peer https://github.com/ssbc/ssb-gossip/blob/master/schedule.js#L78
isLongterm means they support gossip.ping.
This calculates clock skew and round trip time,
but also keeps the TCP connection alive.
https://github.com/ssbc/ssb-gossip/blob/master/schedule.js#L71
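Paraphrasing the two predicates being discussed, they look roughly like this (a sketch of the linked schedule.js code, not the exact source):

```js
// A peer is "longterm" if gossip.ping has succeeded, i.e. we have round trip stats for it.
exports.isLongterm = function (peer) {
  return peer.ping && peer.ping.rtt && peer.ping.rtt.mean > 0
}

// A peer is "legacy" if we have connection stats for it (a duration recorded when a
// connection closed) but it never answered gossip.ping.
exports.isLegacy = function (peer) {
  return hasStats(peer) && !exports.isLongterm(peer) // hasStats: hypothetical name for the duration check
}
```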
do these peers have stats? you should be able to see by looking at the output of gossip.peers
I have a new report for this issue.
Background
I've just noticed this issue (or a variant of it) when LAN-connecting Patchwork 3.14.1 with Manyverse (my latest dev branch), but my hypothesis is not one of the ones mentioned above. I'm doing the gossip refactor (though for context, the issue I'll describe also happens in the legacy gossip plugin), and I put debugging lines for the gossip scheduler's connect and disconnect calls, so I know of every scheduled connection and disconnection.
Diagnostics
What I learned is that the sudden disconnections are not orchestrated by the gossip scheduler; instead, they coincide with this error printed in the logs:
Error replicating with @QlCTpvY7p9ty2yOFrv1WU1AE88aoQc4Y7wYal7PFc+w=.ed25519:
22:09:04.924: Error: parent stream is closing
22:09:04.924: at PacketStream.stream (/data/data/se.manyver/files/nodejs-project/index.js:71839:28)
22:09:04.924: at Object.ws.remoteCall (/data/data/se.manyver/files/nodejs-project/index.js:31482:17)
22:09:04.924: at /data/data/se.manyver/files/nodejs-project/index.js:68284:15
22:09:04.924: at remoteCall (/data/data/se.manyver/files/nodejs-project/index.js:59275:19)
22:09:04.924: at EventEmitter.obj.(anonymous function) [as createHistoryStream] (/data/data/se.manyver/files/nodejs-project/index.js:59260:18)
22:09:04.924: at replicate (/data/data/se.manyver/files/nodejs-project/index.js:277:13)
22:09:04.924: at EventEmitter.fallback (/data/data/se.manyver/files/nodejs-project/index.js:337:7)
22:09:04.924: at Object.onceWrapper (events.js:273:13)
22:09:04.924: at EventEmitter.emit (events.js:182:13)
22:09:04.924: at /data/data/se.manyver/files/nodejs-project/index.js:517:15
So my hypothesis is that this error is breaking the RPC connection, and then the gossip plugin (upon receiving a new LAN multicast announcement) connects to the peer again. And repeat.
Sometimes the parent stream is closing error would not show in Manyverse; the logs would just say that the RPC with Patchwork was closed. So I suppose in those cases the parent stream is closing error happened on the Patchwork side, but I don't have those logs to verify this (how does one debug the background process for Patchwork, in production?).
So in summary:
- Alice experiences the parent stream is closing error and is responsible for breaking the RPC connection
- Bob does not experience the error, but notices the RPC stream got closed
- As soon as one of them receives the next broadcast-stream packet, they initiate a connection
And vice-versa, just swap the roles of Alice and Bob.
Additional info
Perhaps ssb-replicate should also handle the parent stream is closing error? I know ssb-replicate already handles some error types, but I currently understand very little of how either ssb-replicate or ssb-ebt works.
@staltz the "parent stream is closing" error is caused by the TCP stream ending, or at least the parent stream being directly closed. It can't happen from the inside. You won't see that error if there aren't any streams open at the time the muxrpc stream ends, though.
this error https://github.com/ssbc/ssb-replicate/blob/81ab9f6d6ad4009af88735d80ff0ace93a6dd54e/legacy.js#L26 sounds like it's the same thing as "parent stream is closing"
other things which can close connections: activity time out. (set timers: inactivity: -1)
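For anyone who wants to rule the inactivity timeout out while testing, the relevant config fragment would look roughly like this (a sketch; where the config lives depends on your setup, e.g. ~/.ssb/config or the options the server is started with):

```js
// Sketch of an ssb config fragment: setting the inactivity timer to -1
// disables the idle-connection timeout mentioned above.
module.exports = {
  timers: {
    inactivity: -1
  }
}
```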
@dominictarr
do these peers have stats? you should be able to see by looking at the output of gossip.peers
Hmm, no I don't think they do have stats (at least initially), because they are added to the peer table manually using gossip.add when the UDP broadcast comes in (rather than being persisted in gossip.json).
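(Roughly this path, as a paraphrased sketch of the LAN-announce handling, not the actual plugin code:)

```js
// Paraphrased sketch: when a UDP/LAN broadcast arrives, the announced peer is
// added to the in-memory gossip table with the 'local' source, with no stats yet.
// onLanAnnouncement is a hypothetical stand-in for the local discovery plugin's handler.
onLanAnnouncement(function (msg) {
  sbot.gossip.add({ host: msg.host, port: msg.port, key: msg.key }, 'local')
})
```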
What adds the stats? Will they get added after some time?
@mmckegg
What adds the stats?
I think the gossip plugin does here https://github.com/ssbc/ssb-gossip/blob/db2892c5b805812a67fb3878ec4d93988b427e86/index.js#L116-L125