Attempt reconnect via the FB channel on init sync stall

poltak opened this issue 5 years ago • 4 comments

TODOs:

  • [ ] get re-connection reproducible in test setup
    • current implementation does not work!
  • [ ] write test for case: "One side detects the stall, the other one doesn't for whatever reason"
    • [ ] implement timeout for each reconnect attempt
  • [ ] write test for case: "One side detects the stall, the other one does so later"
    • [ ] implement message queue for signal channel
  • [ ] write test for case: "Both sides detect the stall and both want to initiate the signalling"
    • [ ] figure out if defined behaviours for sender and receiver roles covers this
  • [ ] rename the callback passed in to set up the fast sync signal channel

Discussion points:

  1. I had to update the init sync to accept an optional callback for doing post-instantiation setup on the FB channel, so that in the integration tests I could set custom stall timeouts, etc. This is outlined in 61ac5f0. I personally feel this is less elegant than it could be; I think you might have had a better solution in mind when we discussed this yesterday, which I wasn't able to work out.
  2. Currently this only attempts to reconnect once, when a stall is encountered. The reconnect attempt can fail.
    1. Do we want to be able to try and do it multiple times?
    2. I added a new reconnected event in 51e3582 that is fired from the FB channel after stalled occurs and the re-connection is successful. In the case it's not successful, it fires stalled a second time. This is just a suggestion. Do you have any preferences for how we handle this?
  3. I just realized that the receiveMessage method does not exist on SignalChannel. It's commented out in the interface definition, and the implementations in simple-signalling do not seem to have it (https://github.com/WorldBrain/simple-signalling/blob/master/ts/firebase.ts#L39). From what I understand, this stops me from actually running my test and means the reconnection won't actually work. Did you already have plans for this?

We'll also need to update at least the mobile init sync flow, which currently listens and reacts to the stalled event. Assuming the current flow described in 2.ii, we'll need to at least update it to not move to the stall error screen unless stalled is emitted a second time. I'm not sure how the ext's sync handles stalls, but that may need to be updated too.
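
Something like this rough sketch is what I have in mind for the mobile side (channelEvents and showStallErrorScreen are just stand-ins for illustration, not the actual mobile app APIs):

```ts
import { EventEmitter } from 'events'

// Stand-ins for the real mobile pieces; the actual channel event API and
// UI hook may look different.
declare const channelEvents: EventEmitter
declare function showStallErrorScreen(): void

let stallCount = 0

channelEvents.on('stalled', () => {
    stallCount += 1
    // Only the second 'stalled' (i.e. the reconnect attempt also failed)
    // should move the UI to the stall error screen.
    if (stallCount >= 2) {
        showStallErrorScreen()
    }
})

channelEvents.on('reconnected', () => {
    // A successful reconnect resets the count.
    stallCount = 0
})
```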

poltak avatar Apr 07 '20 06:04 poltak

I'd say the callback is no problem, but I'd like to see a more descriptive name there, like onFastSyncChannelCreated.
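
Roughly something like this; everything here except the name onFastSyncChannelCreated is an illustrative stand-in rather than the actual storex-sync API:

```ts
// Sketch only: the types and factory below are hypothetical stand-ins.
interface FastSyncChannel {
    // shape of the real channel omitted
}

declare function createFastSyncChannel(): Promise<FastSyncChannel>

interface InitialSyncOptions {
    // The renamed hook, called right after the channel is instantiated so
    // tests can do post-instantiation setup (e.g. custom stall timeouts).
    onFastSyncChannelCreated?: (channel: FastSyncChannel) => void | Promise<void>
}

async function setupInitialSync(options: InitialSyncOptions) {
    const channel = await createFastSyncChannel()
    await options.onFastSyncChannelCreated?.(channel)
    // ... continue with the rest of the init sync flow using `channel`
}
```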

Yes, we do want to retry the connection multiple times, maybe capping it at 3 attempts?

With regards to the events, I'd do the following (rough sketch after the list):

  • When we detect the connection is stalled, we fire an event reconnect with { attempt: number }
  • On each subsequent attempt, we fire this event again with the attempt number increased
  • On successful reconnect, we fire reconnected
  • After the retry limit is exceeded, we fire stalled
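
Something like this, assuming an EventEmitter-style emitter on the channel; tryReconnect stands in for whatever actually re-establishes the connection:

```ts
import { EventEmitter } from 'events'

const MAX_RECONNECT_ATTEMPTS = 3

// Stand-ins: `events` for the channel's emitter, `tryReconnect` for the
// real reconnect logic, which resolves to true on success.
declare const events: EventEmitter
declare function tryReconnect(): Promise<boolean>

async function handleStallDetected() {
    for (let attempt = 1; attempt <= MAX_RECONNECT_ATTEMPTS; attempt++) {
        events.emit('reconnect', { attempt })
        if (await tryReconnect()) {
            events.emit('reconnected')
            return
        }
    }
    // All attempts failed: only now report a stall.
    events.emit('stalled')
}
```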

We need to decide whether we want to display this in the UI.

Indeed, the receiveMessage method was removed from SignalChannel in favor of listening to events. This prevents the class from having to buffer events internally; we can instead do that externally if we want.
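
For example, the buffering could live outside the channel along these lines; the 'message' event name and emitter shape are assumptions for illustration, not the simple-signalling API:

```ts
import { EventEmitter } from 'events'

// External buffer for channel messages; not part of simple-signalling,
// just an illustration of buffering outside the SignalChannel class.
class MessageBuffer {
    private queue: string[] = []

    constructor(channelEvents: EventEmitter) {
        channelEvents.on('message', (message: string) => {
            this.queue.push(message)
        })
    }

    // Hand over everything buffered so far and clear the buffer.
    drain(): string[] {
        return this.queue.splice(0, this.queue.length)
    }
}
```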

One bigger problem I see here: if I see it correctly, there might be a race condition. If peer A detects a stall sooner than peer B, peer A will initiate reconnecting, but peer B won't be listening for messages yet. Am I correct? Off the top of my head, we need to account for a few scenarios here:

  • One side detects the stall, the other one doesn't for whatever reason
  • One side detects the stall, the other one does so later
  • Both sides detect the stall and both want to initiate the signalling

ShishKabab avatar Apr 07 '20 10:04 ShishKabab

Also, am I right that the signalling channel gets released after the first connection is set up? This should now only happen after the whole sync is complete.

ShishKabab avatar Apr 07 '20 10:04 ShishKabab

if I see it correctly, there might be a race condition. If peer A detects a stall sooner than peer B, peer A will initiate reconnecting, but peer B won't be listening for messages yet. Am I correct?

I was going by the (perhaps overly confident) assumptions that if one side stalled, the other side would always eventually stall, and that there would be some kind of message queue on the signalling channels. Neither assumption turned out to be right.

RE each of those scenarios:

  1. One side detects the stall, the other one doesn't for whatever reason

This could be solved with a timeout on each reconnect attempt.
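
For example, something along these lines, where attemptReconnect is a hypothetical stand-in for the real reconnect logic:

```ts
// Race a single reconnect attempt against a timer so a peer that never
// answers doesn't block us forever.
declare function attemptReconnect(): Promise<boolean>

async function reconnectWithTimeout(timeoutMs: number): Promise<boolean> {
    const timedOut = new Promise<boolean>(resolve =>
        setTimeout(() => resolve(false), timeoutMs),
    )
    // A timeout counts as a failed attempt; the caller can then retry or
    // give up, per the retry scheme above.
    return Promise.race([attemptReconnect(), timedOut])
}
```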

  2. One side detects the stall, the other one does so later

This could be solved with some event queue.
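
Roughly something like this; purely illustrative, not the actual simple-signalling API:

```ts
type SignalHandler = (signal: string) => void

// Holds signals that arrive before anyone is listening, and replays them
// once a handler is registered.
class SignalQueue {
    private pending: string[] = []
    private handler?: SignalHandler

    push(signal: string) {
        if (this.handler) {
            this.handler(signal)
        } else {
            this.pending.push(signal)
        }
    }

    setHandler(handler: SignalHandler) {
        this.handler = handler
        for (const signal of this.pending.splice(0)) {
            handler(signal)
        }
    }
}
```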

  3. Both sides detect the stall and both want to initiate the signalling

I need to think about this some more, though I feel it will be clearer once I've updated how the reconnection attempts work and the events that get emitted.

Also, am I right that the signalling channel gets released after the first connection is set up? This should now only happen after the whole sync is complete.

I see, so the channel should remain open throughout the entire sync so reconnects can be attempted over it. Makes sense. Cheers

poltak avatar Apr 07 '20 12:04 poltak

I am having a fair amount of trouble making my test work. Every time I set it up to get into a stall and attempt a reconnect, it exceeds the 2000ms timeout for tests. After putting a few console.logs around the place, I got to the point where I thought it was because one side isn't yet ready to listen for the reconnect signal when it gets sent from the other side. I thought this would be solved by implementing some kind of buffer to put all incoming signals on, and I did an overly complicated test implementation of that in 810feff, although the two sides still did not manage to communicate. Will think about more possible solutions to try.

I feel like things are getting to the point where adding these extra pieces in, like the buffer, is really complicated, and I'm feeling fairly demoralized after spending 3 days now on this task, which I originally expected to take maybe half of my Monday, and I am still not seeing the end in sight. @ShishKabab what do you think about doing a bit of a read-through and review of the work up until this point? I feel like I may have gone in the wrong direction at some point or am missing something obvious, and it would help to have another brain critique the progress up until now. If you do have a look, I would recommend starting from 4af8dd1, as the latest commit (810feff) was more of an experiment.

poltak avatar Apr 08 '20 12:04 poltak