automerge-classic icon indicating copy to clipboard operation
automerge-classic copied to clipboard

Integration tests for sync protocol

Open HerbCaudill opened this issue 3 years ago • 9 comments

In order to understand the sync protocol I made some more tests, using the protocol as an application might, rather than exercising its internals.

All of these tests pass except for the last one, which involves three peers, each making concurrent changes while disconnected. I get into a loop in which two of the peers keep sending each other their version without converging. (see https://github.com/automerge/automerge/pull/339#issuecomment-832681772 ) I'm not sure if the problem is with my helper code or with the sync protocol itself.

HerbCaudill avatar May 05 '21 15:05 HerbCaudill

At Daniel Weaver's suggestion I've removed the try/catches that swallow Automerge's "outdated doc" error, on the hunch that they are symptoms of the issue I'm having; so now a couple more of the 3-peer tests fail.

HerbCaudill avatar May 05 '21 16:05 HerbCaudill

I have started working through this, but I must admit I find it quite hard to follow what is going on with all the EventEmitters, as they result in very indirect control flow. I wonder if it might be easier to understand the code if one peer sending a message just directly calls a function on the peer that is receiving.

ept avatar May 05 '21 21:05 ept

I should note that I have had >2 peers sync in my demo code, though I haven't checked it in a while.

On Wed, May 5, 2021 at 3:34 PM Speros Kokenes @.***> wrote:

@ept https://github.com/ept playing devil's advocate here, I agree async code can be confusing but it does help the community to see examples of how they would actually use the syncing protocol. The synchronous tests to date don't actually reflect how anyone would use the library and make it a little confusing to extrapolate from.

That being said, maybe that doesn't matter for a simple test where you are just trying to confirm that 3 peers can sync

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/automerge/automerge/pull/352#issuecomment-833093095, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAAWQALHXSLWBZCOAGTYKTTMHBYFANCNFSM44FIWS2A .

-- Peter van Hardenberg "Everything was beautiful, and nothing hurt."—Kurt Vonnegut

pvh avatar May 05 '21 23:05 pvh

Not 100% sure what the issue is but I moved some stuff around in https://github.com/automerge/automerge/pull/353 but kept the tests the same and they pass.. Sorry linter is failing. Feel free to take anything that’s useful and close it

I think it has to do with this bit here: https://github.com/automerge/automerge/pull/352/files#diff-7095874604758fffdc7bf502d9864c58e56a5a49730e50e6bcafa595409db0e7R182-R191

Following that code path it generates a new message instead of receiving it first?

dan-weaver avatar May 06 '21 00:05 dan-weaver

I had a go at an alternative implementation, which seems to work fine with three peers all connected to each other: https://gist.github.com/ept/fe4f772bcc9294c953728b2407f46c77

My version does not deliver messages immediately, but rather enqueues them (there is a separate queue for every pair of peers), and then delivers one message per time step. At the moment it just picks the first message in the first nonempty queue to deliver, but you could also imagine a different scheduler (e.g. round robin or randomised). I think this is a better simulation of how a real network behaves. In your implementation you get deep call stacks as A sends a message to B and then B sends a message back to A and then A sends another message to B... all synchronously. That makes it really hard to reason about what is happening when.

ept avatar May 06 '21 10:05 ept

Thanks everyone. I've incorporated @dan-weaver 's structure from #353, which simplifies things a lot by having just one document that gets updated, rather than separately tracking one document per connection.

Martin @ept is also right that having things run synchronously doesn't match reality and is hard to reason about. I've added a 1-ms timeout before the channel emits the message, which yields so other things can happen before anyone reacts to the message. With that change everything works with as many as 5 peers.

Since this design doesn't give me a way to tell when all peers have converged, I put in a hack that just pauses for a sufficient (on my box!) delay before checking for convergence. I'll adapt Martin's queue design to do that correctly - just wanted to call off the hounds in case anyone was still looking into this (@pvh plz go back offline! 😀)

HerbCaudill avatar May 06 '21 12:05 HerbCaudill

OK I've adapted @ept's Network and Peer helper classes and everything works now without any hacks. I've tested this with up to 7 peers connected in various configurations.

HerbCaudill avatar May 07 '21 11:05 HerbCaudill

Does this PR complement https://github.com/automerge/automerge/pull/353? I am planning to use the integration tests to better understand what the sync protocol is really.

If this PR has other coverage than https://github.com/automerge/automerge/pull/353 it would be good to get it merged.

echarles avatar Jun 14 '21 06:06 echarles

#353 was just @dan-weaver showing me how to get my tests working — he closed his PR after I incorporated his changes into this one.

I wrote this just to understand the sync protocol myself, but I do think it would be useful to include something like this in the codebase, as it exercises the protocol at a higher level than the existing tests. I also think it's useful as documentation, as a working example of the sync process.

HerbCaudill avatar Jun 15 '21 11:06 HerbCaudill