swarm Deadlock on simultaneous nodeup

I'm having an issue similar to #60. It reproduces very often when I bring up containers with the app at roughly the same time. It looks like each node is waiting for another one, and they're perpetually stuck in the :syncing state. The :sys.get_status(Swarm.Tracker) results from my 5 nodes are at https://pastebin.com/EYLg6YNE (no custom options set, everything is default; clustering uses the libcluster gossip strategy).

Jul 12 '18 01:07 kzemek

Please see https://github.com/kzemek/swarm-deadlock-repro for reliable reproduction of the issue.

Jul 12 '18 11:07 kzemek

The logs produced with debug: true are at https://gist.github.com/kzemek/f5067111dcc5b1c40803b12966cd1a1f#file-gistfile1-txt (there are no more debug logs after that point).

Jul 12 '18 12:07 kzemek

I've also tried manipulating the choice of sync node in hopes that it would solve the lock: https://github.com/kzemek/swarm/commit/28516d93413fa41a54281ee0c3bb0f7a92a4058e

But instead, the states of the Swarm.Tracker processes got stranger: https://gist.github.com/kzemek/f5067111dcc5b1c40803b12966cd1a1f#file-nodes_sync_to_smallest-txt

All nodes tried to sync to repro_2 (the "smallest" node), except repro_2 itself, which synced to repro_3. repro_3 synced successfully and went into the :tracking state, while at the same time repro_2 went into :awaiting_sync_ack and sent the cast {sync_recv,<16250.182.0>,{{0,1},0},[]} to repro_3. But the sync_recv cast is not handled in the :tracking state, so repro_2 got stuck, and so did the other nodes that tried to sync to it.

Jul 12 '18 12:07 kzemek

This particular issue does not occur after reverting to commit c305633 (which predates https://github.com/bitwalker/swarm/commit/412bad990c69748dac300bd69e6a26b988e71b0). The nodes all go into the :tracking state almost instantly.

Jul 12 '18 13:07 kzemek

Seeing this issue as well. When I revert to version 3.1 I don't see any problems with deadlocking on startup.

Oct 19 '18 03:10 joxford531

We've been having this issue as well, and I'm pretty sure we also had it in 3.3.1.

In our case we observed the following scenario. Let's say we have nodes A, B and C, and the following happens:

A - :sync -> B
B - :sync -> C
C - :sync -> A

All nodes are now in the :syncing state, each waiting for a :sync_recv message.
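
To make the cycle concrete, here is a minimal Python sketch of the wait-for graph described above. It is not Swarm code; the function name `find_wait_cycle` and the `waits_for` mapping are made up for illustration, and it assumes each node waits on at most one sync target.

```python
def find_wait_cycle(waits_for, start):
    """Follow 'is waiting on' edges from start.

    Returns the list of nodes forming a cycle, or None if the chain
    ends at a node that is not waiting on anyone.
    """
    path = []   # nodes visited so far, in order
    seen = {}   # node -> index in path
    node = start
    while node is not None and node not in seen:
        seen[node] = len(path)
        path.append(node)
        node = waits_for.get(node)  # None if this node is not waiting
    if node is None:
        return None
    return path[seen[node]:]


# Hypothetical snapshot: every node sent :sync and is waiting for :sync_recv.
waits_for = {"A": "B", "B": "C", "C": "A"}
print(find_wait_cycle(waits_for, "A"))  # ['A', 'B', 'C']
```

Since every node in the cycle is waiting on another node in the same cycle, none of them can make progress without outside intervention.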

So far we have resolved this with a state timeout in :syncing, which stops the syncing and tries another node. It seems to work fine; however, this approach introduced a few complications and made things a bit more complex. A simpler approach could be to drop the pending_sync_request strategy and just decline the sync request while syncing.
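
The timeout idea can be sketched roughly as follows. This is a Python illustration rather than the actual patch to Swarm; `sync_with_timeout`, `request_sync`, the candidate list and the timeout value are all hypothetical names and values chosen for the example.

```python
import asyncio


async def sync_with_timeout(candidates, request_sync, timeout):
    """Try each candidate in order; abandon one that does not reply in time."""
    for node in candidates:
        try:
            reply = await asyncio.wait_for(request_sync(node), timeout)
            return node, reply
        except asyncio.TimeoutError:
            continue  # no reply in time: stop waiting and try the next node
    return None, None


async def demo():
    async def request_sync(node):
        # Hypothetical peer behaviour: "B" is itself stuck syncing
        # and never replies; any other node answers immediately.
        if node == "B":
            await asyncio.Event().wait()
        return "registry"

    return await sync_with_timeout(["B", "C"], request_sync, timeout=0.05)


print(asyncio.run(demo()))  # ('C', 'registry')
```

The sketch only shows a single node escaping a peer that never answers; it does not model what happens when several nodes time out and retry at the same moment, which may be where the extra complexity mentioned above comes from.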

Nov 29 '18 16:11 malmovich