Expected callbacks not received after network heals
Hi,
Thanks for the great lib, awesome work!
We are building a system where we have to process a stream of messages from a remote system. To be more precise, the stream consists of update messages for specific resources.
So we are using Swarm to manage worker processes for these resources. We create a process when the first message for a resource arrives, and let it die from a GenServer timeout after a set interval.
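For context, here is a simplified sketch of how such a worker is started and registered. The module and function names here are illustrative, not our exact code (in reality registration goes through Cadex.Swarm.Supervisor.start_unit_processor/1, as the logs below show):

```elixir
defmodule ResourceWorker do
  use GenServer

  # Shut the worker down after this much time without a message (the "set interval").
  @idle_timeout :timer.minutes(5)

  # Called when a message for a resource arrives: register the worker under the
  # resource id (Swarm starts it on whichever node the Ring strategy picks),
  # or find the already-running one, then forward the update.
  def handle_message(resource_id, message) do
    {:ok, pid} =
      Swarm.whereis_or_register_name(resource_id, __MODULE__, :start_link, [resource_id])

    GenServer.cast(pid, {:update, message})
  end

  def start_link(resource_id) do
    GenServer.start_link(__MODULE__, resource_id)
  end

  @impl true
  def init(resource_id) do
    {:ok, %{resource_id: resource_id, last_update: nil}, @idle_timeout}
  end

  @impl true
  def handle_cast({:update, message}, state) do
    {:noreply, %{state | last_update: message}, @idle_timeout}
  end

  # No message within @idle_timeout -> the process stops itself.
  @impl true
  def handle_info(:timeout, state) do
    {:stop, :normal, state}
  end
end
```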
We are now in the process of testing the app for a network split situation, and it looks like something is a bit off, or at least, we are a bit confused :-)
First of all, we are using the Swarm.Distribution.Ring strategy, since we will be deploying on an OpenShift (Kubernetes) cluster.
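For reference, the relevant part of our config just selects the Ring strategy (assuming the standard distribution_strategy key; I believe Ring is also the default), something like:

```elixir
# config/config.exs — use the ring-based (consistent hash) distribution strategy
config :swarm,
  distribution_strategy: Swarm.Distribution.Ring
```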
We are testing this by simply connecting and disconnecting 2 iex nodes running our app with Node.connect/1 and Node.disconnect/1.
What happens is the following:
-> When the 2 nodes are connected, a new message for a new resource creates a process on one of the 2 nodes. Perfect.
-> When the 2 nodes are disconnected, all processes that existed before the disconnect are started (if they were not already running) on both nodes. Still makes sense.
-> When the 2 nodes are connected again, the processes keep running on one of the nodes, and the processes on the other node receive a {:swarm, :die}. Strange: we were actually expecting them to be relocated, since the Ring strategy uses a consistent hash?
While the nodes were disconnected, both processes could have processed messages, so they may end up in conflicting states. We have a way to resolve these conflicts, and we were counting on the :resolve_conflict message to inform us about this situation.
However, none of the :begin_handoff, :end_handoff, or :resolve_conflict messages are received on either node.
Our begin_handoff callback returns a :resume tuple, but since it is never called ...
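For completeness, this is roughly what the handoff/conflict handling inside the worker GenServer looks like, following the pattern from the Swarm README (simplified; merge_states/2 is a placeholder for our actual conflict-resolution logic):

```elixir
# Expected on the node a process is leaving; {:resume, state} asks Swarm to carry the state over.
def handle_call({:swarm, :begin_handoff}, _from, state) do
  {:reply, {:resume, state}, state}
end

# Expected on the receiving node, with the state returned from :begin_handoff.
def handle_cast({:swarm, :end_handoff, handed_off_state}, _state) do
  {:noreply, handed_off_state}
end

# Expected after a netsplit heals and both sides ended up running the process;
# merge_states/2 stands in for our conflict resolution.
def handle_cast({:swarm, :resolve_conflict, other_state}, state) do
  {:noreply, merge_states(state, other_state)}
end

# What we actually observe instead: Swarm tells the duplicate to shut down.
def handle_info({:swarm, :die}, state) do
  {:stop, :shutdown, state}
end
```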
Could you help us out?
Here are some debug logs of the disconnect/connect:
Node A "attila" => disconnect
iex([email protected])2> Node.disconnect :"[email protected]"
*DBG* 'Elixir.Swarm.Tracker' receive info {'DOWN',#Ref<0.2529274800.2416967684.136847>,process,<22512.560.0>,
noconnection} in state tracking
true
iex([email protected])3> *DBG* 'Elixir.Swarm.Tracker' consume info {'DOWN',#Ref<0.2529274800.2416967684.136847>,process,<22512.560.0>,
noconnection} in state tracking
*DBG* 'Elixir.Swarm.Tracker' receive info {nodedown,'[email protected]',[{node_type,visible}]} in state tracking
*DBG* 'Elixir.Swarm.Tracker' consume info {nodedown,'[email protected]',[{node_type,visible}]} in state tracking
[debug] [swarm on [email protected]] [tracker:handle_monitor] lost connection to "RESOURCE401" (#PID<22512.560.0>) on [email protected], node is down
[info] [swarm on [email protected]] [tracker:nodedown] nodedown [email protected]
[debug] [swarm on [email protected]] [tracker:handle_topology_change] topology change (nodedown for [email protected])
[debug] [swarm on [email protected]] [tracker:handle_topology_change] restarting "RESOURCE400" on [email protected]
[debug] [swarm on [email protected]] [tracker:do_track] starting "RESOURCE400" on [email protected]
[debug] [swarm on [email protected]] [tracker:do_track] started "RESOURCE400" on [email protected]
[debug] [swarm on [email protected]] [tracker:handle_topology_change] restarting "RESOURCE401" on [email protected]
[debug] [swarm on [email protected]] [tracker:do_track] starting "RESOURCE401" on [email protected]
[debug] [swarm on [email protected]] [tracker:do_track] started "RESOURCE401" on [email protected]
[info] [swarm on [email protected]] [tracker:handle_topology_change] topology change complete
Node B "mark" => received disconnect
*DBG* 'Elixir.Swarm.Tracker' receive info {'DOWN',#Ref<0.899263733.806354945.65248>,process,<24381.547.0>,
noconnection} in state tracking
*DBG* 'Elixir.Swarm.Tracker' consume info {'DOWN',#Ref<0.899263733.806354945.65248>,process,<24381.547.0>,
noconnection} in state tracking
*DBG* 'Elixir.Swarm.Tracker' receive info {nodedown,'[email protected]',[{node_type,visible}]} in state tracking
*DBG* 'Elixir.Swarm.Tracker' consume info {nodedown,'[email protected]',[{node_type,visible}]} in state tracking
[debug] [swarm on [email protected]] [tracker:handle_monitor] lost connection to "RESOURCE402" (#PID<24381.547.0>) on [email protected], node is down
[info] [swarm on [email protected]] [tracker:nodedown] nodedown [email protected]
[debug] [swarm on [email protected]] [tracker:handle_topology_change] topology change (nodedown for [email protected])
[debug] [swarm on [email protected]] [tracker:handle_topology_change] restarting "RESOURCE402" on [email protected]
[debug] [swarm on [email protected]] [tracker:do_track] starting "RESOURCE402" on [email protected]
[debug] [swarm on [email protected]] [tracker:do_track] started "RESOURCE402" on [email protected]
[info] [swarm on [email protected]] [tracker:handle_topology_change] topology change complete
... Some processing going on while nodes are disconnected ...
Node A "attila" => connect with Node B
iex([email protected])3> Node.connect :"[email protected]"
*DBG* 'Elixir.Swarm.Tracker' receive info {nodeup,'[email protected]',[{node_type,visible}]} in state tracking
true
iex([email protected])4> *DBG* 'Elixir.Swarm.Tracker' consume info {nodeup,'[email protected]',[{node_type,visible}]} in state tracking
*DBG* 'Elixir.Swarm.Tracker' receive cast {sync,<22512.358.0>,{1,0}} in state syncing
*DBG* 'Elixir.Swarm.Tracker' consume cast {sync,<22512.358.0>,{1,0}} in state syncing
[info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
*DBG* 'Elixir.Swarm.Tracker' receive cast {sync_recv,<22512.358.0>,
{{0,1},0},
[{entry,<<"RESOURCE402">>,<22512.568.0>,
#Ref<22512.899263733.806354946.64013>,
#{all_workers => true,
mfa =>
{'Elixir.Cadex.Swarm.Supervisor',start_unit_processor,
[<<"RESOURCE402">>]}},
{{0,{0,10,0}},{{1,0},{0,10,0}}}},
{entry,<<"RESOURCE401">>,<22512.560.0>,
#Ref<22512.899263733.806354949.65462>,
#{all_workers => true,
mfa =>
{'Elixir.Cadex.Swarm.Supervisor',start_unit_processor,
[<<"RESOURCE401">>]}},
{{0,{0,4,0}},{{1,0},{0,4,0}}}},
{entry,<<"RESOURCE400">>,<22512.557.0>,
#Ref<22512.899263733.806354949.64803>,
#{all_workers => true,
mfa =>
{'Elixir.Cadex.Swarm.Supervisor',start_unit_processor,
[<<"RESOURCE400">>]}},
{{0,{0,2,0}},{{1,0},{0,2,0}}}}]} in state syncing
"{:swarm, :die} RESOURCE402"
"{:swarm, :die} RESOURCE401"
"{:swarm, :die} RESOURCE400"
[info] [swarm on [email protected]] [tracker:cluster_wait] joining cluster..
[info] [swarm on [email protected]] [tracker:cluster_wait] found connected nodes: [:"[email protected]"]
[info] [swarm on [email protected]] [tracker:cluster_wait] selected sync node: [email protected]
[info] [swarm on [email protected]] [tracker:syncing] syncing from [email protected] based on node precedence
*DBG* 'Elixir.Swarm.Tracker' consume cast {sync_recv,<22512.358.0>,
{{0,1},0},
[{entry,<<"RESOURCE402">>,<22512.568.0>,
#Ref<22512.899263733.806354946.64013>,
#{all_workers => true,
mfa =>
{'Elixir.Cadex.Swarm.Supervisor',start_unit_processor,
[<<"RESOURCE402">>]}},
{{0,{0,10,0}},{{1,0},{0,10,0}}}},
{entry,<<"RESOURCE401">>,<22512.560.0>,
#Ref<22512.899263733.806354949.65462>,
#{all_workers => true,
mfa =>
{'Elixir.Cadex.Swarm.Supervisor',start_unit_processor,
[<<"RESOURCE401">>]}},
{{0,{0,4,0}},{{1,0},{0,4,0}}}},
{entry,<<"RESOURCE400">>,<22512.557.0>,
#Ref<22512.899263733.806354949.64803>,
#{all_workers => true,
mfa =>
{'Elixir.Cadex.Swarm.Supervisor',start_unit_processor,
[<<"RESOURCE400">>]}},
{{0,{0,2,0}},{{1,0},{0,2,0}}}}]} in state syncing
[info] [swarm on [email protected]] [tracker:syncing] received registry from [email protected], merging..
[info] [swarm on [email protected]] [tracker:syncing] local synchronization with [email protected] complete!
[info] [swarm on [email protected]] [tracker:resolve_pending_sync_requests] pending sync requests cleared
Node B "mark" => receive 2nd connect
*DBG* 'Elixir.Swarm.Tracker' receive info {nodeup,'[email protected]',[{node_type,visible}]} in state tracking
*DBG* 'Elixir.Swarm.Tracker' consume info {nodeup,'[email protected]',[{node_type,visible}]} in state tracking
*DBG* 'Elixir.Swarm.Tracker' receive cast {sync,<24381.349.0>,{1,0}} in state syncing
*DBG* 'Elixir.Swarm.Tracker' consume cast {sync,<24381.349.0>,{1,0}} in state syncing
[info] [swarm on [email protected]] [tracker:ensure_swarm_started_on_remote_node] nodeup [email protected]
*DBG* 'Elixir.Swarm.Tracker' receive info {event,<24381.349.0>,{{0,1},{1,1}},{untrack,<24381.547.0>}} in state awaiting_sync_ack
*DBG* 'Elixir.Swarm.Tracker' postpone info {event,<24381.349.0>,{{0,1},{1,1}},{untrack,<24381.547.0>}} in state awaiting_sync_ack
*DBG* 'Elixir.Swarm.Tracker' receive info {event,<24381.349.0>,{{0,2},{1,2}},{untrack,<24381.550.0>}} in state awaiting_sync_ack
*DBG* 'Elixir.Swarm.Tracker' postpone info {event,<24381.349.0>,{{0,2},{1,2}},{untrack,<24381.550.0>}} in state awaiting_sync_ack
*DBG* 'Elixir.Swarm.Tracker' receive info {event,<24381.349.0>,{{0,3},{1,3}},{untrack,<24381.549.0>}} in state awaiting_sync_ack
*DBG* 'Elixir.Swarm.Tracker' postpone info {event,<24381.349.0>,{{0,3},{1,3}},{untrack,<24381.549.0>}} in state awaiting_sync_ack
*DBG* 'Elixir.Swarm.Tracker' receive cast {sync_ack,<24381.349.0>,
{1,3},
[{entry,<<"RESOURCE402">>,<0.568.0>,
#Ref<24381.2529274800.2416967683.128726>,
#{all_workers => true,
mfa =>
{'Elixir.Cadex.Swarm.Supervisor',start_unit_processor,
[<<"RESOURCE402">>]}},
{{0,{0,10,0}},{{1,0},{0,10,0}}}},
{entry,<<"RESOURCE401">>,<0.560.0>,
#Ref<24381.2529274800.2416967683.128729>,
#{all_workers => true,
mfa =>
{'Elixir.Cadex.Swarm.Supervisor',start_unit_processor,
[<<"RESOURCE401">>]}},
{{0,{0,4,0}},{{1,0},{0,4,0}}}},
{entry,<<"RESOURCE400">>,<0.557.0>,
#Ref<24381.2529274800.2416967683.128731>,
#{all_workers => true,
mfa =>
{'Elixir.Cadex.Swarm.Supervisor',start_unit_processor,
[<<"RESOURCE400">>]}},
{{0,{0,2,0}},{{1,0},{0,2,0}}}}]} in state awaiting_sync_ack
[info] [swarm on [email protected]] [tracker:cluster_wait] joining cluster..
[info] [swarm on [email protected]] [tracker:cluster_wait] found connected nodes: [:"[email protected]"]
[info] [swarm on [email protected]] [tracker:cluster_wait] selected sync node: [email protected]
[info] [swarm on [email protected]] [tracker:syncing] syncing to [email protected] based on node precedence
*DBG* 'Elixir.Swarm.Tracker' consume cast {sync_ack,<24381.349.0>,
{1,3},
[{entry,<<"RESOURCE402">>,<0.568.0>,
#Ref<24381.2529274800.2416967683.128726>,
#{all_workers => true,
mfa =>
{'Elixir.Cadex.Swarm.Supervisor',start_unit_processor,
[<<"RESOURCE402">>]}},
{{0,{0,10,0}},{{1,0},{0,10,0}}}},
{entry,<<"RESOURCE401">>,<0.560.0>,
#Ref<24381.2529274800.2416967683.128729>,
#{all_workers => true,
mfa =>
{'Elixir.Cadex.Swarm.Supervisor',start_unit_processor,
[<<"RESOURCE401">>]}},
{{0,{0,4,0}},{{1,0},{0,4,0}}}},
{entry,<<"RESOURCE400">>,<0.557.0>,
#Ref<24381.2529274800.2416967683.128731>,
#{all_workers => true,
mfa =>
{'Elixir.Cadex.Swarm.Supervisor',start_unit_processor,
[<<"RESOURCE400">>]}},
{{0,{0,2,0}},{{1,0},{0,2,0}}}}]} in state awaiting_sync_ack
*DBG* 'Elixir.Swarm.Tracker' consume info {event,<24381.349.0>,{{0,1},{1,1}},{untrack,<24381.547.0>}} in state tracking
*DBG* 'Elixir.Swarm.Tracker' consume info {event,<24381.349.0>,{{0,2},{1,2}},{untrack,<24381.550.0>}} in state tracking
*DBG* 'Elixir.Swarm.Tracker' consume info {event,<24381.349.0>,{{0,3},{1,3}},{untrack,<24381.549.0>}} in state tracking
[info] [swarm on [email protected]] [tracker:awaiting_sync_ack] received sync acknowledgement from [email protected], syncing with remote registry
[info] [swarm on [email protected]] [tracker:awaiting_sync_ack] local synchronization with [email protected] complete!
[info] [swarm on [email protected]] [tracker:resolve_pending_sync_requests] pending sync requests cleared
[debug] [swarm on [email protected]] [tracker:handle_replica_event] replica event: untrack #PID<24381.547.0>
[debug] [swarm on [email protected]] [tracker:handle_replica_event] replica event: untrack #PID<24381.550.0>
[debug] [swarm on [email protected]] [tracker:handle_replica_event] replica event: untrack #PID<24381.549.0>
@markmeeus did you ever resolve this? And if so, what was the solution?
I've noticed the same problem when running a cluster of 2 nodes; when I run 3 nodes, the :begin_handoff and :resolve_conflict callbacks are called as expected.
Same here: the processes are untracked but are never restarted on the other node for some reason, and none of the events are called. It seems from some other issues that you may need to handle graceful shutdowns manually (https://github.com/bitwalker/swarm/pull/83), but they're only talking about "handoffs", even though I just want a restart... So it's still confusing.