Unable to resolve conflicting registration when the PID is remote
I'm seeing an issue that crops up when we deploy our code: conflicting registries between two nodes are never resolved. We're using the StaticQuorumRing distribution strategy, and our deployments drain a node, bring a new node up, then repeat for the next node. We use Swarm to make sure exactly one instance of each process is running in the cluster, so our registries look like this:
iex([email protected])8> Swarm.registered()
[
{{Commanded.Event.Handler, "profile_kv_repo"}, #PID<0.3132.0>},
{:profile_event_publisher, #PID<38995.2867.0>},
{{Commanded.Event.Handler, "profile_assessments"}, #PID<38994.2987.0>},
{{Commanded.Event.Handler, "event_publication_handler"}, #PID<0.3133.0>},
{{Commanded.Event.Handler, "repository_handler"}, #PID<38994.2986.0>}
]
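For reference, the Swarm configuration looks roughly like this (the quorum size shown here is illustrative, not our exact value):

```elixir
# config/config.exs
config :swarm,
  # StaticQuorumRing only allows process registration/distribution once at
  # least `static_quorum_size` nodes are connected, to guard against
  # split-brain during deploys and netsplits.
  distribution_strategy: Swarm.Distribution.StaticQuorumRing,
  static_quorum_size: 2
```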
Sometimes, two nodes seem to get stuck in an infinite loop of trying to stop and restart one of those processes:
Jun 21 08:17:08 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:07.673 [debug] | [[email protected]] Attempting to start distributed process: {Commanded.Event.Handler, "profile_kv_repo"} (Commanded.Event.Handler with args %Commanded.Event.Handler{consistency: :strong, handler_module: Profiles.EventHandler.ProfileKVRepoHandler, handler_name: "profile_kv_repo", last_seen_event: nil, subscribe_from: :current, subscription: nil})
Jun 21 08:17:08 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:07.674 [debug] | [swarm on [email protected]] [tracker:handle_call] registering {Commanded.Event.Handler, "profile_kv_repo"} as process started by Elixir.GenServer.start_link/2 with args [Commanded.Event.Handler, %Commanded.Event.Handler{consistency: :strong, handler_module: Profiles.EventHandler.ProfileKVRepoHandler, handler_name: "profile_kv_repo", last_seen_event: nil, subscribe_from: :current, subscription: nil}]
Jun 21 08:17:08 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:07.674 [debug] | [swarm on [email protected]] [tracker:do_track] starting {Commanded.Event.Handler, "profile_kv_repo"} on remote node [email protected]
Jun 21 08:17:08 api-cluster-074.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[15041]: 2018-06-21 13:17:07.675 [debug] | [swarm on [email protected]] [tracker:handle_call] :"[email protected]" is registering {Commanded.Event.Handler, "profile_kv_repo"} as process started by Elixir.GenServer.start_link/2 with args [Commanded.Event.Handler, %Commanded.Event.Handler{consistency: :strong, handler_module: Profiles.EventHandler.ProfileKVRepoHandler, handler_name: "profile_kv_repo", last_seen_event: nil, subscribe_from: :current, subscription: nil}]
Jun 21 08:17:08 api-cluster-074.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[15041]: 2018-06-21 13:17:07.675 [debug] | [swarm on [email protected]] [tracker:do_track] found {Commanded.Event.Handler, "profile_kv_repo"} already registered on [email protected]
Jun 21 08:17:08 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:07.679 [debug] | [swarm on [email protected]] [tracker:start_pid_remotely] {Commanded.Event.Handler, "profile_kv_repo"} already registered to #PID<38830.3839.0> on [email protected], registering locally
Jun 21 08:17:08 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:07.679 [debug] | [swarm on [email protected]] [tracker:register] conflicting registration for {Commanded.Event.Handler, "profile_kv_repo"}: remote (#PID<38830.3839.0>) vs. local #PID<38831.3124.0>
Jun 21 08:17:08 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:07.679 [debug] | [:"[email protected]"] Named process {Commanded.Event.Handler, "profile_kv_repo"} already started on :"[email protected]" (#PID<38831.3124.0>)
Jun 21 08:17:08 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:07.684 [debug] | [[email protected]] Named process {Commanded.Event.Handler, "profile_kv_repo"} down due to: :noconnection
Jun 21 08:17:09 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:08.337 [debug] | [swarm on [email protected]] [tracker:handle_call] untrack #PID<38831.3124.0>
Jun 21 08:17:09 api-cluster-074.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[15041]: 2018-06-21 13:17:08.337 [debug] | [swarm on [email protected]] [tracker:handle_replica_event] replica event: untrack #PID<38829.3124.0>
Jun 21 08:17:09 api-cluster-095.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[5860]: 2018-06-21 13:17:08.341 [debug] | [swarm on [email protected]] [tracker:handle_replica_event] replica event: untrack #PID<38992.3124.0>
Jun 21 08:17:09 api-cluster-095.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[5860]: 2018-06-21 13:17:08.423 [info] request_id=2kt2uj2q7l6e7oovss000dj1 | GET /monitor/ping
Jun 21 08:17:09 api-cluster-095.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[5860]: 2018-06-21 13:17:08.423 [debug] request_id=2kt2uj2q7l6e7oovss000dj1 | Processing with Profile.Web.UtilityController.monitor/2 Parameters: %{} Pipelines: []
Jun 21 08:17:09 api-cluster-095.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[5860]: 2018-06-21 13:17:08.423 [info] request_id=2kt2uj2q7l6e7oovss000dj1 | Sent 200 in 225µs
Jun 21 08:17:09 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:08.685 [debug] | [[email protected]] Attempting to start distributed process: {Commanded.Event.Handler, "profile_kv_repo"} (Commanded.Event.Handler with args %Commanded.Event.Handler{consistency: :strong, handler_module: Profiles.EventHandler.ProfileKVRepoHandler, handler_name: "profile_kv_repo", last_seen_event: nil, subscribe_from: :current, subscription: nil})
Jun 21 08:17:09 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:08.685 [debug] | [swarm on [email protected]] [tracker:handle_call] registering {Commanded.Event.Handler, "profile_kv_repo"} as process started by Elixir.GenServer.start_link/2 with args [Commanded.Event.Handler, %Commanded.Event.Handler{consistency: :strong, handler_module: Profiles.EventHandler.ProfileKVRepoHandler, handler_name: "profile_kv_repo", last_seen_event: nil, subscribe_from: :current, subscription: nil}]
Jun 21 08:17:09 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:08.685 [debug] | [swarm on [email protected]] [tracker:do_track] starting {Commanded.Event.Handler, "profile_kv_repo"} on remote node [email protected]
And the above goes on and on. Shelling into the nodes remotely, I was able to confirm that both nodes are connected and communicating, but one node has a registry like this:
iex([email protected])14> Swarm.registered()
[
{{Commanded.Event.Handler, "profile_kv_repo"}, #PID<38831.3124.0>},
{:profile_event_publisher, #PID<38994.2867.0>},
{{Commanded.Event.Handler, "profile_assessments"}, #PID<0.2987.0>},
{{Commanded.Event.Handler, "event_publication_handler"}, #PID<38830.3133.0>},
{{Commanded.Event.Handler, "repository_handler"}, #PID<0.2986.0>}
]
So the issue here is that the remote process recorded for profile_kv_repo no longer existed. But the code that handles these sorts of conflicts here does nothing when the conflicting pid is owned by a remote node, which I believe sets the tracker up to loop forever trying to start that particular process. Running Swarm.unregister_name({Commanded.Event.Handler, "profile_kv_repo"}) on the node with the stale entry resolved the issue.
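For anyone else hitting this, the manual cleanup was roughly the following, run from a remote console on the node whose registry held the dead pid (the liveness check is just how I verified the pid really was gone):

```elixir
name = {Commanded.Event.Handler, "profile_kv_repo"}

# Swarm still resolves the name to the dead remote pid
pid = Swarm.whereis_name(name)

# Ask the node the pid points at whether the process is still alive
:rpc.call(node(pid), Process, :alive?, [pid])
#=> false (or {:badrpc, :nodedown} if that node is already gone)

# Drop the stale registration so the handler can be started cleanly again
Swarm.unregister_name(name)
```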
So a couple of questions:
- How did it get into this state? Is there a deployment process we should follow to prevent it?
- Is there something that Swarm can do to resolve this issue? For example, when Swarm detects that a registered process is down, could it remove its registration for that pid and then attempt to restart it? (I've sketched the kind of cleanup I mean below.)
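For what it's worth, the application-side workaround I have in mind looks something like the sketch below. It only uses Swarm's public API; the module name, sweep interval, and lack of supervision wiring are all made up for illustration. It simply removes stale names so the normal startup/handoff path can recreate the process.

```elixir
defmodule MyApp.StaleNameReaper do
  @moduledoc """
  Periodically walks Swarm's registry and unregisters names whose pid is no
  longer alive, so the normal start-up path can re-register them.
  (Sketch only; module name and interval are illustrative.)
  """
  use GenServer

  @interval :timer.seconds(30)

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok) do
    schedule()
    {:ok, %{}}
  end

  @impl true
  def handle_info(:sweep, state) do
    for {name, pid} <- Swarm.registered(), dead?(pid) do
      Swarm.unregister_name(name)
    end

    schedule()
    {:noreply, state}
  end

  defp schedule, do: Process.send_after(self(), :sweep, @interval)

  # A pid is considered dead only if the node it points at is reachable and
  # reports the process as not alive. If the node is unreachable, leave the
  # entry alone and let Swarm's own nodedown handling deal with it.
  defp dead?(pid) do
    case :rpc.call(node(pid), Process, :alive?, [pid]) do
      false -> true
      _ -> false
    end
  end
end
```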