Unable to resolve conflicting registration when the PID is remote
I'm seeing an issue that crops up when we deploy our code: conflicting registries between two nodes are never resolved. We're using the StaticQuorumRing distribution strategy, and our deployments drain a node, bring a new node up, then repeat for the next node. We use Swarm to make sure exactly one instance of each process is running in the cluster, so our registries look like this:
iex([email protected])8> Swarm.registered()
[
{{Commanded.Event.Handler, "profile_kv_repo"}, #PID<0.3132.0>},
{:profile_event_publisher, #PID<38995.2867.0>},
{{Commanded.Event.Handler, "profile_assessments"}, #PID<38994.2987.0>},
{{Commanded.Event.Handler, "event_publication_handler"}, #PID<0.3133.0>},
{{Commanded.Event.Handler, "repository_handler"}, #PID<38994.2986.0>}
]
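For reference, the Swarm configuration looks roughly like this (the quorum size shown here is illustrative, not our exact value):

```elixir
# config/config.exs
config :swarm,
  # StaticQuorumRing only allows process registration/distribution once at
  # least `static_quorum_size` nodes are connected, to guard against
  # split-brain during deploys and netsplits.
  distribution_strategy: Swarm.Distribution.StaticQuorumRing,
  static_quorum_size: 2
```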
Sometimes, two nodes seem to get stuck in an infinite loop of trying to stop and restart one of those processes:
Jun 21 08:17:08 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:07.673 [debug] | [[email protected]] Attempting to start distributed process: {Commanded.Event.Handler, "profile_kv_repo"} (Commanded.Event.Handler with args %Commanded.Event.Handler{consistency: :strong, handler_module: Profiles.EventHandler.ProfileKVRepoHandler, handler_name: "profile_kv_repo", last_seen_event: nil, subscribe_from: :current, subscription: nil})
Jun 21 08:17:08 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:07.674 [debug] | [swarm on [email protected]] [tracker:handle_call] registering {Commanded.Event.Handler, "profile_kv_repo"} as process started by Elixir.GenServer.start_link/2 with args [Commanded.Event.Handler, %Commanded.Event.Handler{consistency: :strong, handler_module: Profiles.EventHandler.ProfileKVRepoHandler, handler_name: "profile_kv_repo", last_seen_event: nil, subscribe_from: :current, subscription: nil}]
Jun 21 08:17:08 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:07.674 [debug] | [swarm on [email protected]] [tracker:do_track] starting {Commanded.Event.Handler, "profile_kv_repo"} on remote node [email protected]
Jun 21 08:17:08 api-cluster-074.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[15041]: 2018-06-21 13:17:07.675 [debug] | [swarm on [email protected]] [tracker:handle_call] :"[email protected]" is registering {Commanded.Event.Handler, "profile_kv_repo"} as process started by Elixir.GenServer.start_link/2 with args [Commanded.Event.Handler, %Commanded.Event.Handler{consistency: :strong, handler_module: Profiles.EventHandler.ProfileKVRepoHandler, handler_name: "profile_kv_repo", last_seen_event: nil, subscribe_from: :current, subscription: nil}]
Jun 21 08:17:08 api-cluster-074.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[15041]: 2018-06-21 13:17:07.675 [debug] | [swarm on [email protected]] [tracker:do_track] found {Commanded.Event.Handler, "profile_kv_repo"} already registered on [email protected]
Jun 21 08:17:08 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:07.679 [debug] | [swarm on [email protected]] [tracker:start_pid_remotely] {Commanded.Event.Handler, "profile_kv_repo"} already registered to #PID<38830.3839.0> on [email protected], registering locally
Jun 21 08:17:08 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:07.679 [debug] | [swarm on [email protected]] [tracker:register] conflicting registration for {Commanded.Event.Handler, "profile_kv_repo"}: remote (#PID<38830.3839.0>) vs. local #PID<38831.3124.0>
Jun 21 08:17:08 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:07.679 [debug] | [:"[email protected]"] Named process {Commanded.Event.Handler, "profile_kv_repo"} already started on :"[email protected]" (#PID<38831.3124.0>)
Jun 21 08:17:08 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:07.684 [debug] | [[email protected]] Named process {Commanded.Event.Handler, "profile_kv_repo"} down due to: :noconnection
Jun 21 08:17:09 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:08.337 [debug] | [swarm on [email protected]] [tracker:handle_call] untrack #PID<38831.3124.0>
Jun 21 08:17:09 api-cluster-074.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[15041]: 2018-06-21 13:17:08.337 [debug] | [swarm on [email protected]] [tracker:handle_replica_event] replica event: untrack #PID<38829.3124.0>
Jun 21 08:17:09 api-cluster-095.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[5860]: 2018-06-21 13:17:08.341 [debug] | [swarm on [email protected]] [tracker:handle_replica_event] replica event: untrack #PID<38992.3124.0>
Jun 21 08:17:09 api-cluster-095.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[5860]: 2018-06-21 13:17:08.423 [info] request_id=2kt2uj2q7l6e7oovss000dj1 | GET /monitor/ping
Jun 21 08:17:09 api-cluster-095.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[5860]: 2018-06-21 13:17:08.423 [debug] request_id=2kt2uj2q7l6e7oovss000dj1 | Processing with Profile.Web.UtilityController.monitor/2 Parameters: %{} Pipelines: []
Jun 21 08:17:09 api-cluster-095.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[5860]: 2018-06-21 13:17:08.423 [info] request_id=2kt2uj2q7l6e7oovss000dj1 | Sent 200 in 225µs
Jun 21 08:17:09 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:08.685 [debug] | [[email protected]] Attempting to start distributed process: {Commanded.Event.Handler, "profile_kv_repo"} (Commanded.Event.Handler with args %Commanded.Event.Handler{consistency: :strong, handler_module: Profiles.EventHandler.ProfileKVRepoHandler, handler_name: "profile_kv_repo", last_seen_event: nil, subscribe_from: :current, subscription: nil})
Jun 21 08:17:09 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:08.685 [debug] | [swarm on [email protected]] [tracker:handle_call] registering {Commanded.Event.Handler, "profile_kv_repo"} as process started by Elixir.GenServer.start_link/2 with args [Commanded.Event.Handler, %Commanded.Event.Handler{consistency: :strong, handler_module: Profiles.EventHandler.ProfileKVRepoHandler, handler_name: "profile_kv_repo", last_seen_event: nil, subscribe_from: :current, subscription: nil}]
Jun 21 08:17:09 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:08.685 [debug] | [swarm on [email protected]] [tracker:do_track] starting {Commanded.Event.Handler, "profile_kv_repo"} on remote node [email protected]
And the above goes on and on. Shelling into the nodes remotely, I was able to confirm that both nodes are connected and communicating, but one node has a registry like this:
iex([email protected])14> Swarm.registered()
[
{{Commanded.Event.Handler, "profile_kv_repo"}, #PID<38831.3124.0>},
{:profile_event_publisher, #PID<38994.2867.0>},
{{Commanded.Event.Handler, "profile_assessments"}, #PID<0.2987.0>},
{{Commanded.Event.Handler, "event_publication_handler"}, #PID<38830.3133.0>},
{{Commanded.Event.Handler, "repository_handler"}, #PID<0.2986.0>}
]
So the issue here is that the remote process recorded for profile_kv_repo no longer existed. But the code that handles these sorts of conflicts here does nothing when the conflicting pid is owned by a remote node, which I believe sets the tracker up to loop forever trying to start that particular process. Running Swarm.unregister_name({Commanded.Event.Handler, "profile_kv_repo"}) on the node with the stale entry resolved the issue.
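For anyone else hitting this, the manual cleanup was roughly the following, run from a remote console on the node whose registry held the dead pid (the liveness check is just how I verified the pid really was gone):

```elixir
name = {Commanded.Event.Handler, "profile_kv_repo"}

# Swarm still resolves the name to the dead remote pid
pid = Swarm.whereis_name(name)

# Ask the node the pid points at whether the process is still alive
:rpc.call(node(pid), Process, :alive?, [pid])
#=> false (or {:badrpc, :nodedown} if that node is already gone)

# Drop the stale registration so the handler can be started cleanly again
Swarm.unregister_name(name)
```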
So a couple of questions:
- How did it get into this state? Is there a deployment process we should follow to prevent it?
- Is there something that Swarm can do to resolve this issue? For example, when Swarm detects that a registered process is down, could it remove its registration for that pid and then attempt to restart it? (I've sketched the kind of cleanup I mean below.)
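For what it's worth, the application-side workaround I have in mind looks something like the sketch below. It only uses Swarm's public API; the module name, sweep interval, and lack of supervision wiring are all made up for illustration. It simply removes stale names so the normal startup/handoff path can recreate the process.

```elixir
defmodule MyApp.StaleNameReaper do
  @moduledoc """
  Periodically walks Swarm's registry and unregisters names whose pid is no
  longer alive, so the normal start-up path can re-register them.
  (Sketch only; module name and interval are illustrative.)
  """
  use GenServer

  @interval :timer.seconds(30)

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok) do
    schedule()
    {:ok, %{}}
  end

  @impl true
  def handle_info(:sweep, state) do
    for {name, pid} <- Swarm.registered(), dead?(pid) do
      Swarm.unregister_name(name)
    end

    schedule()
    {:noreply, state}
  end

  defp schedule, do: Process.send_after(self(), :sweep, @interval)

  # A pid is considered dead only if the node it points at is reachable and
  # reports the process as not alive. If the node is unreachable, leave the
  # entry alone and let Swarm's own nodedown handling deal with it.
  defp dead?(pid) do
    case :rpc.call(node(pid), Process, :alive?, [pid]) do
      false -> true
      _ -> false
    end
  end
end
```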