Clock conflicts and other errors when clustering

Open xrodriguez-betterdoc opened this issue 4 years ago • 0 comments

Hi there!

We are using Swarm in a cluster with 4 nodes (EKS). The nodes discover each other dynamically using libcluster (Kubernetes strategy), as many people are doing right now. I didn't expect to get this many warnings and errors when using Swarm. Maybe we are doing something wrong?

To give you some examples of the warnings we receive:

[swarm on {app}@x.x.x.x] [tracker:handle_replica_event] received track event for "{process}", mismatched pids, local clock conflicts with remote clock, event unhandled

** (exit) exited in: :gen_statem.call(Swarm.Tracker, {:track, "{process}", %{mfa: {Module, :start_link, ["{process}", {state}]}}}, 5000)
    ** (EXIT) time out

[swarm on {app}@x.x.x.x] [tracker:ensure_swarm_started_on_remote_node] nodeup for {app}@x.x.x.x was ignored because: {:badrpc, {:EXIT, {:timeout, {:gen_server, :call, [:application_controller, :which_applications]}}}}

[swarm on {app}@x.x.x.x] [tracker:handle_topology_change] handoff failed for "{process}": {:timeout, {GenServer, :call, [#PID<0.11273.0>, {:swarm, :begin_handoff}, 5000]}}

and some others.

Something else that worries me is how Swarm knows where to send the handoff messages. If we do a rolling restart of a deployment, does it send those messages to the "new" nodes? Or might it send them to the nodes that are about to be shut down?

Thanks in advance!

Nov 24 '21 12:11 xrodriguez-betterdoc