
Handover begins before Supervision tree is fully loaded

Open mbaeuerle opened this issue 7 years ago • 5 comments

Swarm: 3.2.1 Elixir: 1.6.1 Erlang: 20.2.2

Suppose we have two nodes, A and B, and a worker W that was spawned on node A. Node A crashes and B takes over W as expected. When A starts again, Swarm tries to do the handover of W. In our case this happens before Test.Supervisor, the supervisor for W, has started. Swarm retries the handover and succeeds after one or two retries, but an exception like the following is still logged:

[error] [swarm on [email protected]] [tracker:handle_handoff] ** (exit) exited in: GenServer.call(Test.Supervisor, {:start_child, [:state]}, :infinity)
    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
    (elixir) lib/gen_server.ex:821: GenServer.call/3
    (swarm) lib/swarm/tracker/tracker.ex:646: Swarm.Tracker.handle_handoff/3
    (stdlib) gen_statem.erl:1240: :gen_statem.call_state_function/5
    (stdlib) gen_statem.erl:1012: :gen_statem.loop_event/6
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3

Maybe there is a way to tell when node A is fully up again (with the whole supervision tree loaded) so that Swarm can safely begin the handoff.

mbaeuerle avatar Feb 06 '18 09:02 mbaeuerle

Unfortunately there generally isn't a way to know when a node is "fully" started; even how we detect when Swarm is started is less than ideal. Prerequisites for handing off a given process could be stored as part of its metadata, but I'd have to think about the best way to configure that, since any kind of waiting may result in locking up the tracker.

@slashdotdash Thoughts?

bitwalker avatar Feb 07 '18 20:02 bitwalker

@bitwalker I'm not sure how best to handle this. If it succeeds after one or two retries we could consider lowering the logging level from error to warn/info as it's only a transient error.

slashdotdash avatar Feb 12 '18 20:02 slashdotdash

I think the log level is appropriate, but we probably should evaluate a better way of either configuring or determining when a node is considered "available". This will come up again when support for supervision is merged into Swarm, since we'll need to wait for the supervisors to start on the given nodes. Our current approach of waiting for just the Swarm application to be started was primitive, and not really the right solution, but it worked a bit better than what was there before. I suspect what we need is to provide a configuration option which takes a module/function/args triple to call which we execute on each node when it starts up, and that either blocks until the node is available or returns either :ok or :unavailable or something, and which we monitor until :ok is returned. Blocking isn't great, since it has the potential to lock up the tracker for significant amounts of time, but an "async" API is more error prone from a user perspective. I'm not sure which is the best approach, but I'm leaning towards the latter.

Other than that, I'm not sure there is a better option. I don't think it's a good idea to default to retrying blindly, since we won't know why the process isn't available: it could be a legitimate error or just a timing issue (either due to startup or a crash).

bitwalker avatar Feb 12 '18 20:02 bitwalker
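
The module/function/args check proposed above could be sketched roughly as follows. This is only an illustration of the idea, not an existing Swarm option: the module name ReadinessSketch and its functions await/3 and registered?/1 are hypothetical, and the loop blocks its caller, so it would have to run outside the tracker process (for example in a separate task) to avoid the lock-up concern raised in the comment.

```elixir
defmodule ReadinessSketch do
  @moduledoc """
  Hypothetical helper, not part of Swarm: polls a user-supplied
  {module, function, args} triple until it reports :ok.
  """

  # Calls apply(m, f, a) up to `attempts` times, sleeping `interval_ms`
  # after each failed try. Returns :ok as soon as the check passes and
  # :unavailable once the attempts are used up.
  def await(mfa, attempts \\ 10, interval_ms \\ 100)

  def await(_mfa, 0, _interval_ms), do: :unavailable

  def await({m, f, a} = mfa, attempts, interval_ms) when attempts > 0 do
    case apply(m, f, a) do
      :ok ->
        :ok

      _other ->
        Process.sleep(interval_ms)
        await(mfa, attempts - 1, interval_ms)
    end
  end

  # Example check: treat the node as available once some process is
  # registered locally under `name` (for instance a supervisor).
  def registered?(name) when is_atom(name) do
    if Process.whereis(name), do: :ok, else: :unavailable
  end
end
```

With the supervisor name from the original report, a call such as ReadinessSketch.await({ReadinessSketch, :registered?, [Test.Supervisor]}) would return :ok once Test.Supervisor is registered, or :unavailable after the attempts run out.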

I have something similar, and I'm still getting the hang of everything here while learning.

I am experimenting with a hello world app that spins up Phoenix servers with libcluster and Swarm. A GenServer prints all the connected nodes from any one of the server nodes, and if a node shuts down, the newly created node (when running inside something like k8s) should restart the GenServer.

The libcluster part works fine, but Swarm keeps giving me this error. I have not changed the error that much, but I am not sure what or where to debug further.

EDIT: config for swarm debug turned on

[info] [swarm on nonode@nohost] [tracker:init] started
*DBG* 'Elixir.Swarm.Tracker' receive call {whereis,<<"hello-genserver">>} from <0.318.0> in state cluster_wait
*DBG* 'Elixir.Swarm.Tracker' postpone call {whereis,<<"hello-genserver">>} from <0.318.0> in state cluster_wait
*DBG* 'Elixir.Swarm.Tracker' receive info cluster_join in state cluster_wait
*DBG* 'Elixir.Swarm.Tracker' consume info cluster_join in state cluster_wait => tracking
[info] [swarm on nonode@nohost] [tracker:cluster_wait] joining cluster..
[info] [swarm on nonode@nohost] [tracker:cluster_wait] no connected nodes, proceeding without sync
*DBG* 'Elixir.Swarm.Tracker' consume call {whereis,<<"hello-genserver">>} from <0.318.0> in state tracking
*DBG* 'Elixir.Swarm.Tracker' receive call {track,<<"hello-genserver">>,
         #{mfa =>
               {'Elixir.HelloGenserver.Supervisor',register,
                   [<<"hello-genserver">>]}}} from <0.318.0> in state tracking
[debug] [swarm on nonode@nohost] [tracker:handle_call] registering "hello-genserver" as process started by Elixir.HelloGenserver.Supervisor.register/1 with args ["hello-genserver"]
[debug] [swarm on nonode@nohost] [tracker:do_track] starting "hello-genserver" on nonode@nohost
[warn] [swarm on nonode@nohost] [tracker:do_track] ** (exit) exited in: GenServer.call(HelloGenserver.Supervisor, {:start_child, ["hello-genserver"]}, :infinity)
    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
    (elixir) lib/gen_server.ex:979: GenServer.call/3
    (hello_genserver) lib/hello_genserver/supervisor.ex:20: HelloGenserver.Supervisor.register/1
    (swarm) lib/swarm/tracker/tracker.ex:1302: Swarm.Tracker.do_track/2
    (stdlib) gen_statem.erl:1660: :gen_statem.call_state_function/5
    (stdlib) gen_statem.erl:1023: :gen_statem.loop_event_state_function/6
    (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3

[error] Error registering name from Swarm: {:noproc, {GenServer, :call, [HelloGenserver.Supervisor, {:start_child, ["hello-genserver"]}, :infinity]}}
*DBG* 'Elixir.Swarm.Tracker' consume call {track,<<"hello-genserver">>,
         #{mfa =>
               {'Elixir.HelloGenserver.Supervisor',register,
                   [<<"hello-genserver">>]}}} from <0.318.0> in state tracking
[info] Running HelloGenserverWeb.Endpoint with cowboy 2.6.1 at http://localhost:4000

Here is my application.ex

defmodule HelloGenserver.Application do
  # See https://hexdocs.pm/elixir/Application.html
  # for more information on OTP Applications
  @moduledoc false

  use Application
  require Logger

  @name "hello-genserver"

  def start(_type, _args) do
    # List all child processes to be supervised
    children =
      [
        # Start the endpoint when the application starts
        HelloGenserverWeb.Endpoint
      ]
      |> register_or_skip

    # See https://hexdocs.pm/elixir/Supervisor.html
    # for other strategies and supported options
    opts = [strategy: :one_for_one, name: HelloGenserver.Spv]
    Supervisor.start_link(children, opts)
  end

  # Tell Phoenix to update the endpoint configuration
  # whenever the application is updated.
  def config_change(changed, _new, removed) do
    HelloGenserverWeb.Endpoint.config_change(changed, removed)
    :ok
  end

  defp register_or_skip(children) do
    # Libcluster configuration

    topologies = [
      nodes: [
        strategy: Cluster.Strategy.Gossip
      ]
    ]

    case Swarm.whereis_or_register_name(@name, HelloGenserver.Supervisor, :register, [@name]) do
      {:ok, _pid} ->
        [
          # Start the cluster supervisor
          {Cluster.Supervisor, [topologies, [name: HelloGenserver.ClusterSupervisor]]}
          | children
        ]

      {:error, reason} ->
        Logger.error("Error registering name from Swarm: #{inspect(reason)}")
        children
    end
  end
end

For the complete code, you may visit here

sreecodeslayer avatar Mar 11 '19 05:03 sreecodeslayer

Alright, so for me the issue was that I actually had to call HelloGenserver.Supervisor.start_link() before asking Swarm to register the name. I believe this is not shown in the example (which I used as the starting point for most of the hello world app's Swarm implementation).

i.e.,

defmodule MyApp.ExampleUsage do
  ...snip...

  @doc """
  Starts worker and registers name in the cluster, then joins the process
  to the `:foo` group
  """
  def start_worker(name) do
    # setup supervisor
    MyApp.Supervisor.start_link()

    {:ok, pid} = Swarm.register_name(name, MyApp.Supervisor, :register, [name])
    Swarm.join(:foo, pid)
  end
...

If there is a better way to do this, please let me know.

sreecodeslayer avatar Mar 11 '19 10:03 sreecodeslayer
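
One possible alternative to calling start_link() by hand is to put the worker supervisor in the application's supervision tree and trigger registration from a child that is listed after it, since a supervisor starts its children in order. The sketch below illustrates only that ordering and makes several assumptions: all OrderingSketch module names are hypothetical, a DynamicSupervisor stands in for the supervisor used in the thread, and the Swarm call is replaced by a direct call to register/1 so the example runs without dependencies.

```elixir
defmodule OrderingSketch.Worker do
  use GenServer

  def start_link(name), do: GenServer.start_link(__MODULE__, name)

  @impl true
  def init(name), do: {:ok, name}
end

defmodule OrderingSketch.WorkerSupervisor do
  use DynamicSupervisor

  def start_link(_arg) do
    DynamicSupervisor.start_link(__MODULE__, :ok, name: __MODULE__)
  end

  # Plays the role of HelloGenserver.Supervisor.register/1 from the thread.
  def register(name) do
    DynamicSupervisor.start_child(__MODULE__, {OrderingSketch.Worker, name})
  end

  @impl true
  def init(:ok), do: DynamicSupervisor.init(strategy: :one_for_one)
end

defmodule OrderingSketch do
  def start do
    children = [
      # Started first, so the name is registered before anything below runs.
      OrderingSketch.WorkerSupervisor,
      # Started second. With Swarm as a dependency, this function would
      # instead call Swarm.whereis_or_register_name/4 with
      # OrderingSketch.WorkerSupervisor, :register and [name].
      {Task,
       fn -> {:ok, _pid} = OrderingSketch.WorkerSupervisor.register("hello-genserver") end}
    ]

    Supervisor.start_link(children, strategy: :one_for_one)
  end
end
```

Calling OrderingSketch.start() brings up the supervisor and then starts one worker under it, without the "no process" exit seen in the logs above.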