
A proposed enhancement to the epmd strategy

Open • qhwa opened this issue 3 years ago • 4 comments

Discussed in https://github.com/bitwalker/libcluster/discussions/167

Originally posted by qhwa September 23, 2021

Hi, first of all, thanks for the awesome library! It's a fundamental part of our Elixir application in Kubernetes.

Right now libcluster has the epmd strategy, which works just fine. However, in our scenario we want a few more features on top of it:

  1. The whole application should only start successfully once the remote node is connected, i.e. once the current node has joined the cluster, similar to how sync_nodes_mandatory works in Erlang (see the configuration sketch after this list).
  2. After the application starts, if the remote node disconnects, libcluster should retry connecting to it in the background, so that the cluster re-forms when the node comes back online.
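
For comparison, here is a minimal sketch of the classic Erlang mechanism. It assumes a Mix release, where build-time kernel settings end up in the generated sys.config; the node names are placeholders:

# config/config.exs (sketch; node names are placeholders)
# With a Mix release, these kernel settings are written into sys.config.
# The VM then blocks at boot until the listed nodes are connected, and
# terminates if they cannot be reached before sync_nodes_timeout elapses.
config :kernel,
  sync_nodes_mandatory: [:"app@10.0.0.1", :"app@10.0.0.2"],
  sync_nodes_timeout: 30_000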

I created a strategy based on the epmd strategy to achieve that.

The strategy is merely a module:

defmodule ClusterStrategy.ContactNodeStrategy do
  alias Cluster.Strategy.State
  use Cluster.Strategy

  require Logger

  @default_retry_delay 5_000
  @default_max_retries 12_000

  @impl true
  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  # Connect to every configured host on startup. If any connection fails,
  # `connect!/2` raises and the application fails to boot, similar to
  # `sync_nodes_mandatory`.
  def init([%State{config: config} = state]) do
    case config[:hosts] do
      [_ | _] = nodes ->
        for node <- nodes, do: connect!(node, state)
        {:ok, state}

      [] ->
        :ignore

      nil ->
        :ignore
    end
  end

  defp connect!(node, state) when is_atom(node) do
    :ok = connect(node, state)
    # Node monitoring is one-shot: delivering a :nodedown message turns
    # the monitor off, so it has to be re-armed after every reconnect
    # (see the {:remonitor, node} clause below).
    true = Node.monitor(node, true)
    node
  end

  defp connect(node, state) when is_atom(node),
    do: Cluster.Strategy.connect_nodes(state.topology, state.connect, state.list_nodes, [node])

  def handle_info({:nodedown, node}, state) do
    Logger.error("Node down: #{node}")
    parent = self()

    # Retry in a linked process so that a permanent failure also takes the
    # strategy down. On success, tell the strategy process to monitor the
    # node again, since monitors belong to the process that created them.
    spawn_link(fn ->
      retry_connect(node, state, max_retries(state))
      send(parent, {:remonitor, node})
    end)

    {:noreply, state}
  end

  def handle_info({:remonitor, node}, state) do
    true = Node.monitor(node, true)
    {:noreply, state}
  end

  defp retry_connect(node, state, retries_left)
       when is_integer(retries_left) and retries_left >= 0 do
    case connect(node, state) do
      :ok ->
        :ok

      {:error, _} ->
        :timer.sleep(retry_delay(state))
        retry_connect(node, state, retries_left - 1)
    end
  end

  defp retry_connect(node, state, _),
    do:
      raise(
        "Can't connect to the contact node #{node} after #{max_retries(state)} retries, terminating."
      )

  defp max_retries(%{config: config}),
    do: Keyword.get(config, :max_retries, @default_max_retries)

  defp retry_delay(%{config: config}),
    do: Keyword.get(config, :retry_delay, @default_retry_delay)
end
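
For completeness, here is how I imagine wiring the strategy into a topology; the topology name, node names, and timing values below are only examples:

# config/config.exs (example wiring; names and values are placeholders)
config :libcluster,
  topologies: [
    contact_nodes: [
      strategy: ClusterStrategy.ContactNodeStrategy,
      config: [
        hosts: [:"app@10.0.0.1", :"app@10.0.0.2"],
        retry_delay: 5_000,
        max_retries: 12_000
      ]
    ]
  ]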

I'd like to hear from you whether you think this is a good idea. If so, is there a way we can contribute it back to the library?

qhwa avatar Sep 23 '21 12:09 qhwa

We've just moved from peerage to libcluster and were surprised to find that the epmd strategy doesn't attempt to reconnect. I think the epmd strategy needs to handle connection failures and reconnect.
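
For reference, this is how we configure the built-in strategy today (node names are examples); as far as we can tell, it connects to the listed hosts once at startup and never retries:

config :libcluster,
  topologies: [
    epmd_example: [
      strategy: Cluster.Strategy.Epmd,
      config: [hosts: [:"a@127.0.0.1", :"b@127.0.0.1"]]
    ]
  ]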

BrendanBall avatar Jan 10 '22 11:01 BrendanBall

I can only agree. We can write our own strategy based on the existing one, but it would be a nice feature to have out of the box.

I'm open to improving the existing strategy, but time is always hard to find. If you have an existing implementation that you are using that you think is a strict improvement over the current one, and are willing to contribute that work, I'll certainly make time to get such a PR merged ASAP. I'll review the proposed implementation in the OP this week, so if that covers your use case, you can disregard the above.

bitwalker avatar Mar 18 '22 21:03 bitwalker

> I'm open to improving the existing strategy, but time is always hard to find. If you have an existing implementation that you are using that you think is a strict improvement over the current one, and are willing to contribute that work, I'll certainly make time to get such a PR merged ASAP. I'll review the proposed implementation in the OP this week, so if that covers your use case, you can disregard the above.

Thanks for the reply, @bitwalker. I'll try to find a slot to make a PR for it.

qhwa avatar May 19 '22 14:05 qhwa

#183 has been merged, so this will be fixed in the next release (going out today).

bitwalker avatar Jun 22 '23 17:06 bitwalker