expert icon indicating copy to clipboard operation
expert copied to clipboard

Don't use EPMD.

Open mhanberg opened this issue 3 months ago • 4 comments

We shouldn't need to use EPMD for clustering as we are always aware of the nodes.

@josevalim has had similar problems with livebook related to EPMD and has fixed them with a epmdless style approach.

We should be able to leverage that approach in expert.

Should help with problems like the sub issues

mhanberg avatar Sep 02 '25 15:09 mhanberg

Perhaps the steps here will be enough. You start the main one with -start_epmd false -erl_epmd_port 6789 and each individual instance you start with -start_epmd false -erl_epmd_port 6789 -dist_listen false:

https://hexdocs.pm/mix/Mix.Tasks.Release.html#module-epmd-less-deployment

josevalim avatar Sep 02 '25 16:09 josevalim

@josevalim do you know if this should work when many instances of the application? My intuition is telling me that the erl_epmd_port might need to be unique for each instance

mhanberg avatar Sep 04 '25 15:09 mhanberg

Yes, it should. The parent node is listening on port 6789 and each child node connects to port 6789 but does not listen to any (-dist_listen false). This requires, however, for the child node to be the one connecting to the parent.

josevalim avatar Sep 05 '25 05:09 josevalim

Oh, sorry, I believe I have misunderstood your question. You want to run multiple Expert servers, not multiple clients, then you are right!

In order to do that, you can write this code:

defmodule Expert.EPMD do
  def dist_port do
    :persistent_term.get(:expert_dist_port, nil)
  end

  # EPMD callbacks

  def register_node(name, port), do: register_node(name, port, :inet)

  def register_node(name, port, family) do
    :persistent_term.put(:expert_dist_port, port)

    # We don't care if EPMD is not running
    case :erl_epmd.register_node(name, port, family) do
      {:error, _} -> {:ok, -1}
      {:ok, _} = ok -> ok
    end
  end

  defdelegate start_link(), to: :erl_epmd
  defdelegate port_please(name, host), to: :erl_epmd
  defdelegate port_please(name, host, timeout), to: :erl_epmd
  defdelegate listen_port_please(name, host), to: :erl_epmd
  defdelegate address_please(name, host, family), to: :erl_epmd
  defdelegate names(host_name), to: :erl_epmd
end

This is basically a wrapper around EPMD so you get to know which port it is running on. Now, in your vm.args, you will set this: -epmd_module Elixir.Expert.EPMD.

Finally, when starting the clients, you will do this:

if port = Expert.EPMD.dist_port() do
  start_client("-start_epmd false -erl_epmd_port #{port} -dist_listen false")
else
  # start as today
end

josevalim avatar Sep 05 '25 05:09 josevalim

@josevalim do you know if the method you described here (which uses the :erl_epmd module) is still "epmd-less" ?

I want to confirm that the :erl_epmd module does not need epmd itself.

mhanberg avatar Sep 27 '25 19:09 mhanberg

The clients no longer talk to the server via epmd, they connect directly to the port. I think the server still tries to register itself in epmd (but I think it still runs if it can’t register).

josevalim avatar Sep 27 '25 20:09 josevalim

That was a lie. I had to push one small change to the code above (now edited). Now it works. Here are the receipts.

Save the Expert.EPMD above to epmdless.ex and run this command:

$ killall epmd
$ elixirc epmdless.ex
$ iex --erl "-start_epmd false -epmd_module Elixir.Expert.EPMD" --sname foo
Erlang/OTP 27 [erts-15.1.2] [source] [64-bit] [smp:10:10] [ds:10:10:10] [async-threads:1] [jit]

Interactive Elixir (1.20.0-dev) - press Ctrl+C to exit (type h() ENTER for help)
iex(foo@macstudio)1> Expert.EPMD.dist_port
65378
iex(foo@macstudio)2>

Keep the node running but take note of the port above (you will get a different number). Now in another terminal, using the PORT from IEx above:

$ iex  --erl "-start_epmd false -erl_epmd_port PORT -dist_listen false" --sname bar

They should connect! To prove it doesn't use epmd, in another terminal:

$ epmd -names
epmd: Cannot connect to local epmd

josevalim avatar Sep 27 '25 20:09 josevalim

In practice, you only need to start the Expert server passing -epmd_module Elixir.Expert.EPMD in vm.args. The current implementation tries to register itself in epmd but, if it cannot find anything, it simply does not care. But you can also use "-start_epmd false -epmd_module Elixir.Expert.EPMD".

But the client node always needs the three flags above: -start_epmd false -erl_epmd_port PORT -dist_listen false.

I will be glad to jump into a call if you folks need me to.

josevalim avatar Sep 27 '25 20:09 josevalim

Thanks José, I'll try this out.

mhanberg avatar Sep 27 '25 23:09 mhanberg

For those getting RPC errors, like in #59 (I sometimes do).

I find it's because of some epmd process that is still running, but shouldn't be

Kill it

Image

(I would have posted this in the linked issue, but it's locked.)

ryanwinchester avatar Oct 08 '25 13:10 ryanwinchester

Here is the updated version as required:

defmodule Expert.NodePortMapper do
  use GenServer

  @name __MODULE__

  def start_link(_) do
    GenServer.start_link(__MODULE__, :ok, name: @name)
  end

  defp parent_node do
    if parent_node = System.get_env("EXPERT_PARENT_NODE") do
      String.to_atom(parent_node)
    else
      node()
    end
  end

  def register() do
    GenServer.call({@name, parent_node()}, {:register, node(), Expert.EPMD.dist_port()})
  end

  def get_port(node) do
    GenServer.call({@name, parent_node()}, {:get_port, node})
  end

  def init(:ok) do
    {:ok, %{}}
  end

  def handle_call({:register, node, port}, _from, state) do
    :erlang.monitor_node(node, true)
    {:reply, :ok, Map.put(state, node, port)}
  end

  def handle_call({:get_port, node}, _from, state) do
    {:reply, Map.get(state, node), state}
  end

  def handle_info({:nodedown, node}, state) do
    {:noreply, Map.delete(state, node)}
  end
end

defmodule Expert.EPMD do
  @moduledoc false

  # From Erlang/OTP 23+
  @epmd_dist_version 6

  @doc ~S"""
  This is the distribution port of the current node.

  The parent node must be named `expert_parent_*`.
  The child node must be named `expert_child_*`.

  When the parent boots the child, it must pass
  its node name and port as the respective environment
  variables `EXPERT_PARENT_NODE` and `EXPERT_PARENT_PORT`.

  The parent must have this as a child in its supervision tree:

      {Expert.NodePortMapper, []}

  The child, in turn, must have this:

      {Task, &Expert.NodePortMapper.register/0}

  This will register the child within the parent, so they can
  find each other.

  ## Example

  In order to manually simulate the connections, run `elixirc epmd.ex` to compile
  this file and follow the steps below. Notice we call the functions in the
  `Expert.NodePortMapper` module directly, while in practice they will be called
  as part of the app's supervision tree.

      # In one node
      $ iex --erl "-start_epmd false -epmd_module Elixir.Expert.EPMD" --sname expert_parent_foo
      iex(expert_parent_foo@macstudio)> Expert.NodePortMapper.start_link([])
      iex(expert_parent_foo@macstudio)> Expert.EPMD.dist_port()
      52914

  Get the port name from the step above and then, in another terminal, do:

      $ EXPERT_PARENT_NODE=expert_parent_foo@macstudio EXPERT_PARENT_PORT=52914 \
          iex --erl "-start_epmd false -epmd_module Elixir.Expert.EPMD" --sname expert_child_bar
      iex> Expert.NodePortMapper.register()

  And in another terminal:

      $ EXPERT_PARENT_NODE=expert_parent_foo@macstudio EXPERT_PARENT_PORT=52914 \
          iex --erl "-start_epmd false -epmd_module Elixir.Expert.EPMD -expert parent_port 52914" --sname expert_child_baz
      iex> Expert.NodePortMapper.register()

  If you try `Node.ping(:expert_child_bar@HOSTNAME)` from the last node, it should work.
  The child nodes will find each other even without EPMD.
  """
  def dist_port do
    :persistent_term.get(:expert_dist_port, nil)
  end

  # EPMD callbacks

  def register_node(name, port), do: register_node(name, port, :inet)

  def register_node(name, port, family) do
    :persistent_term.put(:expert_dist_port, port)

    # We don't care if EPMD is not running
    case :erl_epmd.register_node(name, port, family) do
      {:error, _} -> {:ok, -1}
      {:ok, _} = ok -> ok
    end
  end

  def port_please(name, host), do: port_please(name, host, :infinity)

  def port_please(~c"expert_parent_" ++ _ = name, host, timeout) do
    if port = System.get_env("EXPERT_PARENT_PORT") do
      {:port, String.to_integer(port), @epmd_dist_version}
    else
      :erl_epmd.port_please(name, host, timeout)
    end
  end

  def port_please(~c"expert_child_" ++ _ = name, host, timeout) do
    if port = Expert.NodePortMapper.get_port(List.to_atom(name)) do
      {:port, port, @epmd_dist_version}
    else
      :erl_epmd.port_please(name, host, timeout)
    end
  end

  def port_please(name, host, timeout) do
    :erl_epmd.port_please(name, host, timeout)
  end

  defdelegate start_link(), to: :erl_epmd
  defdelegate listen_port_please(name, host), to: :erl_epmd
  defdelegate address_please(name, host, family), to: :erl_epmd
  defdelegate names(host_name), to: :erl_epmd
end

The docs in the dist_port function in how to make it work. But in a nutshell, the parent node must be named expert_parent_WHATEVER. The children nodes must be named expert_child_WHATEVER. The parent node must start the children nodes passing the EXPERT_PARENT_NODE and EXPERT_PARENT_PORT as env vars, which can be fetched respectivaly as node() and Expert.EPMD.dist_port().

josevalim avatar Nov 09 '25 20:11 josevalim

Done in #205

doorgan avatar Nov 15 '25 17:11 doorgan