commanded-swarm-registry
                        Not all projections are running correctly
I have a setup with libcluster and three cluster nodes, and I sometimes get errors like the following.
18:59:50.813 [warn]  [swarm on app@server] [tracker:handle_topology_change] handoff failed for {CommandedApplication, Commanded.Event.Handler, "Projectors.Object"}: {{%RuntimeError{message: "attempted to call GenServer #PID<0.3862.0> but no handle_call/3 clause was provided"}, [{Commanded.Event.Handler, :handle_call, 3, [file: 'lib/gen_server.ex', line: 773]}, {:gen_server, :try_handle_call, 4, [file: 'gen_server.erl', line: 661]}, {:gen_server, :handle_msg, 6, [file: 'gen_server.erl', line: 690]}, {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 249]}]}, {GenServer, :call, [#PID<0.3862.0>, {:swarm, :begin_handoff}, 5000]}}
18:59:50.813 [error] GenServer #PID<0.3487.0> terminating
** (RuntimeError) attempted to call GenServer #PID<0.3862.0> but no handle_call/3 clause was provided
    (commanded 1.2.0) lib/gen_server.ex:773: Commanded.Event.Handler.handle_call/3
     (stdlib 3.12.1) gen_server.erl:661: :gen_server.try_handle_call/4
    (stdlib 3.12.1) gen_server.erl:690: :gen_server.handle_msg/6
    (stdlib 3.12.1) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:DOWN, #Reference<0.1551621646.1983119361.7440>, :process, #PID<0.3862.0>, {%RuntimeError{message: "attempted to call GenServer #PID<0.3862.0> but no handle_call/3 clause was provided"}, [{Commanded.Event.Handler, :handle_call, 3, [file: 'lib/gen_server.ex', line: 773]}, {:gen_server, :try_handle_call, 4, [file: 'gen_server.erl', line: 661]}, {:gen_server, :handle_msg, 6, [file: 'gen_server.erl', line: 690]}, {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 249]}]}}
I'm not sure how to debug this. When it happens, the projectors stop working. When I restart a node they start again, but then I get the same kinds of errors and not all projectors are running.
I don't really care if there are errors in the log, but I want my projectors to eventually be running, even if it takes a few restarts when taking a node down or up.
My setup looks like this: in my application I start the projectors as children of my application supervisor (roughly as sketched below). Is there a good way to make sure that my projections don't go away when restarting a node? Should I look for something in particular in the logs?
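A simplified version of that supervision tree (module names like MyApp.CommandedApplication and MyApp.Projectors.Object are placeholders for my real ones):
defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # The Commanded application.
      MyApp.CommandedApplication,
      # Projectors (Commanded event handlers) supervised directly by the application.
      MyApp.Projectors.Object
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end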
The handle_call/3 function which processes the {:swarm, :begin_handoff} message is implemented in the swarm registry module. I don't know why the error indicates it was not provided.
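For reference, the handoff clause Swarm expects is usually along these lines (a generic sketch of Swarm's documented handoff protocol, not the actual code from commanded-swarm-registry):
def handle_call({:swarm, :begin_handoff}, _from, state) do
  # Hand the current state to whichever node takes the process over;
  # :restart and :ignore are the other replies Swarm understands.
  {:reply, {:resume, state}, state}
end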
Have you tried using the Commanded Horde registry? I think Horde is more reliable than Swarm.
Yes, I was using it before, but it doesn't seem to be updated for commanded 1.2.0; it requires 1.0.0. Perhaps it's just a matter of updating its dependency, I'll try that.
I still have not gotten this to work. What I want is to have a cluster, configured using libcluster, where I can take nodes down and have the projections restart on another node when that happens.
Is this such an uncommon use case?
I don't care about the aggregates, they can run locally on each node. But projections should only run as singletons.
I've also tried making my own registry using the singleton library (see below), and this works fine when I run it on a single node. But when I run it on multiple nodes I get no messages, just {"Kernel pid terminated",application_controller,"{application_terminated,singleton,shutdown}"}
I'm out of ideas. Any help would be much appreciated.
defmodule CommandedSingletonRegistry do
  @moduledoc """
  Like LocalRegistry, but event handlers are registered using Singleton.
  This means that aggregates are local, but event handlers are global.
  """

  @behaviour Commanded.Registration.Adapter

  # Delegate registry setup and local process registration to the built-in local registry.
  @impl Commanded.Registration.Adapter
  def child_spec(application, config) do
    Commanded.Registration.LocalRegistry.child_spec(application, config)
  end

  @impl Commanded.Registration.Adapter
  def start_child(adapter_meta, name, supervisor, child_spec) do
    Commanded.Registration.LocalRegistry.start_child(adapter_meta, name, supervisor, child_spec)
  end

  # Start named processes (the event handlers) as cluster-wide singletons.
  @impl Commanded.Registration.Adapter
  def start_link(_adapter_meta, name, module, args, _start_options) do
    Singleton.start_child(module, args, name)
  end

  @impl Commanded.Registration.Adapter
  def whereis_name(adapter_meta, name) do
    Commanded.Registration.LocalRegistry.whereis_name(adapter_meta, name)
  end

  @impl Commanded.Registration.Adapter
  def via_tuple(adapter_meta, name) do
    Commanded.Registration.LocalRegistry.via_tuple(adapter_meta, name)
  end
end
@norpan I don’t use distributed Erlang so don’t have much experience with running it.
I use the feature/distributed branch of the Conduit sample repo to test with multiple nodes by starting and stopping them. This is configured to use the :global registry with Commanded. In my testing I see the event handlers get restarted on a new node whenever the node hosting them is stopped.
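The registry configuration in that branch looks roughly like this (the app and module names below are placeholders, not necessarily the exact ones used in Conduit):
import Config

config :conduit, ConduitApp,
  registry: :global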
It’s worth noting that you can safely run the same event handler on multiple nodes and are guaranteed that only one instance will actually process events, even when using Commanded’s local registry. If the running instance fails then one of the other instances will take over, after a short delay. You don’t need to use distributed Erlang to have “singleton” event handling.
Using distributed Erlang just makes this more efficient by only starting a single instance of each event handler, regardless of how many nodes are connected to the cluster.
Yes, I know; I tried that (just running on all nodes) in the past, but I had some problems with consistency: :strong. Perhaps it's no longer an issue, I will try that, thanks!
For strong command dispatch consistency you need to configure Commanded to use the Phoenix PubSub adapter.
In a distributed Erlang setup you can use PG2 via the Phoenix.PubSub.PG2 adapter. With a multi-node deployment that doesn't use distributed Erlang you will need to use Redis and the Phoenix.PubSub.Redis adapter.
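For example, a sketch of that configuration (:my_app and MyApp.CommandedApplication are placeholder names):
import Config

config :my_app, MyApp.CommandedApplication,
  pubsub: [
    phoenix_pubsub: [
      adapter: Phoenix.PubSub.PG2,
      pool_size: 1
    ]
  ]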
@norpan Have you been able to solve this issue or any new insights to it? We experience the same issues on our system and as you said, it's hard to debug.
Yes, following the setup in https://github.com/commanded/commanded/blob/master/guides/Deployment.md#multi-node-cluster-deployment works for us. A crucial thing is the :max_restarts setting.
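For illustration, the relevant part of our setup looks roughly like this (the child modules are placeholders; the point is raising :max_restarts on the supervisor that starts the handlers so they survive a burst of restarts while a node goes down or comes up):
children = [
  MyApp.CommandedApplication,
  MyApp.Projectors.Object
]

Supervisor.start_link(children,
  strategy: :one_for_one,
  max_restarts: 10,
  max_seconds: 60
)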
I've run into this issue as well and did some investigation, and I'm pretty sure I know where the problem is.
In the lib/commanded/event/handler.ex file we have:
defmodule Commanded.Event.Handler do
  use GenServer
  use Commanded.Registration
  ...
end
And going deeper, in the lib/commanded/registration.ex file, we have:
@doc """
Use the `Commanded.Registration` module to import the registry adapter and
via tuple functions.
"""
defmacro __using__(_opts) do
  quote location: :keep do
    @before_compile unquote(__MODULE__)
    alias unquote(__MODULE__)
  end
end
@doc """
Allow a registry adapter to handle the standard `GenServer` callback
functions.
"""
defmacro __before_compile__(_env) do
  quote generated: true, location: :keep do
    @doc false
    def handle_call(request, from, state) do
      adapter = registry_adapter(state)
      adapter.handle_call(request, from, state)
    end
    @doc false
    def handle_cast(request, state) do
      adapter = registry_adapter(state)
      adapter.handle_cast(request, state)
    end
    @doc false
    def handle_info(msg, state) do
      adapter = registry_adapter(state)
      adapter.handle_info(msg, state)
    end
    defp registry_adapter(state) do
      application = Map.get(state, :application)
      {adapter, _adapter_meta} = Application.registry_adapter(application)
      adapter
    end
  end
end
In theory, what these macros do is simply append the handle_* functions to the bottom of the Commanded.Event.Handler module. But in reality, they get overwritten by the default handle_* definitions from the GenServer module during compilation.
Here is a more concise example. Consider the following modules:
defmodule MyCoolMacro do
  defmacro __using__(_opts) do
    quote location: :keep do
      @before_compile unquote(__MODULE__)
    end
  end
  defmacro __before_compile__(_env) do
    quote generated: true, location: :keep do
      def handle_call(_, _, _) do
        {:foo, "bar"}
      end
    end
  end
end
defmodule MyCoolModule do
  use MyCoolMacro
end
If we open up an IEx shell, we get the following:
iex(1)> MyCoolModule.handle_call(nil, nil, nil)
{:foo, "bar"}
As expected, we get the result of the handle_call function that we defined in our macro. But now, let's try adding use GenServer to our module:
defmodule MyCoolModule do
  use GenServer
  use MyCoolMacro
end
And now, if we open up an IEx shell, we get the same behaviour as we are seeing with Commanded:
iex(1)> MyCoolModule.handle_call(nil, nil, nil)
** (RuntimeError) attempted to call GenServer #PID<0.158.0> but no handle_call/3 clause was provided
    (foo_bar 0.1.0) lib/gen_server.ex:779: MyCoolModule.handle_call/3
@slashdotdash Any suggestions on how to solve this correctly? Just adding the functions to __using__ seems to break other parts of the application where use Commanded.Registration is used.