target module not available on remote node

Open alexferreira opened this issue 6 years ago • 9 comments

I have two services, calculator and investigator, clustered through libcluster.

But when I register my service, something strange sometimes happens.

For example:

When I start the calculator service, it logs the following message:

[warn] [swarm on [email protected]] [tracker: start_pid_remotely] "a475b420-e5f8-4528-9b72-766b7e75d177" could not be started on [email protected]: target module not available on remote node, retrying operation after 1000ms ..

And in the investigator service I get the following:

[warn] [swarm on [email protected]] [tracker: do_track] ** (UndefinedFunctionError) function Calculator.Supervisor.register/1 is undefined (module Calculator.Supervisor is not available)
    Calculator.Supervisor.register("a475b420-e5f8-4528-9b72-766b7e75d177")
    (swarm) lib/swarm/tracker/tracker.ex:1082: Swarm.Tracker.do_track/2
    (stdlib) gen_statem.erl:1660: :gen_statem.call_state_function/5
    (stdlib) gen_statem.erl:1023: :gen_statem.loop_event_state_function/6
    (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3

But sometimes, when I try to start it, it works without problems.

Can anybody help me?

alexferreira avatar Jan 11 '19 20:01 alexferreira

To me this looks like you have a cluster with heterogeneous OTP apps. For Swarm to work, the OTP application that you are going to distribute processes for (e.g. with Swarm.register_name/4) needs to be available on all the nodes participating in the Swarm cluster.
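
A rough way to check this from an IEx session on one of the nodes, as a sketch only: the ClusterCheck module and its nodes_missing/1 function are hypothetical names, not part of Swarm. It asks every connected node whether the given module can be loaded there, using :rpc.call/4 and Code.ensure_loaded?/1.

```elixir
defmodule ClusterCheck do
  # Hypothetical helper for illustration; not part of Swarm.
  # Returns the nodes (local node included) on which `module` cannot be loaded.
  def nodes_missing(module) when is_atom(module) do
    Enum.reject([node() | Node.list()], fn n ->
      # :rpc.call/4 returns {:badrpc, reason} when the call fails,
      # so anything other than `true` counts as "missing".
      :rpc.call(n, Code, :ensure_loaded?, [module]) == true
    end)
  end
end
```

For the case in this issue, ClusterCheck.nodes_missing(Calculator.Supervisor) should return an empty list once the module is loadable on every node.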

arjan avatar Jan 11 '19 20:01 arjan

@arjan This happens right after running Swarm.register_name/4; it only works after a few attempts.

alexferreira avatar Jan 11 '19 20:01 alexferreira

So are the same OTP applications started on both nodes?

arjan avatar Jan 11 '19 20:01 arjan

Yes, the same applications were started on both nodes.

It's working right now. However, if I stop one of the applications and start it again several times, the problem mentioned above happens.

alexferreira avatar Jan 11 '19 20:01 alexferreira

Do you mean stopping the node, or just stopping the application (Application.stop)? Maybe the cluster is already formed before all application code is loaded, so tracker requests are already coming in. However, I cannot imagine that this takes very long...

arjan avatar Jan 11 '19 21:01 arjan

In this first GIF, as you can see, I started the applications and the error quoted above appeared soon after.

(GIF: swarm)

In the second GIF, as you can see, the error does not happen.

(GIF: swarm1)

alexferreira avatar Jan 11 '19 21:01 alexferreira

The problem seems to be that the second node is still loading code when Swarm on the first node tells Swarm on the second node to start a process (resulting in the crash, because the code isn't loaded yet). This is happening because when running with Mix, applications and their code are loaded and started sequentially, while in a release, all application code is first loaded, then applications are started.

My guess is that Mix starts Swarm before it starts the part of the system which invokes register_name, so Swarm on the second node starts and is able to communicate with the first node and accept registration requests before the code for the registration callback is loaded - since this is inherently racy, that's why it works only some of the time.

@arjan @beardedeagle Until we get the refactoring implemented so that Swarm can be started under the supervision tree rather than as its own tree, we could provide a configuration option which allows specifying an application that needs to be started before Swarm will start serving requests, and then basically just loop until the application status (via :application_controller.info/0) shows that it is started. Thoughts? The refactor is really the fix, but having a short term solution to this would be nice.
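
A minimal sketch of what that wait loop could look like, under stated assumptions: WaitForApps and await/3 are hypothetical names, no such option exists in Swarm today, and the sketch polls Application.started_applications/0 rather than calling :application_controller.info/0 directly. It takes a list of applications so that more than one can be awaited.

```elixir
defmodule WaitForApps do
  # Hypothetical helper sketching the proposed workaround; not part of Swarm.

  # Blocks until every app in `apps` is started, polling every `interval` ms.
  # Returns :ok, or {:error, {:not_started, pending}} once `timeout` ms have passed.
  def await(apps, timeout \\ 5_000, interval \\ 100) when is_list(apps) do
    deadline = System.monotonic_time(:millisecond) + timeout
    loop(apps, deadline, interval)
  end

  defp loop(apps, deadline, interval) do
    # Application.started_applications/0 returns {app, description, vsn} tuples.
    started = for {app, _desc, _vsn} <- Application.started_applications(), do: app

    case apps -- started do
      [] ->
        :ok

      pending ->
        if System.monotonic_time(:millisecond) >= deadline do
          {:error, {:not_started, pending}}
        else
          Process.sleep(interval)
          loop(apps, deadline, interval)
        end
    end
  end
end
```

Usage would be something like WaitForApps.await([:calculator]) before serving tracker requests, where :calculator stands in for whatever application owns the registration callback.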

bitwalker avatar Jan 17 '19 00:01 bitwalker

@bitwalker I worked around the situation by using a DynamicSupervisor.

alexferreira avatar Jan 18 '19 21:01 alexferreira

@bitwalker I think that's a workable temp solution, though I'd take it a step further and allow it to accept a list of applications.

beardedeagle avatar Jan 18 '19 21:01 beardedeagle