swarm
Unrecognized cast: {:untrack, pid}
I'm not sure whether it actually causes a problem, but my logs sometimes show lines like these:
2019-02-25 23:55:35.676 [warn] [swarm on [email protected]] [tracker:handle_cast] unrecognized cast: {:untrack, #PID<0.583.0>}
2019-02-25 23:55:35.676 [warn] [swarm on [email protected]] [tracker:handle_cast] unrecognized cast: {:track, MyApp.Scheduler.TaskRegistry, #PID<39255.580.0>, %{mfa: {Quantum.TaskRegistry, :start_link, [%Quantum.TaskRegistry.StartOpts{name: MyApp.Scheduler.TaskRegistry}]}}}
2019-02-25 23:55:35.676 [warn] [swarm on [email protected]] [tracker:handle_cast] unrecognized cast: {:untrack, #PID<0.584.0>}
2019-02-25 23:55:35.677 [warn] [swarm on [email protected]] [tracker:handle_cast] unrecognized cast: {:track, MyApp.Scheduler.JobBroadcaster, #PID<39255.581.0>, %{mfa: {Quantum.JobBroadcaster, :start_link, [%Quantum.JobBroadcaster.StartOpts{debug_logging: true, jobs: [...jobs here...]}}}
Taken from https://github.com/quantum-elixir/quantum-core/issues/398
My guess is that something is sending the :track and :untrack messages as casts instead of calls. This line expects to receive the message as a call, and this line sends it as a call.
This line, however, sends the message as a cast, so my guess is that this is the culprit. Perhaps simply changing it to a call will fix the issue?
I'm on swarm 3.4.0.
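For anyone unfamiliar with the mechanics here: a GenServer dispatches calls and casts to separate callbacks, so a message shape that is only pattern-matched in handle_call falls through to a catch-all handle_cast clause when it arrives as a cast. The sketch below is a made-up Demo.Tracker module (not Swarm's actual code) that reproduces the shape of the warnings above:

```elixir
defmodule Demo.Tracker do
  use GenServer
  require Logger

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  @impl true
  def init(:ok), do: {:ok, %{}}

  # :track is only handled as a *call* here (analogous to the tracker
  # clause linked above that expects the message as a call)...
  @impl true
  def handle_call({:track, name, pid}, _from, state) do
    {:reply, :ok, Map.put(state, name, pid)}
  end

  # ...so the same tuple arriving as a *cast* matches only this
  # catch-all clause and is logged as unrecognized -- the same shape
  # as the warnings in my logs.
  @impl true
  def handle_cast(msg, state) do
    Logger.warning("unrecognized cast: #{inspect(msg)}")
    {:noreply, state}
  end
end
```

Sending `GenServer.call(pid, {:track, :worker, self()})` succeeds, while `GenServer.cast(pid, {:untrack, self()})` only produces the "unrecognized cast" warning.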
fwiw, I also see these messages when running mix test on swarm. Specifically
mix test test/integration_test.exs:63
And the test passes. When I change the cast to a call the messages are gone, but the test fails.
Also, interestingly, when I delete the lines here, the tests still pass.
More observations
In the passing case, I see the following logs
*DBG* 'Elixir.Swarm.Tracker' consume cast {sync_recv,<7155.141.0>,
{{0,{0,1}},0},
[{entry,{worker,8},
<7155.158.0>,#Ref<7155.3024736589.70516738.97011>,
#{group1 => true,
mfa => {'Elixir.MyApp.Worker',start_link,[]}},
{0,{0,0,1}}},
...
{0,{0,0,1}}}]} in state syncing => tracking
*DBG* 'Elixir.Swarm.Tracker' postpone cast {untrack,<0.158.0>} in state awaiting_sync_ack
*DBG* 'Elixir.Swarm.Tracker' receive cast {sync_ack,<7155.141.0>,
{{0,{0,1}},0},
[{entry,{worker,8},
<7155.159.0>,#Ref<7155.608380958.70516737.97926>,
#{group1 => true,
mfa => {'Elixir.MyApp.Worker',start_link,[]}},
{0,{0,2,0}}},
...
{0,{0,2,0}}}]} in state awaiting_sync_ack
08:06:14.211 [info] [swarm on [email protected]] [tracker:awaiting_sync_ack] received sync acknowledgement from [email protected], syncing with remote registry
*DBG* 'Elixir.Swarm.Tracker' receive info {event,<7155.141.0>,{0,{0,0,2}},{untrack,<7155.158.0>}} in state tracking
*DBG* 'Elixir.Swarm.Tracker' consume cast {sync_ack,<7155.141.0>,
{{0,{0,1}},0},
[{entry,{worker,8},
<7155.159.0>,#Ref<7155.608380958.70516737.97926>,
#{group1 => true,
mfa => {'Elixir.MyApp.Worker',start_link,[]}},
{0,{0,2,0}}},
...
{0,{0,2,0}}}]} in state awaiting_sync_ack => tracking
Note that both nodes receive a sync_ack and the untrack is "postponed".
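The "postpone" in those traces is :gen_statem's postpone action: an event that the current state can't handle yet is set aside and automatically re-delivered after a state change. Here is a minimal, hypothetical sketch (Demo.Statem and the message names are mine, not Swarm's) of an untrack cast being postponed in a syncing state and retried once the state machine moves to tracking:

```elixir
defmodule Demo.Statem do
  @behaviour :gen_statem

  def start_link, do: :gen_statem.start_link(__MODULE__, nil, [])

  @impl true
  def callback_mode, do: :handle_event_function

  @impl true
  def init(_), do: {:ok, :syncing, nil}

  @impl true
  def handle_event(:cast, {:untrack, _pid}, :syncing, _data) do
    # Can't safely untrack mid-sync: defer until we leave :syncing.
    {:keep_state_and_data, [:postpone]}
  end

  def handle_event(:cast, :sync_done, :syncing, data) do
    # A state change causes postponed events to be retried in :tracking.
    {:next_state, :tracking, data}
  end

  def handle_event(:cast, {:untrack, pid}, :tracking, data) do
    send(pid, :untracked)
    {:keep_state, data}
  end
end
```

In the passing run above, the postponed {untrack, ...} is presumably replayed the same way once the tracker leaves awaiting_sync_ack.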
In the failing case, I see the following logs.
*DBG* 'Elixir.Swarm.Tracker' receive cast {sync_recv,<7155.143.0>,
{{{0,1},0},0},
[{entry,{worker,8},
<7155.160.0>,#Ref<7155.310440809.337641473.52450>,
#{group1 => true,
mfa => {'Elixir.MyApp.Worker',start_link,[]}},
{0,{0,1,0}}},
...
{0,{0,1,0}}}]} in state syncing
08:02:14.404 [info] [swarm on [email protected]] [tracker:syncing] received registry from [email protected], merging..
*DBG* 'Elixir.Swarm.Tracker' receive call {untrack,<0.160.0>} from <7155.143.0> in state awaiting_sync_ack
Note that this time no sync_ack messages appear to be received, but the untrack is received. I'm guessing that the untrack message somehow prevents the sync_ack messages from being sent.
It turns out the reason changing the cast to a call doesn't work is that it causes a deadlock. The pull request I submitted does not change it to a call, but just makes the cast recognized.
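To make the deadlock concrete: GenServer.call blocks the caller until a reply arrives, so if a process ends up (directly or through a cycle) waiting on a call it must itself answer, no one can make progress. In Swarm the circular wait would be between two tracker processes, but the same mechanism can be shown in one process, since a server's mailbox is only drained after the current callback returns. Demo.Deadlock and the message names below are hypothetical:

```elixir
defmodule Demo.Deadlock do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  @impl true
  def init(:ok), do: {:ok, nil}

  # Issuing a synchronous call to ourselves from inside a callback can
  # never be answered: the :noop request sits in the mailbox until this
  # callback returns, so the inner call times out instead of replying.
  @impl true
  def handle_call(:untrack_via_call, _from, state) do
    result =
      try do
        GenServer.call(self(), :noop, 100)
      catch
        :exit, {:timeout, _} -> :deadlocked
      end

    {:reply, result, state}
  end

  def handle_call(:noop, _from, state), do: {:reply, :ok, state}
end
```

Calling `GenServer.call(pid, :untrack_via_call)` returns :deadlocked after the inner 100 ms timeout, which is why keeping the message a cast (and teaching the tracker to recognize it) avoids the problem.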
@jesseshieh Hi, we're also seeing this error in our app, and it causes some workers to not be properly registered. It seems to happen when nodes lose connection with each other (maybe due to libcluster's Kubernetes strategy not being totally stable?), and try to reconnect again.
Did your patch fix the issue? If so we'll just host the patched version in our repo and point the library to that version.
Thanks.
I can't remember if the patch fixed it or not since it was so long ago, but I've since decided to use something other than swarm so I'm not sure I'll be of much help.
@jesseshieh what did you use instead?
@bernardo-martinez I still saw the issue on every deployment. I ended up using horde as a replacement.