
Unrecognized cast: {:untrack, pid}

Open · jesseshieh opened this issue 6 years ago · 7 comments

I'm not sure whether it actually causes a problem, but my logs sometimes show lines like these:

2019-02-25 23:55:35.676 [warn] [swarm on [email protected]] [tracker:handle_cast] unrecognized cast: {:untrack, #PID<0.583.0>}
2019-02-25 23:55:35.676 [warn] [swarm on [email protected]] [tracker:handle_cast] unrecognized cast: {:track, MyApp.Scheduler.TaskRegistry, #PID<39255.580.0>, %{mfa: {Quantum.TaskRegistry, :start_link, [%Quantum.TaskRegistry.StartOpts{name: MyApp.Scheduler.TaskRegistry}]}}}
2019-02-25 23:55:35.676 [warn] [swarm on [email protected]] [tracker:handle_cast] unrecognized cast: {:untrack, #PID<0.584.0>}
2019-02-25 23:55:35.677 [warn] [swarm on [email protected]] [tracker:handle_cast] unrecognized cast: {:track, MyApp.Scheduler.JobBroadcaster, #PID<39255.581.0>, %{mfa: {Quantum.JobBroadcaster, :start_link, [%Quantum.JobBroadcaster.StartOpts{debug_logging: true, jobs: [...jobs here...]}}}

Taken from https://github.com/quantum-elixir/quantum-core/issues/398

My guess is that something is sending the :track and :untrack messages as casts instead of calls. This line expects the message to be a call, and this line sends the message as a call.

This line, however, sends the message as a cast so my guess is this is the culprit. Perhaps simply changing it to a call will fix the issue?

I'm on swarm 3.4.0.
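
To make that guess concrete, here is a minimal sketch of the mechanism with a plain GenServer (hypothetical code, not Swarm's actual tracker): a message shape that only has a call clause falls through to the catch-all cast clause and is merely logged.

defmodule TrackerSketch do
  # Hypothetical module, for illustration only.
  use GenServer
  require Logger

  def start_link(_), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok), do: {:ok, %{registry: %{}}}

  # The intended path: :untrack is handled synchronously as a call.
  @impl true
  def handle_call({:untrack, pid}, _from, state),
    do: {:reply, :ok, %{state | registry: Map.delete(state.registry, pid)}}

  # There is no cast clause for {:untrack, pid}, so a cast with that shape
  # only reaches this catch-all and produces the warning.
  @impl true
  def handle_cast(msg, state) do
    Logger.warning("unrecognized cast: #{inspect(msg)}")
    {:noreply, state}
  end
end

# GenServer.call(TrackerSketch, {:untrack, pid})   # handled by handle_call/3
# GenServer.cast(TrackerSketch, {:untrack, pid})   # logs "unrecognized cast: {:untrack, #PID<...>}"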

jesseshieh · May 14 '19 13:05

fwiw, I also see these messages when running mix test on swarm. Specifically:

mix test test/integration_test.exs:63

And the test passes. When I change the cast to a call, the messages are gone, but the test fails.

Also, interestingly, when I delete the lines here, the tests still pass.

jesseshieh · May 14 '19 15:05

More observations

In the passing case, I see the following logs:


*DBG* 'Elixir.Swarm.Tracker' consume cast {sync_recv,<7155.141.0>,
                {{0,{0,1}},0},
                [{entry,{worker,8},
                        <7155.158.0>,#Ref<7155.3024736589.70516738.97011>,
                        #{group1 => true,
                          mfa => {'Elixir.MyApp.Worker',start_link,[]}},
                        {0,{0,0,1}}},
...
                        {0,{0,0,1}}}]} in state syncing => tracking
*DBG* 'Elixir.Swarm.Tracker' postpone cast {untrack,<0.158.0>} in state awaiting_sync_ack
*DBG* 'Elixir.Swarm.Tracker' receive cast {sync_ack,<7155.141.0>,
               {{0,{0,1}},0},
               [{entry,{worker,8},
                       <7155.159.0>,#Ref<7155.608380958.70516737.97926>,
                       #{group1 => true,
                         mfa => {'Elixir.MyApp.Worker',start_link,[]}},
                       {0,{0,2,0}}},
...
                       {0,{0,2,0}}}]} in state awaiting_sync_ack
08:06:14.211 [info]  [swarm on [email protected]] [tracker:awaiting_sync_ack] received sync acknowledgement from [email protected], syncing with remote registry
*DBG* 'Elixir.Swarm.Tracker' receive info {event,<7155.141.0>,{0,{0,0,2}},{untrack,<7155.158.0>}} in state tracking
*DBG* 'Elixir.Swarm.Tracker' consume cast {sync_ack,<7155.141.0>,
               {{0,{0,1}},0},
               [{entry,{worker,8},
                       <7155.159.0>,#Ref<7155.608380958.70516737.97926>,
                       #{group1 => true,
                         mfa => {'Elixir.MyApp.Worker',start_link,[]}},
                       {0,{0,2,0}}},
...
                       {0,{0,2,0}}}]} in state awaiting_sync_ack => tracking

Note that both nodes receive a sync_ack and the untrack is "postponed".
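
That "postpone" is a :gen_statem feature: a state function can return a :postpone action, which re-queues the event so it is delivered again after the next state change. A minimal sketch of the pattern (bare :gen_statem, hypothetical names, not Swarm's tracker code):

defmodule PostponeSketch do
  # Hypothetical module, for illustration only.
  @behaviour :gen_statem

  def start_link, do: :gen_statem.start_link(__MODULE__, :ok, [])

  @impl true
  def callback_mode, do: :state_functions

  @impl true
  def init(:ok), do: {:ok, :awaiting_sync_ack, %{}}

  # While syncing, untrack casts are deferred: :postpone re-queues the event
  # so it is delivered again once the state changes.
  def awaiting_sync_ack(:cast, {:untrack, _pid}, _data) do
    {:keep_state_and_data, :postpone}
  end

  def awaiting_sync_ack(:cast, {:sync_ack, _from}, data) do
    # Switching state releases the postponed {:untrack, pid} event,
    # which is then handled by the tracking/3 clause below.
    {:next_state, :tracking, data}
  end

  def tracking(:cast, {:untrack, pid}, data) do
    {:keep_state, Map.delete(data, pid)}
  end
end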

In the failing case, I see the following logs.

*DBG* 'Elixir.Swarm.Tracker' receive cast {sync_recv,<7155.143.0>,
                {{{0,1},0},0},
                [{entry,{worker,8},
                        <7155.160.0>,#Ref<7155.310440809.337641473.52450>,
                        #{group1 => true,
                          mfa => {'Elixir.MyApp.Worker',start_link,[]}},
                        {0,{0,1,0}}},
...
                        {0,{0,1,0}}}]} in state syncing
08:02:14.404 [info]  [swarm on [email protected]] [tracker:syncing] received registry from [email protected], merging..
*DBG* 'Elixir.Swarm.Tracker' receive call {untrack,<0.160.0>} from <7155.143.0> in state awaiting_sync_ack

Note that this time no sync_ack messages appear to be received, but the untrack does arrive, this time as a call. I'm guessing that the untrack message somehow prevents the sync_ack messages from being sent.
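
For what it's worth, here is a generic illustration (hypothetical code, not Swarm's actual tracker) of how an un-deferred synchronous message can wedge two peers: if two processes GenServer.call each other at the same time, each blocks inside its own handle_call waiting for the other's reply, and nothing else (such as sending a sync_ack) happens until the calls time out.

defmodule MutualCallSketch do
  # Hypothetical module, for illustration only.
  use GenServer

  def start_link(name), do: GenServer.start_link(__MODULE__, :ok, name: name)

  @impl true
  def init(:ok), do: {:ok, %{}}

  @impl true
  def handle_call({:ping_peer, peer}, _from, state) do
    # While blocked here we cannot answer the call the peer is making to us,
    # and the peer is equally blocked, so neither reply ever arrives.
    {:reply, GenServer.call(peer, :hello), state}
  end

  def handle_call(:hello, _from, state), do: {:reply, :world, state}
end

# {:ok, _} = MutualCallSketch.start_link(:a)
# {:ok, _} = MutualCallSketch.start_link(:b)
# Task.async(fn -> GenServer.call(:a, {:ping_peer, :b}) end)
# Task.async(fn -> GenServer.call(:b, {:ping_peer, :a}) end)
# After 5 seconds the inner calls time out and everything exits: each server
# was stuck inside handle_call/3 waiting on the other.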

jesseshieh · May 14 '19 15:05

It turns out that changing the cast to a call doesn't work because it causes a deadlock. The pull request I submitted doesn't change it to a call; it just makes the cast recognized.
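
Roughly the shape of that kind of fix, shown as a hypothetical addition to the TrackerSketch module sketched in an earlier comment (this is not the actual PR): give the cast form of the message its own clause with the same effect as the call form, minus the reply, so it no longer falls through to the catch-all and never blocks the caller the way a synchronous call could.

  # Added alongside the existing handle_call({:untrack, pid}, ...) clause in
  # the hypothetical TrackerSketch module above.
  @impl true
  def handle_cast({:untrack, pid}, state),
    do: {:noreply, %{state | registry: Map.delete(state.registry, pid)}}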

jesseshieh · May 16 '19 14:05

@jesseshieh Hi, we're also seeing this error in our app, and it causes some workers not to be registered properly. It seems to happen when nodes lose their connection to each other (maybe because libcluster's Kubernetes strategy isn't totally stable?) and then reconnect.

Did your patch fix the issue? If so, we'll just host the patched version in our repo and point the library at that version.

Thanks.

x-ji · Feb 17 '20 14:02

I can't remember if the patch fixed it or not since it was so long ago, but I've since decided to use something other than swarm so I'm not sure I'll be of much help.

jesseshieh · Feb 17 '20 15:02

> I can't remember if the patch fixed it or not since it was so long ago, but I've since decided to use something other than swarm so I'm not sure I'll be of much help.

@jesseshieh what did you use instead?

bernardo-martinez · Aug 04 '22 09:08

@bernardo-martinez I still saw the issue on every deployment. I ended up using horde as a replacement.

huruk · Aug 29 '22 15:08