
Unrecognized cast: {:untrack, pid}

Open · jesseshieh opened this issue 6 years ago · 7 comments

I'm not sure whether it actually causes a problem, but my logs sometimes show lines like these:

2019-02-25 23:55:35.676 [warn] [swarm on [email protected]] [tracker:handle_cast] unrecognized cast: {:untrack, #PID<0.583.0>}
2019-02-25 23:55:35.676 [warn] [swarm on [email protected]] [tracker:handle_cast] unrecognized cast: {:track, MyApp.Scheduler.TaskRegistry, #PID<39255.580.0>, %{mfa: {Quantum.TaskRegistry, :start_link, [%Quantum.TaskRegistry.StartOpts{name: MyApp.Scheduler.TaskRegistry}]}}}
2019-02-25 23:55:35.676 [warn] [swarm on [email protected]] [tracker:handle_cast] unrecognized cast: {:untrack, #PID<0.584.0>}
2019-02-25 23:55:35.677 [warn] [swarm on [email protected]] [tracker:handle_cast] unrecognized cast: {:track, MyApp.Scheduler.JobBroadcaster, #PID<39255.581.0>, %{mfa: {Quantum.JobBroadcaster, :start_link, [%Quantum.JobBroadcaster.StartOpts{debug_logging: true, jobs: [...jobs here...]}}}

Taken from https://github.com/quantum-elixir/quantum-core/issues/398

My guess is that something is sending the :track and :untrack messages as casts instead of calls. This line expects the message to be a call, and this line sends the message as a call.

This line, however, sends the message as a cast so my guess is this is the culprit. Perhaps simply changing it to a call will fix the issue?

I'm on swarm 3.4.0.
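
To make that guess concrete, here is a minimal sketch of the mechanism with a plain GenServer (hypothetical code, not Swarm's actual tracker): a message shape that only has a call clause falls through to the catch-all cast clause and is merely logged.

defmodule TrackerSketch do
  # Hypothetical module, for illustration only.
  use GenServer
  require Logger

  def start_link(_), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok), do: {:ok, %{registry: %{}}}

  # The intended path: :untrack is handled synchronously as a call.
  @impl true
  def handle_call({:untrack, pid}, _from, state),
    do: {:reply, :ok, %{state | registry: Map.delete(state.registry, pid)}}

  # There is no cast clause for {:untrack, pid}, so a cast with that shape
  # only reaches this catch-all and produces the warning.
  @impl true
  def handle_cast(msg, state) do
    Logger.warning("unrecognized cast: #{inspect(msg)}")
    {:noreply, state}
  end
end

# GenServer.call(TrackerSketch, {:untrack, pid})   # handled by handle_call/3
# GenServer.cast(TrackerSketch, {:untrack, pid})   # logs "unrecognized cast: {:untrack, #PID<...>}"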

jesseshieh · May 14 '19 13:05

fwiw, I also see these messages when running mix test on swarm. Specifically:

mix test test/integration_test.exs:63

And the test passes. When I change the cast to a call, the messages are gone, but the test fails.

Also, interestingly, when I delete the lines here, the tests still pass.

jesseshieh · May 14 '19 15:05

More observations

In the passing case, I see the following logs:


*DBG* 'Elixir.Swarm.Tracker' consume cast {sync_recv,<7155.141.0>,
                {{0,{0,1}},0},
                [{entry,{worker,8},
                        <7155.158.0>,#Ref<7155.3024736589.70516738.97011>,
                        #{group1 => true,
                          mfa => {'Elixir.MyApp.Worker',start_link,[]}},
                        {0,{0,0,1}}},
...
                        {0,{0,0,1}}}]} in state syncing => tracking
*DBG* 'Elixir.Swarm.Tracker' postpone cast {untrack,<0.158.0>} in state awaiting_sync_ack
*DBG* 'Elixir.Swarm.Tracker' receive cast {sync_ack,<7155.141.0>,
               {{0,{0,1}},0},
               [{entry,{worker,8},
                       <7155.159.0>,#Ref<7155.608380958.70516737.97926>,
                       #{group1 => true,
                         mfa => {'Elixir.MyApp.Worker',start_link,[]}},
                       {0,{0,2,0}}},
...
                       {0,{0,2,0}}}]} in state awaiting_sync_ack
08:06:14.211 [info]  [swarm on [email protected]] [tracker:awaiting_sync_ack] received sync acknowledgement from [email protected], syncing with remote registry
*DBG* 'Elixir.Swarm.Tracker' receive info {event,<7155.141.0>,{0,{0,0,2}},{untrack,<7155.158.0>}} in state tracking
*DBG* 'Elixir.Swarm.Tracker' consume cast {sync_ack,<7155.141.0>,
               {{0,{0,1}},0},
               [{entry,{worker,8},
                       <7155.159.0>,#Ref<7155.608380958.70516737.97926>,
                       #{group1 => true,
                         mfa => {'Elixir.MyApp.Worker',start_link,[]}},
                       {0,{0,2,0}}},
...
                       {0,{0,2,0}}}]} in state awaiting_sync_ack => tracking

Note that both nodes receive a sync_ack and the untrack is "postponed".
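
That "postpone" is a :gen_statem feature: a state function can return a :postpone action, which re-queues the event so it is delivered again after the next state change. A minimal sketch of the pattern (bare :gen_statem, hypothetical names, not Swarm's tracker code):

defmodule PostponeSketch do
  # Hypothetical module, for illustration only.
  @behaviour :gen_statem

  def start_link, do: :gen_statem.start_link(__MODULE__, :ok, [])

  @impl true
  def callback_mode, do: :state_functions

  @impl true
  def init(:ok), do: {:ok, :awaiting_sync_ack, %{}}

  # While syncing, untrack casts are deferred: :postpone re-queues the event
  # so it is delivered again once the state changes.
  def awaiting_sync_ack(:cast, {:untrack, _pid}, _data) do
    {:keep_state_and_data, :postpone}
  end

  def awaiting_sync_ack(:cast, {:sync_ack, _from}, data) do
    # Switching state releases the postponed {:untrack, pid} event,
    # which is then handled by the tracking/3 clause below.
    {:next_state, :tracking, data}
  end

  def tracking(:cast, {:untrack, pid}, data) do
    {:keep_state, Map.delete(data, pid)}
  end
end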

In the failing case, I see the following logs.

*DBG* 'Elixir.Swarm.Tracker' receive cast {sync_recv,<7155.143.0>,
                {{{0,1},0},0},
                [{entry,{worker,8},
                        <7155.160.0>,#Ref<7155.310440809.337641473.52450>,
                        #{group1 => true,
                          mfa => {'Elixir.MyApp.Worker',start_link,[]}},
                        {0,{0,1,0}}},
...
                        {0,{0,1,0}}}]} in state syncing
08:02:14.404 [info]  [swarm on [email protected]] [tracker:syncing] received registry from [email protected], merging..
*DBG* 'Elixir.Swarm.Tracker' receive call {untrack,<0.160.0>} from <7155.143.0> in state awaiting_sync_ack

Note that this time no sync_ack messages appear to be received, but the untrack does arrive, this time as a call. I'm guessing that the untrack message somehow prevents the sync_ack messages from being sent.
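
For what it's worth, here is a generic illustration (hypothetical code, not Swarm's actual tracker) of how an un-deferred synchronous message can wedge two peers: if two processes GenServer.call each other at the same time, each blocks inside its own handle_call waiting for the other's reply, and nothing else (such as sending a sync_ack) happens until the calls time out.

defmodule MutualCallSketch do
  # Hypothetical module, for illustration only.
  use GenServer

  def start_link(name), do: GenServer.start_link(__MODULE__, :ok, name: name)

  @impl true
  def init(:ok), do: {:ok, %{}}

  @impl true
  def handle_call({:ping_peer, peer}, _from, state) do
    # While blocked here we cannot answer the call the peer is making to us,
    # and the peer is equally blocked, so neither reply ever arrives.
    {:reply, GenServer.call(peer, :hello), state}
  end

  def handle_call(:hello, _from, state), do: {:reply, :world, state}
end

# {:ok, _} = MutualCallSketch.start_link(:a)
# {:ok, _} = MutualCallSketch.start_link(:b)
# Task.async(fn -> GenServer.call(:a, {:ping_peer, :b}) end)
# Task.async(fn -> GenServer.call(:b, {:ping_peer, :a}) end)
# After 5 seconds the inner calls time out and everything exits: each server
# was stuck inside handle_call/3 waiting on the other.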

jesseshieh · May 14 '19 15:05

It turns out that changing the cast to a call doesn't work because it causes a deadlock. The pull request I submitted doesn't change it to a call; it just makes the cast recognized.
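
Roughly the shape of that kind of fix, shown as a hypothetical addition to the TrackerSketch module sketched in an earlier comment (this is not the actual PR): give the cast form of the message its own clause with the same effect as the call form, minus the reply, so it no longer falls through to the catch-all and never blocks the caller the way a synchronous call could.

  # Added alongside the existing handle_call({:untrack, pid}, ...) clause in
  # the hypothetical TrackerSketch module above.
  @impl true
  def handle_cast({:untrack, pid}, state),
    do: {:noreply, %{state | registry: Map.delete(state.registry, pid)}}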

jesseshieh · May 16 '19 14:05

@jesseshieh Hi, we're also seeing this error in our app, and it causes some workers not to be registered properly. It seems to happen when nodes lose their connection to each other (maybe because libcluster's Kubernetes strategy isn't totally stable?) and then reconnect.

Did your patch fix the issue? If so, we'll just host the patched version in our repo and point the library at that version.

Thanks.

x-ji · Feb 17 '20 14:02

I can't remember if the patch fixed it or not since it was so long ago, but I've since decided to use something other than swarm so I'm not sure I'll be of much help.

jesseshieh · Feb 17 '20 15:02

> I can't remember if the patch fixed it or not since it was so long ago, but I've since decided to use something other than swarm so I'm not sure I'll be of much help.

@jesseshieh what did you use instead?

bernardo-martinez · Aug 04 '22 09:08

@bernardo-martinez I still saw the issue on every deployment. I ended up using horde as a replacement.

huruk · Aug 29 '22 15:08