Swarm tracker clock gets weird, eats memory
We're running a small application in our kubernetes cluster and we're encountering a few problems.
Specifically, we've got nodes that will deadlock on startup like #91 and clocks getting out of sync left and right as in https://github.com/bitwalker/swarm/issues/21#issuecomment-378893288,
These problems don't occur reliably, and I haven't been able to reproduce them locally or even with an isolated version of the application in the production cluster, so tracking them down and figuring out what is going on has proven so far impossible.
One thing that we have reliably noticed is that the tracker clocks get kinda wild. See here:
iex([email protected])1> :sys.get_state(Swarm.Tracker)
{:tracking,
%Swarm.Tracker.TrackerState{
clock: {{0,
{0,
{0,
{0,
{0,
{0,
{0,
{0,
{0,
{0,
{0,
{0, {0, {0, {0, {0, {0, {0, {0, {0, {0, {0, {0, {...}}}}}}}}}}}}}}}}}}}}}}}},
0},
nodes: [:"[email protected]", :"[email protected]"],
pending_sync_reqs: [],
self: :"[email protected]",
strategy: #<Ring[:"[email protected]", :"[email protected]", :"[email protected]"]>,
sync_node: nil,
sync_ref: nil
}}
That example is after the node had been up approximately 8 minutes.
I killed a node and reconnected as soon as it was available (~10s or so) and the state looked like this:
iex([email protected])1> :sys.get_state(Swarm.Tracker)
{:tracking,
%Swarm.Tracker.TrackerState{
clock: {{0,
{0,
{0,
{0,
{0,
{0,
{0,
{0,
{0,
{0,
{0,
{0, {0, {0, {0, {0, {0, {0, {0, {0, {0, {0, {0, {...}}}}}}}}}}}}}}}}}}}}}}}},
0},
nodes: [:"[email protected]", :"[email protected]"],
pending_sync_reqs: [],
self: :"[email protected]",
strategy: #<Ring[:"[email protected]", :"[email protected]", :"[email protected]"]>,
sync_node: nil,
sync_ref: nil
}}
The other nodes were none to happy about this restart:
21:30:24.493 [warn] [swarm on [email protected]] [tracker:handle_call] failed to execute remote get_by_name on :"[email protected]": {:EXIT, {:badarg, [{:ets, :lookup, [:swarm_registry, "alias:1223585:83c9eaa2540ffb4f"], []}, {Swarm.Registry, :get_by_name, 1, [file: 'lib/swarm/registry.ex', line: 148]}, {:rpc, :"-handle_call_call/6-fun-0-", 5, [file: 'rpc.erl', line: 197]}]}}
21:30:24.586 [warn] [swarm on [email protected]] [tracker:handle_call] failed to execute remote get_by_name on :"[email protected]": {:EXIT, {:badarg, [{:ets, :lookup, [:swarm_registry, "lead:26951:f5nil3ngubszyxp9"], []}, {Swarm.Registry, :get_by_name, 1, [file: 'lib/swarm/registry.ex', line: 148]}, {:rpc, :"-handle_call_call/6-fun-0-", 5, [file: 'rpc.erl', line: 197]}]}}
21:30:24.658 [warn] [swarm on [email protected]] [tracker:handle_call] failed to execute remote get_by_name on :"[email protected]": {:EXIT, {:badarg, [{:ets, :lookup, [:swarm_registry, "entrant:2314846:1617034e75bf1637"], []}, {Swarm.Registry, :get_by_name, 1, [file: 'lib/swarm/registry.ex', line: 148]}, {:rpc, :"-handle_call_call/6-fun-0-", 5, [file: 'rpc.erl', line: 197]}]}}
21:30:24.699 [warn] [swarm on [email protected]] [tracker:handle_call] failed to execute remote get_by_name on :"[email protected]": {:EXIT, {:badarg, [{:ets, :lookup, [:swarm_registry, "alias:1223585:83c9eaa2540ffb4f"], []}, {Swarm.Registry, :get_by_name, 1, [file: 'lib/swarm/registry.ex', line: 148]}, {:rpc, :"-handle_call_call/6-fun-0-", 5, [file: 'rpc.erl', line: 197]}]}}
21:30:24.763 [warn] [swarm on [email protected]] [tracker:handle_call] failed to execute remote get_by_name on :"[email protected]": {:EXIT, {:badarg, [{:ets, :lookup, [:swarm_registry, "alias:1223585:83c9eaa2540ffb4f"], []}, {Swarm.Registry, :get_by_name, 1, [file: 'lib/swarm/registry.ex', line: 148]}, {:rpc, :"-handle_call_call/6-fun-0-", 5, [file: 'rpc.erl', line: 197]}]}}
21:30:25.026 [warn] [swarm on [email protected]] [tracker:handle_call] failed to execute remote get_by_name on :"[email protected]": {:EXIT, {:badarg, [{:ets, :lookup, [:swarm_registry, "alias:1223585:83c9eaa2540ffb4f"], []}, {Swarm.Registry, :get_by_name, 1, [file: 'lib/swarm/registry.ex', line: 148]}, {:rpc, :"-handle_call_call/6-fun-0-", 5, [file: 'rpc.erl', line: 197]}]}}
21:30:25.036 [warn] [swarm on [email protected]] [tracker:handle_call] failed to execute remote get_by_name on :"[email protected]": {:EXIT, {:badarg, [{:ets, :lookup, [:swarm_registry, "entrant:2268963:ry8i9l30ncx9gsbm"], []}, {Swarm.Registry, :get_by_name, 1, [file: 'lib/swarm/registry.ex', line: 148]}, {:rpc, :"-handle_call_call/6-fun-0-", 5, [file: 'rpc.erl', line: 197]}]}}
21:30:25.425 [warn] [swarm on [email protected]] [tracker:handle_call] failed to execute remote get_by_name on :"[email protected]": {:EXIT, {:badarg, [{:ets, :lookup, [:swarm_registry, "entrant:2302935:elgx5nl1mdnciy6m"], []}, {Swarm.Registry, :get_by_name, 1, [file: 'lib/swarm/registry.ex', line: 148]}, {:rpc, :"-handle_call_call/6-fun-0-", 5, [file: 'rpc.erl', line: 197]}]}}
21:41:06.042 [warn] [swarm on [email protected]] [tracker:handle_call] failed to execute remote get_by_name on :"[email protected]": {:EXIT, {:badarg, [{:ets, :lookup, [:swarm_registry, "entrant:2271277:wjtwl0qpbn1eeyzk"], []}, {Swarm.Registry, :get_by_name, 1, [file: 'lib/swarm/registry.ex', line: 148]}, {:rpc, :"-handle_call_call/6-fun-0-", 5, [file: 'rpc.erl', line: 197]}]}}
21:41:06.437 [warn] [swarm on [email protected]] [tracker:handle_call] failed to execute remote get_by_name on :"[email protected]": {:EXIT, {:badarg, [{:ets, :lookup, [:swarm_registry, "entrant:2271277:y5j7xaxbqvabkwd3"], []}, {Swarm.Registry, :get_by_name, 1, [file: 'lib/swarm/registry.ex', line: 148]}, {:rpc, :"-handle_call_call/6-fun-0-", 5, [file: 'rpc.erl', line: 197]}]}}
We're running 7ed7b2 right now and haven't been able improve the behaviour of the application.
Is anyone able to shed some light on these issues?
I've encountered some weird clock drift as well and can reproduce somewhat reliably. It is an issue what @bitwalker and I are aware of and are working to address.