swarm icon indicating copy to clipboard operation
swarm copied to clipboard

Swarm gets blocked/deadlocked during tests

Open sashaafm opened this issue 5 years ago • 3 comments

Hello @bitwalker, we've been using Swarm for some time now and recently we started having a problem where our test suite seems to get blocked or deadlocked when it runs. It always get stuck at at the same point, printing the same logs and getting blocked there.

Below are the logs which are printed before the suite getting stuck:

[warn] [swarm on [email protected]] [tracker:start_pid_remotely] failed to start {:tracker, "user_id", "acc_id"} on nonode@nohost: nodedown, retrying operation.. pid=<0.3704.0> application=swarm module=Swarm.Logger function=warn/1 file=/presence-api/deps/swarm/lib/swarm/logger.ex line=15 
[error] [swarm on [email protected]] [tracker:start_pid_remotely] ** (ErlangError) Erlang error: {:function_clause, {:gen_statem, :call, [{Swarm.Tracker, {:error, {:invalid_ring, :no_nodes}}}, {:track, {:tracker, "user_id", "acc_id"}, %{mfa: {Presence.Users.Swarm, :start, [{:user, "user_id", "acc_id"}, {:controller, "user_id", "acc_id"}, {:user_supervisor, "user_id", "acc_id"}, {:tracker, "user_id", "acc_id"}, %Presence.Models.User{account_id: "acc_id", user_id: "user_id"}]}}}, :infinity]}}
    (stdlib) gen.erl:253: :gen.do_for_proc({Swarm.Tracker, {:error, {:invalid_ring, :no_nodes}}}, #Function<0.94381411/1 in :gen.call/4>)
    (stdlib) gen_statem.erl:591: :gen_statem.call_dirty/4
    (swarm) lib/swarm/tracker/tracker.ex:1115: Swarm.Tracker.start_pid_remotely/6
    (elixir) lib/task/supervised.ex:88: Task.Supervised.do_apply/2
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
 pid=<0.3704.0> application=swarm module=Swarm.Logger function=error/1 file=/presence-api/deps/swarm/lib/swarm/logger.ex line=27 
[error] Task #PID<0.3704.0> started from Swarm.Tracker terminating
** (Protocol.UndefinedError) protocol String.Chars not implemented for {:error, {:invalid_ring, :no_nodes}}. This protocol is implemented for: Atom, BitString, Date, DateTime, Float, Integer, List, NaiveDateTime, Time, URI, Version, Version.Requirement
    (elixir) /usr/local/src/elixir/lib/elixir/lib/string/chars.ex:3: String.Chars.impl_for!/1
    (elixir) /usr/local/src/elixir/lib/elixir/lib/string/chars.ex:22: String.Chars.to_string/1
    (swarm) lib/swarm/tracker/tracker.ex:1152: Swarm.Tracker.start_pid_remotely/6
    (elixir) lib/task/supervised.ex:88: Task.Supervised.do_apply/2
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Function: #Function<1.98420104/0 in Swarm.Tracker.do_track/2>
    Args: [] pid=<0.3704.0> error_logger=format 

So far we've got no idea why it is getting stuck. Even worse, I had a hunch about what could be causing the problem. I altered the code I thought was problematic and the suite no longer got stuck (locally, running in a Docker container). However, the same problem seems to persist in our CI pipeline, which runs the same exact Docker image and test script. I've even executed the tests with the --trace flag, which serialises all tests and should prevent deadlocks (I think).

I'm not sure if this is a bug in Swarm, or how we should proceed to try to dig deeper. I also find it quite odd that ExUnit isn't timing out... Maybe it's getting stuck between test cases?

Also what could cause such behaviour?

sashaafm avatar Jul 30 '18 17:07 sashaafm

The :no_nodes error seems to me that the internal ring does not have any nodes to connect to. Are you using a blacklist or whitelist in your tests?

arjan avatar Jul 30 '18 18:07 arjan

We do have a blacklist:

config :swarm,
  node_blacklist: [
    ~r/^primary@.+$/
  ]

I've noticed these exact logs also pop up in other places during the tests and they also get stuck for a little while. However, they soon pass and the tests continue, while this other test case keeps getting stuck for ever.

sashaafm avatar Jul 31 '18 09:07 sashaafm

Could you check my branch on PR #94? it should fix all current issues with the test cases.

arjan avatar Jul 31 '18 10:07 arjan