raft
raft copied to clipboard
Random test failure: "Somehow we have more logs then commands"
What are we doing to trigger the bug:
while (mix test test/fuzzy/partitions_test.exs:36); do sleep 1; done
What we expect:
When running the fuzzy/partitions_test repeatedly, we expect it to pass every time.
What happens:
Some runs trigger a test failure resembling the following (sometimes there is one Entry, sometimes two):
Logs we don't understand: [
%Raft.Log.Entry{data: {:put, -34}, index: 471, term: 3, type: :command},
%Raft.Log.Entry{data: {:put, -34}, index: 472, term: 3, type: :command}
]
1) test node shutdowns (Raft.Fuzzy.PartitionsTest)
test/fuzzy/partitions_test.exs:36
** (RuntimeError) Somehow we have more logs then commands wtf.
code: assert Cluster.all_logs_match(cluster, commands)
stacktrace:
(raft) test/support/cluster.ex:132: Raft.Support.Cluster.compare_logs/3
(raft) test/support/cluster.ex:123: Raft.Support.Cluster.missing_writes_on_server/2
(elixir) lib/enum.ex:1336: Enum."-map/2-lists^map/1-0-"/2
(elixir) lib/enum.ex:1336: Enum."-map/2-lists^map/1-0-"/2
(raft) test/support/cluster.ex:117: Raft.Support.Cluster.missing_writes/2
(raft) test/support/cluster.ex:107: Raft.Support.Cluster.verify_logs/2
test/fuzzy/partitions_test.exs:61: (test)
The following output was logged:
16:30:46.462 [error] Task #PID<0.232.0> started from :s0 terminating
** (stop) exited in: :gen_statem.call({:s3, :nonode@nohost}, %Raft.RPC.AppendEntriesReq{entries: [], from: {:s0, :nonode@nohost}, leader_commit: 1, leader_id: {:s0, :nonode@nohost}, prev_log_index: 1, prev_log_term: 1, term: 1, to: {:s3, :nonode@nohost}}, :infinity)
** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
(stdlib) gen.erl:228: :gen.do_for_proc/2
(stdlib) gen_statem.erl:619: :gen_statem.call_dirty/4
(raft) lib/raft/rpc.ex:80: Raft.RPC.do_send/1
(elixir) lib/task/supervised.ex:90: Task.Supervised.invoke_mfa/2
(stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Function: #Function<2.92344542/0 in Raft.RPC.send_msg/1>
Args: []
Additional info:
We have added the following debug output to try to understand what precisely goes wrong during the failed test runs.
diff --git a/test/support/cluster.ex b/test/support/cluster.ex
index 25968bd..de65f99 100644
--- a/test/support/cluster.ex
+++ b/test/support/cluster.ex
@@ -87,6 +87,8 @@ defmodule Raft.Support.Cluster do
{s, data}
end
+ IO.inspect data, label: :data, limit: :infinity
+ IO.inspect commands, label: :commands, limit: :infinity
verify_terms(data) && verify_logs(data, commands)
end
We have also simplified the StreamData generator to make the command logs easier to read:
diff --git a/test/fuzzy/partitions_test.exs b/test/fuzzy/partitions_test.exs
index 6bee922..11e7bd4 100644
--- a/test/fuzzy/partitions_test.exs
+++ b/test/fuzzy/partitions_test.exs
@@ -10,7 +10,7 @@ defmodule Raft.Fuzzy.PartitionsTest do
}
def commands, do: one_of([
- tuple({constant(:put), term()}),
+ tuple({constant(:put), integer()}),
constant(:pop),
])
@@ -53,7 +53,8 @@ defmodule Raft.Fuzzy.PartitionsTest do
Cluster.restart(cluster, shutdown)
end
- {commands, _errors} = Applier.stop_applying(applier)
+ {commands, errors} = Applier.stop_applying(applier)
+ IO.inspect(errors, label: :errors)
IO.inspect(commands, label: "Commands after applying")
Cluster.stop(cluster)
Attached is the full output of a failed test, run with the above changes in place.
more-logs-than-commands-no-errors.log
Linking @studzien -- we worked on this together.
Thanks for the report! I'll see if I can figure out exactly what's going on. I've finally cleared off enough other projects to spend some time on this again.