Cole Miller
It seems that our node believes an open segment `open-1` exists and contains the entries at indices 2 through 55. It decides to truncate...
@jerrytesting We've got a separate issue tracking that one: canonical/raft#386.
Mathieu's logs tell a sort of similar story, in the sense that a bunch of files go missing. On n3, at line 30245 we have ``` LIBRAFT 1692019474892072099 src/uv_list.c:92 segment...
This bit in between is also interesting: ``` LIBRAFT 1692019482619984787 src/replication.c:1055 log mismatch -> truncate (2011) LIBRAFT 1692019482619987887 src/uv_truncate.c:163 uv truncate 2011 LIBRAFT 1692019482619991088 src/uv_append.c:839 UvBarrier uv->append_next_index:2011 LIBRAFT 1692019482620014288 src/uv_append.c:622...
In jerrytesting's logs I see corrupt segments in the lists, and it's possible something different is going on.
Okay, so in Mathieu's logs, the issue is that Jepsen tries to remove n3 from the cluster (`shrink!`), which involves wiping the data directory, but the dqlite/raft process doesn't stop...
Happened again:
- [Jepsen run](https://github.com/canonical/jepsen.dqlite/actions/runs/5959659461/job/16165753261)
- [Artifact](https://github.com/canonical/raft/files/12429011/jepsen-data-append-partition.member-failure.zip)
I think this and canonical/jepsen.dqlite#125 are two manifestations of the same problem: fatal signals do not reliably cause the jepsen.dqlite app to shut down.
What will be involved in this:
- get rid of the `--enable-backtrace` switch in configure.ac, don't try to link libbacktrace
- in configure.ac, check for the `execinfo.h` header and use...
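For reference, here is a rough sketch of what `execinfo.h` provides on glibc, in case it helps whoever picks this up. The plan above is cut off, so this is not necessarily how the header ends up being used in raft. `dump_backtrace` and `MAX_FRAMES` are made-up names for illustration; only `backtrace()` and `backtrace_symbols_fd()` are real glibc calls.

```c
#include <execinfo.h> /* backtrace(), backtrace_symbols_fd() (glibc) */
#include <unistd.h>   /* STDERR_FILENO */

/* Hypothetical cap on the number of stack frames to capture. */
#define MAX_FRAMES 64

/* Hypothetical helper: write the current call stack to `fd` and return
 * the number of frames captured. backtrace_symbols_fd() does not call
 * malloc(), unlike backtrace_symbols(). */
static int dump_backtrace(int fd)
{
    void *frames[MAX_FRAMES];
    int n = backtrace(frames, MAX_FRAMES);

    backtrace_symbols_fd(frames, n, fd);
    return n;
}
```

Calling `dump_backtrace(STDERR_FILENO)` prints one line per frame to stderr. If the header check in configure.ac is done with `AC_CHECK_HEADERS([execinfo.h])`, autoconf defines `HAVE_EXECINFO_H`, which could guard both the include and the call on platforms that lack the header.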
I can't assign the issue to him, but @letFunny is working on this.