verdi-raft
verdi-raft copied to clipboard
Crash during update of snapshot causes loss of data
From @pfons on April 13, 2016 0:29
A crash of the server when it executed the function that writes a snapshot to disk (updating the existing snapshot) can cause loss of data and prevent the server from recovering correctly afterworlds.
This bug is more serious than issue #50 because it can lead to loss of data. Loss of data can happen because the server, when it crashes while executing the function save, deletes/truncates the existing disk snapshot before it safely writes the new snapshot to disk.
This problem can be reproduced by simulating a crash immediately after the snapshot file is opened with O_TRUNC (save function in Shim.ml) and before the write is actually made, for example, by adding the statement assert(env.saves < 10000);
.
It is probably harder to fix this bug than issue #50 because a correct implementation needs to ensure that several steps (i.e., replacing the old snapshot with the new snapshot and truncating the log) are atomic despite crashes.
Copied from original issue: uwplse/verdi#39