
Crash during update of snapshot causes loss of data

Open palmskog opened this issue 7 years ago • 0 comments

From @pfons on April 13, 2016 0:29

If the server crashes while it is executing the function that writes a snapshot to disk (updating the existing snapshot), data can be lost and the server may be unable to recover correctly afterwards.

This bug is more serious than issue #50 because it can lead to loss of data. The save function deletes/truncates the existing disk snapshot before the new snapshot has been safely written to disk, so a crash between those two steps leaves the server with neither the old snapshot nor the new one.

This problem can be reproduced by simulating a crash immediately after the snapshot file is opened with O_TRUNC (save function in Shim.ml) and before the write is actually made, for example, by adding the statement assert(env.saves < 10000);.
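The failure window can be illustrated with a small standalone sketch. This is not the code from Shim.ml: `unsafe_save`, `Simulated_crash`, and `file_size` are hypothetical names, and the "crash" is just an exception raised between the open and the write. It assumes OCaml 4.08 or later (for `Fun.protect`) and the `unix` library.

```ocaml
(* Hypothetical stand-in for the unsafe pattern; not the actual Shim.ml code. *)
exception Simulated_crash

let unsafe_save ~path ~contents ~crash_before_write =
  (* O_TRUNC empties the existing snapshot as soon as the file is opened. *)
  let fd =
    Unix.openfile path [Unix.O_WRONLY; Unix.O_CREAT; Unix.O_TRUNC] 0o640 in
  Fun.protect ~finally:(fun () -> Unix.close fd) (fun () ->
    (* A crash here leaves an empty file: the old data is already gone. *)
    if crash_before_write then raise Simulated_crash;
    let buf = Bytes.of_string contents in
    (* Short writes are ignored here to keep the sketch small. *)
    ignore (Unix.write fd buf 0 (Bytes.length buf)))

(* Size of the file on disk, in bytes. *)
let file_size path = (Unix.stat path).Unix.st_size
```

After a successful save followed by a save that "crashes" before its write, the snapshot file exists but has length zero, which is the state the server would find on restart.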

It is probably harder to fix this bug than issue #50 because a correct implementation needs to ensure that several steps (i.e., replacing the old snapshot with the new snapshot and truncating the log) are atomic despite crashes.
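One common way to make the snapshot replacement itself crash-safe is to write the new snapshot to a temporary file, flush it, and then rename it over the old one. The sketch below shows only that part and is not a proposed patch: `save_snapshot_atomically`, `write_all`, and the `.tmp` suffix are hypothetical, and it assumes OCaml 4.08 or later (for `Unix.fsync` and `Fun.protect`) and a POSIX file system where `rename` within one directory is atomic.

```ocaml
(* Hypothetical helper, not the code in Shim.ml. *)

(* Write the whole buffer, retrying after short writes. *)
let write_all fd (buf : Bytes.t) =
  let len = Bytes.length buf in
  let rec go off =
    if off < len then
      let n = Unix.write fd buf off (len - off) in
      go (off + n)
  in
  go 0

let save_snapshot_atomically ~(path : string) ~(contents : string) : unit =
  let tmp = path ^ ".tmp" in
  (* O_TRUNC only ever touches the temporary file, never [path]. *)
  let fd =
    Unix.openfile tmp [Unix.O_WRONLY; Unix.O_CREAT; Unix.O_TRUNC] 0o640 in
  Fun.protect ~finally:(fun () -> Unix.close fd) (fun () ->
    write_all fd (Bytes.of_string contents);
    (* Force the new snapshot to disk before it becomes visible. *)
    Unix.fsync fd);
  (* Readers see either the complete old snapshot or the complete new one. *)
  Unix.rename tmp path
```

A crash before the rename leaves the old snapshot intact, with at most a stale temporary file to clean up. This does not address the harder part raised above: the log truncation still has to be coordinated with the snapshot switch, for example by making recovery tolerate a log that overlaps the snapshot. Full durability of the rename would probably also require syncing the containing directory.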

Copied from original issue: uwplse/verdi#39

palmskog, May 21 '17 03:05