[BUG] LevelDB data loss after a crash when deployed on GlusterFS

Open cns2022 opened this issue 2 years ago • 1 comments

Description

We run a simple workload on LevelDB that inserts two key-value pairs. The two inserts go to different log files, and the first insert is issued as an asynchronous write.

The file system trace we observed is shown below:

1 append("3.log") # first insert
2 create("4.log")
3 close("3.log")
4 append("4.log") # second insert
5 fdatasync("4.log")

When LevelDB is deployed on GlusterFS, the first append (line 1) may return successfully even though the data never persists to disk. This stems from a common write optimization in distributed file systems: the client delays submitting the write to the server and reports to the application that the write finished without error.

If a failure occurs during the delayed write submission, GlusterFS makes close (line 3) return -1 to propagate the error. However, because LevelDB does not check the value returned by close, it never learns that the first insert failed.
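The trace above can be replayed with plain POSIX calls to see where a deferred error would become visible. The following is a minimal sketch, not LevelDB code: `Step` and `ReplayTrace` are hypothetical names introduced for illustration, and the file names simply mirror the trace. It records the result of every call, including close, which is the only point in this trace where a write-behind file system could report a failed append to 3.log.

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <cerrno>
#include <string>
#include <vector>

// One step of the trace: the call name and its errno (0 on success).
// Hypothetical type, used only for this illustration.
struct Step {
  std::string call;
  int err;
};

// Replays the five-step trace inside `dir` and records every result.
// Short writes are not treated as errors in this sketch.
std::vector<Step> ReplayTrace(const std::string& dir) {
  std::vector<Step> steps;
  auto record = [&steps](const char* call, long rc) {
    steps.push_back({call, rc < 0 ? errno : 0});
  };
  const std::string log3 = dir + "/3.log";
  const std::string log4 = dir + "/4.log";
  const char kFirst[] = "first insert";
  const char kSecond[] = "second insert";

  int fd3 = open(log3.c_str(), O_WRONLY | O_CREAT | O_APPEND, 0644);
  if (fd3 < 0) {
    record("open 3.log", -1);
    return steps;
  }
  record("append 3.log", write(fd3, kFirst, sizeof(kFirst) - 1));    // line 1

  int fd4 = open(log4.c_str(), O_WRONLY | O_CREAT | O_APPEND, 0644);
  record("create 4.log", fd4);                                       // line 2

  // The return value of close is recorded rather than discarded: with
  // delayed write submission, this is where a failed append is reported.
  record("close 3.log", close(fd3));                                 // line 3
  if (fd4 < 0) return steps;

  record("append 4.log", write(fd4, kSecond, sizeof(kSecond) - 1));  // line 4
  record("fdatasync 4.log", fdatasync(fd4));                         // line 5
  close(fd4);
  return steps;
}
```

On a local file system every step is expected to succeed. On a file system that defers write submission, a non-zero `err` on the "close 3.log" step would be the signal that the first insert did not reach the server.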

In GlusterFS, fdatasync("4.log") persists only the data in 4.log, not the data in 3.log. Therefore, if a crash happens after the fdatasync (line 5), LevelDB will not recover the first insert after reboot.

As a consequence, the first insert is lost while the second insert survives, which violates the ordering guarantee provided by LevelDB.

Fix

To fix the problem, we could add error handling for the close operation. When close returns an error, LevelDB should treat the preceding append as failed and either redo it or call fsync on that specific log file to force the file system to persist the write.
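One way to picture the fix is a small helper that never discards the result of close. This is a hedged sketch, not LevelDB's actual code: `SyncAndClose` is a hypothetical name, and it uses a variation on the second option by calling fsync before close, so that any pending write error is expected to surface while the descriptor is still valid.

```cpp
#include <unistd.h>

#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical helper, not part of LevelDB: flushes and closes a log file
// descriptor and reports any failure instead of ignoring it.
// Returns an empty string on success, otherwise a description of the error.
std::string SyncAndClose(int fd) {
  std::string error;

  // Force pending writes out first. On a file system that delays write
  // submission, a failed append is expected to be reported here.
  if (fsync(fd) != 0) {
    error = std::string("fsync: ") + std::strerror(errno);
  }

  // Always close to release the descriptor, and check the result. The first
  // error encountered is the one reported to the caller.
  if (close(fd) != 0 && error.empty()) {
    error = std::string("close: ") + std::strerror(errno);
  }
  return error;
}
```

A caller that receives a non-empty string would treat the earlier append as failed and redo it, rather than moving on to the next log file as if the first insert were durable. The extra fsync has a cost, so checking only the close result and redoing the append on failure is the cheaper alternative.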

cns2022 • Jan 03 '23 23:01