leveldb
[BUG] LevelDB data loss after a crash when deployed on GlusterFS
Description
We run a simple workload on LevelDB that inserts two key-value pairs. The two inserts go to different log files, and the first insert is issued as an asynchronous write.
The file system trace we observed is shown below:
1 append("3.log") # first insert
2 create("4.log")
3 close("3.log")
4 append("4.log") # second insert
5 fdatasync("4.log")
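For readers who want to replay the sequence outside LevelDB, here is a minimal sketch of the same five steps using POSIX calls. The function name ReplayTrace, the directory argument, and the payloads are assumptions made for illustration; this is not LevelDB's code. On a local file system close is expected to succeed, so the sketch only shows where the ignored return value sits.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <string>

// Hypothetical helper: replays the traced sequence inside `dir`.
// Returns the value of close("3.log") (the result the traced workload
// ignores), or -1 if any other step fails.
int ReplayTrace(const std::string& dir) {
  const std::string log3 = dir + "/3.log";
  const std::string log4 = dir + "/4.log";
  const char first[] = "key1=value1\n";   // illustrative payloads
  const char second[] = "key2=value2\n";

  int fd3 = open(log3.c_str(), O_WRONLY | O_CREAT | O_APPEND, 0644);
  if (fd3 < 0) return -1;

  // 1 append("3.log"): can report success before the data reaches the server.
  if (write(fd3, first, sizeof(first) - 1) < 0) {
    close(fd3);
    return -1;
  }

  // 2 create("4.log")
  int fd4 = open(log4.c_str(), O_WRONLY | O_CREAT | O_APPEND, 0644);
  if (fd4 < 0) {
    close(fd3);
    return -1;
  }

  // 3 close("3.log"): a deferred write error would surface here.
  int close_result = close(fd3);

  // 4 append("4.log")
  if (write(fd4, second, sizeof(second) - 1) < 0) {
    close(fd4);
    return -1;
  }

  // 5 fdatasync("4.log"): covers 4.log only, not 3.log.
  if (fdatasync(fd4) != 0) {
    close(fd4);
    return -1;
  }
  close(fd4);
  return close_result;
}
```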
When deployed on GlusterFS, the first append (line 1) may return successfully even though the data fails to persist to disk. This is due to a write optimization common in distributed file systems: the client delays submitting the write to the server and reports to the application that the write finished without error.
When a failure happens during write submission, GlusterFS makes close (line 3) return -1 to propagate the error. However, since LevelDB does not check the value returned by close, it is not aware of any error that happened during the first insert.
In GlusterFS, fdatasync("4.log") persists only the data of 4.log, not 3.log. Therefore, if a crash happens after fdatasync (line 5), LevelDB will not recover the first insert after reboot.
As a consequence, the first insert is lost but the second is not, which violates the ordering guarantee provided by LevelDB.
Fix
To fix the problem, we could add error handling for the close operation. When close returns an error, we should treat the preceding append as failed, and either redo it or call fsync on that specific log file to force the file system to persist the write.
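A minimal sketch of what this error handling could look like with POSIX calls follows. The names CloseResult, CloseChecked, and SyncAndClose are hypothetical and do not come from LevelDB, which has its own Status type and file abstractions. The first helper reports a failed close so the caller can redo the append. The second is a variant of the fsync option: it forces the data out before closing, so a deferred write error surfaces while the descriptor is still usable.

```cpp
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical result type for this sketch.
struct CloseResult {
  bool ok;
  int err;  // errno captured from the failing call, 0 on success
};

// Option A: check close() and report the failure, so the caller can treat
// the preceding append as failed and redo it.
CloseResult CloseChecked(int fd) {
  if (close(fd) != 0) {
    return {false, errno};
  }
  return {true, 0};
}

// Option B: persist the file's data before closing it.
CloseResult SyncAndClose(int fd) {
  if (fdatasync(fd) != 0) {
    int saved = errno;  // keep the sync error; close() may overwrite errno
    close(fd);
    return {false, saved};
  }
  return CloseChecked(fd);
}
```

One caveat for the fsync option: on Linux the descriptor is released even when close fails, so syncing after a failed close would require reopening the log file. Syncing before the close, as in SyncAndClose, avoids that at the cost of an extra flush per log file.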