Dirty WAL log detection and recovery

Open MarinPostma opened this issue 2 years ago • 1 comments

If we ever fail to commit to the shadow WAL after we have already committed to the SQLite WAL, we end up with a dirty WAL and abort the process. On startup, we need to detect that and patch the shadow WAL to be consistent with the SQLite database.

May 24 '23 12:05 MarinPostma

I think a good way to implement dirty WAL detection would be the following. The header of the WAL contains a triplet of numbers (x1, x2, x3). On startup, when the WAL file is first opened, we check that x1 == x2 == x3. If this doesn't hold, the previous run did not shut down cleanly, so we enter recovery mode and attempt to recover the WAL from the database. This invalidates replication, which may also need repair (repairing replication is out of scope for what I'm describing). If x1 == x2 == x3 does hold, we update the triplet to (x1 + 1, x2, x3), which marks the WAL as dirty until the next clean shutdown.
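
A minimal sketch of the startup check, in Python for brevity. The header layout (three little-endian u64 counters at offset 0) and the name `check_on_open` are assumptions made for illustration, not the actual on-disk format:

```python
import struct
from typing import Tuple

# Assumed layout: (x1, x2, x3) as three little-endian u64 at the start of the header.
TRIPLET = struct.Struct("<QQQ")


def check_on_open(header: bytes) -> Tuple[bool, bytes]:
    """Return (is_dirty, header_to_write_back).

    If the triplet is inconsistent, the WAL is dirty and the header is
    returned unchanged so that recovery can inspect it. Otherwise x1 is
    incremented, which marks the WAL as "in use" until a clean shutdown.
    """
    x1, x2, x3 = TRIPLET.unpack_from(header, 0)
    if not (x1 == x2 == x3):
        return True, header
    # Only the triplet changes; the rest of the header is preserved.
    return False, TRIPLET.pack(x1 + 1, x2, x3) + header[TRIPLET.size:]
```

Note that opening the same header twice without a clean shutdown in between reports it as dirty the second time, which is the behaviour we want after a crash.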

On a clean shutdown, we perform the following operations:

1. set x2 = x1 (basically increment x2)
2. fsync the WAL file, ensuring that all previous writes are indeed on disk
3. set x3 = x1
4. fsync once again. This last flush ensures that all writes preceding step 1 had effectively been flushed to disk when we first set x2 = x1. This is necessary because there are no guarantees about the order in which pages are written back to disk in case of failure: the header could have been updated while earlier pages had not been flushed yet.
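
The shutdown sequence could be sketched roughly as follows, again in Python. The header layout (three little-endian u64 counters at offset 0), the offsets, and the name `clean_shutdown` are assumptions for illustration only:

```python
import os
import struct

# Assumed layout: (x1, x2, x3) as three little-endian u64 at the start of the file.
TRIPLET = struct.Struct("<QQQ")
U64 = struct.Struct("<Q")
X2_OFFSET = 8
X3_OFFSET = 16


def clean_shutdown(path: str) -> None:
    """Bring the triplet back to x1 == x2 == x3 using two fsync barriers."""
    with open(path, "r+b") as f:
        x1, _x2, _x3 = TRIPLET.unpack(f.read(TRIPLET.size))

        # Set x2 = x1.
        f.seek(X2_OFFSET)
        f.write(U64.pack(x1))

        # fsync: everything written so far, including x2, is on disk.
        f.flush()
        os.fsync(f.fileno())

        # Set x3 = x1, only after the first barrier has completed.
        f.seek(X3_OFFSET)
        f.write(U64.pack(x1))

        # fsync again so the final triplet is durable.
        f.flush()
        os.fsync(f.fileno())
```

If the process dies between the two barriers, the triplet on disk is left with x3 != x1, so the next startup sees an inconsistent header and goes through recovery rather than trusting a WAL that may be only partly flushed.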

May 28 '23 19:05 MarinPostma