tigerbeetle Crash 7195108919153214527

Trips this error:

// Abort if all slots are faulty, since something is very wrong.
if (self.journal.faulty.count == constants.journal_slot_count) return error.WALInvalid;

thread 812086 panic: attempt to unwrap error: WALInvalid
/home/batiati/tigerbeetle/src/vsr/replica.zig:616:76: 0x342923 in open (simulator)
            if (self.journal.faulty.count == constants.journal_slot_count) return error.WALInvalid;
                                                                           ^
/home/batiati/tigerbeetle/src/testing/cluster.zig:493:13: 0x347585 in replica_open (simulator)
            try replica.open(
            ^
/home/batiati/tigerbeetle/src/testing/cluster.zig:435:13: 0x3e2155 in restart_replica (simulator)
            try cluster.replica_open(replica_index, .{
            ^
/home/batiati/tigerbeetle/src/simulator.zig:967:17: 0x38b830 in restart_replica (simulator)
        ) catch unreachable;
                ^
/home/batiati/tigerbeetle/src/simulator.zig:923:34: 0x3e1aac in tick_crash_down (simulator)
        simulator.restart_replica(replica.replica, fault);
                                 ^
/home/batiati/tigerbeetle/src/simulator.zig:869:51: 0x38ae3c in tick_crash (simulator)
                .down => simulator.tick_crash_down(replica),
                                                  ^
/home/batiati/tigerbeetle/src/simulator.zig:503:29: 0x35a4ab in tick (simulator)
        simulator.tick_crash();

SEED 7195108919153214527

Mar 17 '24 09:03 batiati

I see this in the log:

[debug] (storage): 4: corrupting sector at zone=vsr.Zone.wal_headers offset=0 slots=0...16
[debug] (storage): 4: corrupting sector at zone=vsr.Zone.wal_headers offset=4096 slots=16...32

So what happens here is that the VOPR corrupts two disk sectors, and they happen to be the two sectors we write headers to (in the simulator the journal is short, so all headers occupy only two sectors).
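To make the arithmetic concrete, here is a minimal Python model of the sector-to-slot mapping. The constants are inferred from the two log lines above (offsets 0 and 4096, slots 0...16 and 16...32), and the function names are hypothetical. It also assumes, as a simplification, that a slot counts as faulty whenever its header sector is corrupt; the real recovery logic in the journal also considers the prepares.

```python
# Constants inferred from the log lines above; the simulator's real
# configuration may differ.
SECTOR_SIZE = 4096        # header sectors start at offsets 0 and 4096
SLOTS_PER_SECTOR = 16     # slots=0...16 and slots=16...32
JOURNAL_SLOT_COUNT = 32   # two header sectors in total


def slots_in_sector(offset: int) -> range:
    """Slots whose redundant headers live in the header sector at `offset`."""
    sector = offset // SECTOR_SIZE
    return range(sector * SLOTS_PER_SECTOR, (sector + 1) * SLOTS_PER_SECTOR)


def faulty_slots(corrupt_offsets: list[int]) -> set[int]:
    """Simplifying assumption: a slot is faulty iff its header sector is corrupt."""
    faulty: set[int] = set()
    for offset in corrupt_offsets:
        faulty.update(slots_in_sector(offset))
    return faulty


def wal_invalid(corrupt_offsets: list[int]) -> bool:
    """Mirrors the `faulty.count == journal_slot_count` check in the model."""
    return len(faulty_slots(corrupt_offsets)) == JOURNAL_SLOT_COUNT
```

In this model, corrupting either sector alone leaves 16 usable slots, while corrupting both marks all 32 slots faulty, which is the condition that returns `error.WALInvalid`.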

What's interesting here is that both corruptions come from a torn write:

[debug] (storage): corrupting torn write on crash
[debug] (storage): 4: corrupting sector at zone=vsr.Zone.wal_headers offset=0 slots=0...16
[debug] (storage): corrupting torn write on crash
[debug] (storage): 4: corrupting sector at zone=vsr.Zone.grid offset=106496 address=27
[debug] (storage): corrupting torn write on crash
[debug] (storage): 4: corrupting sector at zone=vsr.Zone.wal_prepares offset=69632 slot=17
[debug] (storage): corrupting torn write on crash
[debug] (storage): 4: corrupting sector at zone=vsr.Zone.wal_headers offset=4096 slots=16...32

So the replica crashed while writes to all header sectors were in flight simultaneously, and each of them was torn.

Mar 18 '24 11:03 matklad

This seems like a theoretically real problem. Two potential fixes:

1. in the journal, make sure that we are not writing all slots at the same time
2. in the Replica.open, use the header from checkpoint even if the entire WAL is corrupted

I should get to doing the second option at some point
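For illustration, the first option can be sketched as a small Python model. This is not TigerBeetle's journal API: `SerializedHeaderWriter` and its methods are hypothetical names, and the model assumes the worst case, where any write in flight at crash time is torn. The idea is only that if at most one header sector write is in flight at a time, a crash can tear at most one header sector.

```python
from collections import deque


class SerializedHeaderWriter:
    """Hypothetical model: at most one header sector write in flight at a time."""

    def __init__(self) -> None:
        self.in_flight: int | None = None   # sector currently being written
        self.queue: deque[int] = deque()    # sectors waiting their turn

    def submit(self, sector: int) -> None:
        """Start writing `sector` now, or queue it behind the current write."""
        if self.in_flight is None:
            self.in_flight = sector
        elif sector != self.in_flight and sector not in self.queue:
            self.queue.append(sector)

    def complete(self) -> None:
        """The in-flight write finished; start the next queued write, if any."""
        self.in_flight = self.queue.popleft() if self.queue else None

    def crash(self) -> set[int]:
        """Worst case: the write in flight at crash time is torn."""
        return set() if self.in_flight is None else {self.in_flight}


def crash_simultaneous(sectors: list[int]) -> set[int]:
    """Worst case when every header sector write is submitted at once."""
    return set(sectors)
```

With two header sectors, `crash_simultaneous([0, 1])` can tear both, which matches the failing seed, whereas the serialized writer never reports more than one torn header sector per crash.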

Mar 18 '24 11:03 matklad

:joy:

        /// `write_prepare` uses `write_sectors` to prevent concurrent disk writes.
        // TODO To guard against torn writes, don't write simultaneously to all redundant header
        // sectors. (This is mostly a risk for single-replica clusters with small WALs).
        pub fn write_prepare(

Mar 18 '24 19:03 sentientwaffle