
ALICE-like write permuter

spacejam opened this issue 4 years ago · 12 comments

ALICE (#83) is a tool that uses strace to record file modifications, uses models of various file systems to permute copies of the output file, and then tries to load each permuted file back into the system under test to assert basic correctness properties. This test methodology found bugs in every system it targeted.

We can build a lightweight version by simply recording file writes and permuting the file in various ways. We can record a trace of file operations, represented as fsyncs, writes, truncate operations, and so on.

Once we have the trace, we can "replay" the operations to construct various log files that we then try to load, asserting various properties. When replaying, we can reorder any operations that happen between fsyncs. We can start with an empty file and grow it forward, trying different permutations of the items present depending on where writes happened in relation to syncs. There tend not to be many operations between sync calls, so our histories are not expensive to exhaustively enumerate.
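As a rough sketch, the epoch-splitting and exhaustive enumeration might look like this (all names here are hypothetical, not sled APIs):

```rust
// Hypothetical sketch: split a recorded trace into epochs at fsync
// barriers, then enumerate every ordering of the writes inside an epoch.
// `TraceOp`, `epochs`, and `permutations` are illustrative names.

#[derive(Clone, Debug, PartialEq)]
enum TraceOp {
    Write { offset: usize, data: Vec<u8> },
    Fsync,
}

// Operations between two fsyncs may be freely reordered; the epochs
// themselves must stay in order.
fn epochs(trace: &[TraceOp]) -> Vec<Vec<TraceOp>> {
    let mut out = vec![Vec::new()];
    for op in trace {
        match op {
            TraceOp::Fsync => out.push(Vec::new()),
            w => out.last_mut().unwrap().push(w.clone()),
        }
    }
    out
}

// Heap's algorithm: every ordering of one epoch's operations.
fn permutations(ops: &[TraceOp]) -> Vec<Vec<TraceOp>> {
    fn go(k: usize, a: &mut Vec<TraceOp>, acc: &mut Vec<Vec<TraceOp>>) {
        if k <= 1 {
            acc.push(a.clone());
            return;
        }
        for i in 0..k {
            go(k - 1, a, acc);
            if k % 2 == 0 {
                a.swap(i, k - 1);
            } else {
                a.swap(0, k - 1);
            }
        }
    }
    let mut a = ops.to_vec();
    let mut acc = Vec::new();
    let n = a.len();
    go(n, &mut a, &mut acc);
    acc
}

fn main() {
    let trace = vec![
        TraceOp::Write { offset: 0, data: vec![1] },
        TraceOp::Write { offset: 8, data: vec![2] },
        TraceOp::Fsync,
        TraceOp::Write { offset: 16, data: vec![3] },
    ];
    let eps = epochs(&trace);
    assert_eq!(eps.len(), 2); // one fsync barrier => two epochs
    assert_eq!(permutations(&eps[0]).len(), 2); // 2! orderings of the first epoch
    println!("{} orderings to test for epoch 0", permutations(&eps[0]).len());
}
```

Since epochs are small, the factorial blowup stays contained: we only enumerate orderings within a single epoch, never across the whole trace.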

Interesting bugs that are specific to sled's guarantees can also be tested for, because sled rewrites old log files without zeroing them first (we zero on recovery to prevent "replay"-attack-like corruption). We should be able to write arbitrary subsets of data, as well as intentionally flip bits between fsyncs, and still maintain our desired properties. sled can, in a sense, be interpreted as a heuristic for recovering partially corrupted files anyway.

spacejam avatar May 23 '20 09:05 spacejam

Our version is preferable to the actual ALICE tool because the actual tool is quite heavyweight; by instrumenting our code directly, the overall complexity can be lower and performance much higher.

spacejam avatar May 23 '20 09:05 spacejam

This is an exciting avenue! One thing I like is that it could also catch some of the issues the failpoints and crash recovery tests do, but reproducibly, by chopping off the trace at different points. (maybe once the write permuter is stood up, we could make a hybrid ALICE/failpoints test that would allow reproducing trickier things like double-recovery bugs)

I think the best way to collect traces would be either to add a bunch of IO operations to the existing event log infrastructure (perhaps behind yet another feature flag?), or to move all IO operations behind a trait, make implementations of the trait for real IO, traced IO, and mocked IO, and parameterize the PageCache on that trait. I'm not sure which is the best route; my chief concerns are the cost of copying so many buffers for the event log affecting non-ALICE tests, and the possible developer-UX impact of introducing a generic type parameter on public types, even if it has a default.
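A minimal sketch of the trait route, with hypothetical names (nothing here is sled's actual API):

```rust
use std::io::{self, Seek, SeekFrom, Write};

// Hypothetical sketch: `LogIo`, `RealIo`, `MockIo`, and `TracedIo` are
// illustrative names for the three implementations described above.
trait LogIo {
    fn write_at(&mut self, offset: u64, buf: &[u8]) -> io::Result<()>;
    fn fsync(&mut self) -> io::Result<()>;
}

// Real IO: forwards to an actual file.
#[allow(dead_code)]
struct RealIo(std::fs::File);

impl LogIo for RealIo {
    fn write_at(&mut self, offset: u64, buf: &[u8]) -> io::Result<()> {
        self.0.seek(SeekFrom::Start(offset))?;
        self.0.write_all(buf)
    }
    fn fsync(&mut self) -> io::Result<()> {
        self.0.sync_all()
    }
}

// Mock IO: an in-memory buffer, useful for deterministic tests.
struct MockIo {
    buf: Vec<u8>,
}

impl LogIo for MockIo {
    fn write_at(&mut self, offset: u64, buf: &[u8]) -> io::Result<()> {
        let end = offset as usize + buf.len();
        if self.buf.len() < end {
            self.buf.resize(end, 0);
        }
        self.buf[offset as usize..end].copy_from_slice(buf);
        Ok(())
    }
    fn fsync(&mut self) -> io::Result<()> {
        Ok(())
    }
}

#[derive(Clone, Debug, PartialEq)]
enum Recorded {
    Write { offset: u64, len: usize },
    Fsync,
}

// Traced IO: wraps any LogIo and records every operation for later replay.
struct TracedIo<T: LogIo> {
    inner: T,
    trace: Vec<Recorded>,
}

impl<T: LogIo> LogIo for TracedIo<T> {
    fn write_at(&mut self, offset: u64, buf: &[u8]) -> io::Result<()> {
        self.trace.push(Recorded::Write { offset, len: buf.len() });
        self.inner.write_at(offset, buf)
    }
    fn fsync(&mut self) -> io::Result<()> {
        self.trace.push(Recorded::Fsync);
        self.inner.fsync()
    }
}

fn main() -> io::Result<()> {
    let mut traced = TracedIo { inner: MockIo { buf: Vec::new() }, trace: Vec::new() };
    traced.write_at(0, b"abc")?;
    traced.fsync()?;
    assert_eq!(
        traced.trace,
        vec![Recorded::Write { offset: 0, len: 3 }, Recorded::Fsync]
    );
    Ok(())
}
```

Note the tracer only records offsets and lengths here; recording the buffer contents too (as a real permuter would need) is exactly where the copy-cost concern above comes in.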

divergentdave avatar May 24 '20 16:05 divergentdave

On discord, piotrek mentioned that in their research performing write-reordering tests against persistent-memory systems, the most value was achieved simply by playing writes one after another, and in reverse order. They said that a far more computationally expensive test covering many permutations took a lot of time and yielded no bugs beyond those found by the simple forward and reverse write traversals.
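A sketch of that simpler strategy (all names made up, not sled APIs): generate crash states by applying the unsynced writes front-to-back and back-to-front, one write at a time.

```rust
// Hypothetical sketch of forward/reverse write replay. `WriteOp` and
// `crash_states` are illustrative names.

#[derive(Clone)]
struct WriteOp {
    offset: usize,
    data: Vec<u8>,
}

fn apply(file: &mut Vec<u8>, w: &WriteOp) {
    let end = w.offset + w.data.len();
    if file.len() < end {
        file.resize(end, 0);
    }
    file[w.offset..end].copy_from_slice(&w.data);
}

// Every prefix of the forward ordering, then every prefix of the reverse
// ordering, as candidate post-crash file contents.
fn crash_states(base: &[u8], unsynced: &[WriteOp]) -> Vec<Vec<u8>> {
    let mut states = Vec::new();
    let mut forward = base.to_vec();
    for w in unsynced {
        apply(&mut forward, w);
        states.push(forward.clone());
    }
    let mut reverse = base.to_vec();
    for w in unsynced.iter().rev() {
        apply(&mut reverse, w);
        states.push(reverse.clone());
    }
    states
}

fn main() {
    // Two overlapping writes: order matters, so the forward and reverse
    // traversals produce different final states.
    let writes = vec![
        WriteOp { offset: 0, data: vec![1, 1] },
        WriteOp { offset: 0, data: vec![2] },
    ];
    let states = crash_states(&[], &writes);
    assert_eq!(states.len(), 4);
    assert_eq!(states[1], vec![2, 1]); // forward: second write overwrote byte 0
    assert_eq!(states[3], vec![1, 1]); // reverse: first write applied last
}
```

This produces 2n states for n unsynced writes instead of n! permutations, which matches the cost/benefit observation above.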

@divergentdave I think that could be a nice approach, or maybe having a separate simpler write recorder thing that maybe gets used like this:

let conf = sled::Config::new().temporary(true);

// attach some write recorder thing that is conditionally compiled
let recorder = conf.recorder();

let db = conf.open().unwrap();

// do workload that results in some monotonically verifiable data

// `permutations` uses the recorded log and generates various full
// sled directories where writes to the log are ordered by fsyncs.
// I think blobs can just be copied because they are pessimistically
// always fsynced? need to think about that part more...
for permuted_db_directory_path in recorder.permutations() {
    let db = sled::open(permuted_db_directory_path).expect("recovery should never fail");
    // maybe make other assertions about writes that we issued in monotonic order
    // similar to the current crash test
}

spacejam avatar May 26 '20 11:05 spacejam

I had some stray thoughts on this I wanted to jot down.

The ALICE paper says they "serialize system calls in the order they were issued", on the assumption that most applications synchronize between threads to coordinate I/O calls. This is probably sufficient to shake out stability/durability issues, but as an extension, synchronization between threads could be added to the tracing and simulation by instrumenting atomic operations, mutexes, and thread joins. For example, if each thread records the parameters, memory address, and result of each atomic operation (and of construction and destruction), then when the traces are brought together we can draw a line of causality from an operation that set a value to the next operation on each other thread that observed it. With a complete picture of all synchronization, we would be able to serialize the I/O syscalls into multiple different sequences (while still obeying the observed synchronization between threads) and then feed those into the ALICE simulation logic.
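A toy sketch of the causality-edge idea (all names hypothetical, and assuming the per-thread traces have already been merged into one indexed sequence):

```rust
// Hypothetical sketch: link each atomic load to the store on another
// thread whose value it observed. `AtomicEvent` and `causality_edges`
// are illustrative names.

#[derive(Debug, Clone, PartialEq)]
enum AtomicEvent {
    Store { thread: usize, addr: usize, value: u64 },
    Load { thread: usize, addr: usize, observed: u64 },
}

// Returns (store_index, load_index) pairs: each load is linked to the most
// recent earlier store of the value it observed at the same address.
fn causality_edges(events: &[AtomicEvent]) -> Vec<(usize, usize)> {
    let mut edges = Vec::new();
    for (i, e) in events.iter().enumerate() {
        if let AtomicEvent::Load { thread, addr, observed } = e {
            for j in (0..i).rev() {
                if let AtomicEvent::Store { thread: t, addr: a, value } = &events[j] {
                    if a == addr && value == observed && t != thread {
                        edges.push((j, i));
                        break;
                    }
                }
            }
        }
    }
    edges
}

fn main() {
    // Thread 0 publishes 7 at address 0x10; thread 1 later observes it,
    // giving one happens-before edge between the two events.
    let events = vec![
        AtomicEvent::Store { thread: 0, addr: 0x10, value: 7 },
        AtomicEvent::Load { thread: 1, addr: 0x10, observed: 7 },
    ];
    assert_eq!(causality_edges(&events), vec![(0, 1)]);
}
```

The resulting edges form the partial order that any alternative serialization of the I/O syscalls would have to respect.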

divergentdave avatar Nov 15 '20 19:11 divergentdave

Actyx is considering using sled, and I was tasked with evaluating it. In particular, the property we care about is that data is durable after a flush completes successfully. For this purpose I wrote a simple ALICE workload and checker, https://github.com/dvc94ch/alice-sled, and obtained the following results:

-------------------------------------------------------------------------------
ALICE tool version 0.0.1. Please go through the documentation, particularly the
listed caveats and limitations, before deriving any inferences from this tool.
-------------------------------------------------------------------------------
Parsing traces to determine logical operations ...
Logical operations:
0	mkdir("db", parent=510920, mode='0777', inode=510921)
1	mkdir("db/blobs", parent=510921, mode='0777', inode=510922)
2	creat("db/conf", parent=510921, mode='0666', inode=510923)
3	append("db/conf", offset=0, count=58, inode=510923)
4	append("db/conf", offset=58, count=4, inode=510923)
5	creat("db/db", parent=510921, mode='0666', inode=510924)
6	creat("db/blobs/295161", parent=510922, mode='0666', inode=510925)
7	append("db/blobs/295161", offset=0, count=4, inode=510925)
8	append("db/blobs/295161", offset=4, count=1, inode=510925)
9	append("db/blobs/295161", offset=5, count=327748, inode=510925)
10	fsync("db/db", size=0, inode=510924)
11	append("db/db", offset=0, count=295177, inode=510924)
12	fsync("db/db", size=295177, inode=510924)
-------------------------------------
Finding vulnerabilities...
(Dynamic vulnerability) Across-syscall atomicity, sometimes concerning durability: Operations 0 until 12 need to be atomically persisted
(Static vulnerability) Across-syscall atomicity: Operation B-/home/dvc/ipld/alice-sled/target/release/sled-workload:0x56044bbde468[None] until B-/usr/lib/libpthread-2.32.so:0x7f451db6757b[fsync]
Done finding vulnerabilities.

Not sure yet how to interpret them, but I assume they are false positives. Next up is dumping our existing SQLite data into sled and seeing if it blows up.

dvc94ch avatar Nov 24 '20 16:11 dvc94ch

Hmm, perhaps we should try fsyncing the directories and configuration file when creating a new database. I'd recommend turning on debugging symbols in your release profile so that the static vulnerability reports are more readable.

divergentdave avatar Nov 24 '20 17:11 divergentdave

I added fsync calls on directories, and I still got the same error messages, so that wasn't it (see below). FWIW, I had to patch the ALICE tool in a number of ways to get it to run. (adding prlimit64 and statx as innocent syscalls, fixing the typo of sync_file_range as file_sync_range throughout, and parsing sync_file_range strace lines into micro-operations). It seems that it is unmaintained and has not kept up with the kernel's API surface. I also got some additional warnings I didn't understand, which seemed to be about dependencies between operations.

Output
-------------------------------------------------------------------------------
ALICE tool version 0.0.1. Please go through the documentation, particularly the
listed caveats and limitations, before deriving any inferences from this tool. 
-------------------------------------------------------------------------------
Parsing traces to determine logical operations ...
----------------------------------------------------------
WARNING: (j_final >= i_initial and j_final <= i_final)
  File "/home/david/Source/alice/bin/alice-check", line 77, in <module>
    alicedefaultexplorer.default_checks(alice_config, threads = args.threads)
  File "/home/david/Source/alice/alicedefaultexplorer.py", line 113, in default_checks
    replayer.set_fs(defaultfs('count', 1))
  File "/home/david/Source/alice/alice.py", line 384, in set_fs
    fs.get_deps(all_diskops)
  File "/home/david/Source/alice/alicedefaultfs.py", line 269, in get_deps
    traceback.print_stack(file = sys.stdout)
----------------------------------------------------------
----------------------------------------------------------
WARNING: not (j_final >= i_initial and j_final <= i_final)
  File "/home/david/Source/alice/bin/alice-check", line 77, in <module>
    alicedefaultexplorer.default_checks(alice_config, threads = args.threads)
  File "/home/david/Source/alice/alicedefaultexplorer.py", line 113, in default_checks
    replayer.set_fs(defaultfs('count', 1))
  File "/home/david/Source/alice/alice.py", line 384, in set_fs
    fs.get_deps(all_diskops)
  File "/home/david/Source/alice/alicedefaultfs.py", line 259, in get_deps
    traceback.print_stack(file = sys.stdout)
----------------------------------------------------------
Logical operations:
0	mkdir("heap", parent=1434702, mode='0777', inode=1431748)
1	fsync("heap", size=40, inode=1431748)
2	creat("conf", parent=1434702, mode='0666', inode=1431751)
3	append("conf", offset=0, count=58, inode=1431751)
4	append("conf", offset=58, count=4, inode=1431751)
5	creat("DO_NOT_USE_THIS_DIRECTORY_FOR_ANYTHING", parent=1434702, mode='0666', inode=1431752)
6	creat("db", parent=1434702, mode='0666', inode=1431753)
7	fsync("/home/david/Code/alice-sled/workload_dir", size=120, inode=1434702)
8	creat("heap/00", parent=1431748, mode='0666', inode=1431756)
9	creat("heap/01", parent=1431748, mode='0666', inode=1431757)
10	creat("heap/02", parent=1431748, mode='0666', inode=1431758)
11	creat("heap/03", parent=1431748, mode='0666', inode=1431759)
12	creat("heap/04", parent=1431748, mode='0666', inode=1431760)
13	creat("heap/05", parent=1431748, mode='0666', inode=1431761)
14	creat("heap/06", parent=1431748, mode='0666', inode=1431762)
15	creat("heap/07", parent=1431748, mode='0666', inode=1431763)
16	creat("heap/08", parent=1431748, mode='0666', inode=1431764)
17	creat("heap/09", parent=1431748, mode='0666', inode=1431765)
18	creat("heap/10", parent=1431748, mode='0666', inode=1431766)
19	creat("heap/11", parent=1431748, mode='0666', inode=1431767)
20	creat("heap/12", parent=1431748, mode='0666', inode=1431768)
21	creat("heap/13", parent=1431748, mode='0666', inode=1431769)
22	creat("heap/14", parent=1431748, mode='0666', inode=1431770)
23	creat("heap/15", parent=1431748, mode='0666', inode=1431771)
24	creat("heap/16", parent=1431748, mode='0666', inode=1431772)
25	creat("heap/17", parent=1431748, mode='0666', inode=1431773)
26	creat("heap/18", parent=1431748, mode='0666', inode=1431774)
27	creat("heap/19", parent=1431748, mode='0666', inode=1431775)
28	creat("heap/20", parent=1431748, mode='0666', inode=1431776)
29	creat("heap/21", parent=1431748, mode='0666', inode=1431777)
30	creat("heap/22", parent=1431748, mode='0666', inode=1431778)
31	creat("heap/23", parent=1431748, mode='0666', inode=1431779)
32	creat("heap/24", parent=1431748, mode='0666', inode=1431780)
33	creat("heap/25", parent=1431748, mode='0666', inode=1431781)
34	creat("heap/26", parent=1431748, mode='0666', inode=1431782)
35	creat("heap/27", parent=1431748, mode='0666', inode=1431783)
36	creat("heap/28", parent=1431748, mode='0666', inode=1431784)
37	creat("heap/29", parent=1431748, mode='0666', inode=1431785)
38	creat("heap/30", parent=1431748, mode='0666', inode=1431786)
39	creat("heap/31", parent=1431748, mode='0666', inode=1431787)
40	trunc("db", initial_size=0, inode=1431753, final_size=524288)
41	append("db", offset=524288, count=327779, inode=1431753)
42	write("db", offset=295163, count=229125, inode=1431753)
43	sync_file_range("db", count=327779, offset=524288, inode=1431753)
44	sync_file_range("db", count=229125, offset=295163, inode=1431753)
45	write("db", offset=0, count=295163, inode=1431753)
46	sync_file_range("db", count=295163, offset=0, inode=1431753)
47	fsync("db", size=852067, inode=1431753)
48	fsync("db", size=852067, inode=1431753)
-------------------------------------
Finding vulnerabilities...
(Dynamic vulnerability) Across-syscall atomicity, sometimes concerning durability: Operations 0 until 45 need to be atomically persisted
(Static vulnerability) Across-syscall atomicity: Operation library/std/src/sys/unix/mod.rs:199[std::sys::unix::fs::DirBuilder::mkdir] until library/std/src/sys/unix/mod.rs:199[std::sys::unix::fd::FileDesc::write_at]
^CTraceback (most recent call last):

divergentdave avatar Dec 19 '20 23:12 divergentdave

Yes, I had to make various changes to ALICE too; it's a bit messy to get running. FWIW, we are sticking with SQLite for now, as the quickest way to fix our performance issues was to redesign the SQLite code.

dvc94ch avatar Dec 20 '20 00:12 dvc94ch

After puzzling over the ALICE output, I believe it may be pointing out a vulnerability that has been floating around in the back of my mind: we only use a full fsync on the db file when we decrease its size. When we allocate a new segment on the end of the log, it may be written to, and when the segment is completely written, we then issue a sync_file_range. If we crash before the metadata has been written, the system will recover with a shorter file length, which will cause reads to miss the data that has been made durable via sync_file_range.

This is the only such vulnerability that I know about, and my current expectation is that once this is fixed, ALICE should not complain about atomicity.

One solution is to use fsync instead of sync_file_range on the occasions where the segment is not being reused. This is a relatively low-impact code change that just involves storing a bool in the iobuf. The same solution needs to be applied to heap writes, which use a similar slab allocation strategy that may allocate from the file tip without first fsyncing the new length. For this, we just need to include the bool in a heap Reservation and use fsync instead of sync_file_range when the slot has been allocated from the file tip rather than reused from the middle. This solution may have some performance impact, which could later be mitigated by pre-allocating batches of the file tip in proportion to some cheap runtime measurement of contention on tip allocation: more allocator thread blocking => more file tip "prefetching" on the threadpool.
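A sketch of the decision logic, with hypothetical type names (not sled's actual types):

```rust
// Hypothetical sketch: a reservation remembers whether its slot extends
// the file tip or reuses space in the middle, and the sync strategy is
// chosen accordingly. `Reservation`, `SyncKind`, and `required_sync` are
// illustrative names.

#[derive(Debug, PartialEq)]
enum SyncKind {
    // fsync: also persists the new file length in the inode metadata.
    Fsync,
    // sync_file_range: flushes data only; safe when the slot lies entirely
    // within a file length that has already been fsynced.
    SyncFileRange,
}

#[allow(dead_code)]
struct Reservation {
    offset: u64,
    len: u64,
    from_file_tip: bool,
}

fn required_sync(res: &Reservation) -> SyncKind {
    if res.from_file_tip {
        // The write grew the file: without a full fsync, a crash could
        // recover a shorter length, and reads would miss data that
        // sync_file_range had already made durable.
        SyncKind::Fsync
    } else {
        SyncKind::SyncFileRange
    }
}

fn main() {
    let fresh = Reservation { offset: 524288, len: 4096, from_file_tip: true };
    let reused = Reservation { offset: 0, len: 4096, from_file_tip: false };
    assert_eq!(required_sync(&fresh), SyncKind::Fsync);
    assert_eq!(required_sync(&reused), SyncKind::SyncFileRange);
}
```

The bool is set once at allocation time, so the hot write path only pays for a branch when choosing the sync call.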

spacejam avatar Dec 20 '20 12:12 spacejam

I'm quite interested to hear whether ALICE still sees issues with the changes in https://github.com/spacejam/sled/pull/1224

spacejam avatar Dec 20 '20 12:12 spacejam

I re-ran it with the fsync changes and got a similar report about durability of operations 0 through the final write to "db". I think the issue is that the checker program is too strict: it expects all ten key-value pairs to be present with the same values no matter what, and returns an error status code otherwise. This means that if ALICE creates a crashed-state directory early in database creation, before the db.flush() call, the checker will report an invariant violation, which explains the results we're seeing. We will need to make the workload and checker communicate whether db.flush() was called, via the stdout file that gets passed, and then assert different things based on that.
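A sketch of that handshake (the marker string and function name are made up): the workload prints a marker after db.flush() returns, and the checker only enforces the strict "all pairs present" invariant when the marker appears in the recorded stdout.

```rust
use std::fs;

// Hypothetical sketch of the workload/checker stdout handshake.
// `FLUSH_MARKER` and `must_have_all_pairs` are illustrative names.

const FLUSH_MARKER: &str = "FLUSH_COMPLETE";

// Workload side, immediately after db.flush().unwrap():
//     println!("{}", FLUSH_MARKER);

// Checker side: decide which invariants apply for this crash state.
fn must_have_all_pairs(stdout_path: &str) -> bool {
    match fs::read_to_string(stdout_path) {
        Ok(s) => s.lines().any(|l| l == FLUSH_MARKER),
        Err(_) => false, // no stdout captured => flush never completed
    }
}

fn main() {
    let path = std::env::temp_dir().join("alice_workload_stdout.txt");
    fs::write(&path, "starting\nFLUSH_COMPLETE\n").unwrap();
    assert!(must_have_all_pairs(path.to_str().unwrap()));
    fs::write(&path, "starting\n").unwrap();
    assert!(!must_have_all_pairs(path.to_str().unwrap()));
    fs::remove_file(&path).ok();
}
```

When the marker is absent, the checker would still assert recovery succeeds and that whatever pairs are present are internally consistent, just not that all ten survived.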

divergentdave avatar Dec 20 '20 16:12 divergentdave

Double-recovery bugs have been hard to root out in the past; it would be good if we could apply ALICE to the second crash/recovery, as this would crank through a lot of possible recovery execution edge cases for us. From a cursory look, it appears that ALICE would be fine with tracing subprocesses, as I recall it tracks open file descriptors per process. We could create a workload that forks or spawns itself several times serially. Each child process could open a database and do operations on it for some time before aborting from a timer thread. Then ALICE's workload trace would consist of file system writes from multiple recoveries, with writes from normal database operation in between. As ALICE explores the trace, it will pick a wide variety of states that are midway through a recovery and run the checker on them, which will in turn run recovery starting from that state.
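A sketch of such a self-spawning workload (the env var name and round count are arbitrary, and the database work is elided):

```rust
use std::{env, process, thread, time::Duration};

// Hypothetical sketch of a serially re-spawning workload: the parent
// launches itself several times; each child does database work until a
// timer thread aborts it, so every subsequent run must recover from the
// previous run's torn state.

fn is_child() -> bool {
    env::var("ALICE_CHILD").is_ok()
}

fn child_workload() {
    // Abort mid-operation from a timer thread, like a crash.
    thread::spawn(|| {
        thread::sleep(Duration::from_millis(50));
        process::abort();
    });
    loop {
        // ...open the db, recover, insert monotonically ordered keys...
        // (elided; this loop is cut short by the abort above)
        thread::sleep(Duration::from_millis(5));
    }
}

fn main() {
    if is_child() {
        child_workload();
    }
    let exe = env::current_exe().expect("current_exe");
    for round in 0..3 {
        let status = process::Command::new(&exe)
            .env("ALICE_CHILD", round.to_string())
            .status()
            .expect("spawn child");
        // abort() terminates abnormally; the parent keeps going so the
        // trace accumulates writes from several crash/recovery cycles.
        assert!(!status.success());
    }
}
```

Running the whole thing under ALICE's strace wrapper would then yield a single trace interleaving writes from three recoveries, which is exactly the material needed to exercise double-recovery states.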

divergentdave avatar Dec 26 '20 15:12 divergentdave