
Towards Jepsen-like tests

Open stevana opened this issue 7 years ago • 7 comments

Jepsen is a framework for testing distributed systems. It has been used to find bugs in many systems, e.g. Kafka, Cassandra, RabbitMQ.

It is also based on performing randomly generated actions from multiple threads and then using linearisation to check correctness. However, it differs in having a thread, called the Nemesis, that is solely dedicated to fault injection. Injected faults include skewed clocks, GC/IO pauses (killall -s STOP/CONT), lost packets, timeouts, and network partitions between the distributed servers.

In order to account for the faults, Jepsen has a richer notion of history than ours, which includes the possibility of operations failing or timing out. When an action fails, we know for sure that it did not change the state of the system, whereas if an operation timed out we don't know for sure (the state could have changed, but the server's acknowledgement never reached us).
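To make the distinction concrete, a result type along these lines (the names here are made up, neither Jepsen's nor ours) would have to distinguish three cases:

data Outcome resp err
  = OkOutcome resp    -- completed successfully
  | FailOutcome err   -- failed: the state definitely did not change
  | TimedOut          -- indeterminate: the state may or may not have changed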

It would be neat if we could perform similar tests. Exactly how this is supposed to work is still not clear. Some answers can perhaps be found in the original linearizability paper or in the many blog posts and talks by "Aphyr", the author of Jepsen, some of which are linked to in the README.

Perhaps a good start would be to focus only on failed actions. This issue already pops up in some of our examples (see the discussion in #159) and does not require a nemesis thread.

A first step might be to change the type of Semantics to account for possible failures:

data Result resp err = Ok resp | Fail err

type Semantics act m err = forall resp. act Concrete resp -> m (Result resp err)

(This can later be extended to deal with timeouts.)
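As a rough sketch of where the Fail values could come from (withFailures is a hypothetical helper, not something in the library), an IO-based semantics could be wrapped so that exceptions are reported as failures:

{-# LANGUAGE RankNTypes #-}

import Control.Exception (SomeException, try)

-- Hypothetical helper: cmd plays the role of `act Concrete` above, and
-- Result is the type proposed above. Any exception thrown by the old
-- semantics is turned into a Fail instead of aborting the test run.
withFailures
  :: (forall resp. cmd resp -> IO resp)                        -- old semantics
  -> (forall resp. cmd resp -> IO (Result resp SomeException)) -- failure-aware semantics
withFailures sem act = either Fail Ok <$> try (sem act)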

The ResponseEvent constructor will need to be changed accordingly, and linearise as well. I guess the right thing to do in linearise is to not update the model and not check the post-condition when an action fails. We could also change the post-condition to make assertions about err in addition to resp, but maybe this is a refinement we can make when we see the need for it.
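Roughly, a single step of the checker could then look like this (a toy sketch with made-up names, not the actual linearise code):

-- Toy sketch: advance the model and check the post-condition on Ok,
-- leave the model untouched and skip the check on Fail.
stepModel
  :: (model -> cmd -> resp -> model)  -- transition
  -> (model -> cmd -> resp -> Bool)   -- post-condition
  -> model -> cmd -> Result resp err
  -> Maybe model                      -- Nothing means the post-condition failed
stepModel transition post m cmd (Ok resp)
  | post m cmd resp = Just (transition m cmd resp)
  | otherwise       = Nothing
stepModel _ _ m _ (Fail _) = Just m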

Thoughts?

stevana avatar Oct 03 '17 10:10 stevana

There's also the case where an action that returns a reference fails. In that case, subsequent actions that use the reference, and whose precondition requires the reference to exist in the model, will fail because the precondition is false.

stevana avatar Oct 04 '17 15:10 stevana

Crashes

If an operation does not complete for some reason (perhaps because it timed out or a critical component crashed) that operation has no completion time, and must, in general, be considered concurrent with every operation after its invocation. It may or may not execute.

A process with an operation in this state is effectively stuck, and can never invoke another operation again. If it were to invoke another operation, it would violate our single-threaded constraint: processes only do one thing at a time.

Source: https://jepsen.io/consistency

stevana avatar Jul 05 '18 15:07 stevana

@stevana my understanding is that https://github.com/advancedtelematic/quickcheck-state-machine/issues/208 is a prerequisite for Jepsen-like tests, right?

kderme avatar Apr 09 '19 09:04 kderme

@kderme: No, I don't think so. (It might make some tests easier to write, but shouldn't be necessary.)

stevana avatar Apr 09 '19 09:04 stevana

The above PR adds the ability to complete histories. Completing a history is needed when exceptions are thrown in the semantics, e.g. because of request timeouts (as will happen with partitions). We don't want to catch those exceptions, because that would lead to a non-deterministic model: if a timeout happens for a write-like command, we have no way of telling whether the request to the node timed out (the database didn't change) or the response back from the node timed out (the database was updated). So if we caught the exception, we would have to update the model with both possibilities, otherwise later read operations would fail.

What completing a history does is append a response at the end of the history for every thread/pid/worker that crashed or threw an exception. In the case of a write-like command the response is a simple ack. Because of how linearisability checking works, the checker will then try all possible interleavings of this write-like command with the later read-like commands and look for an interleaving that is consistent. So the linearisability checker handles/hides the non-determinism of the model!
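For reference, a toy version of what completion does (the types and names here are made up, not the library's actual API):

import qualified Data.Set as Set

type Pid = Int

data Event resp
  = Invoke Pid String  -- pid plus a description of the command
  | Return Pid resp    -- pid plus its response
  deriving Show

-- Append a default response (e.g. an ack) for every pid that has a pending
-- invocation but no matching response, i.e. that crashed or threw an exception.
complete :: resp -> [Event resp] -> [Event resp]
complete ack history =
    history ++ [ Return pid ack | pid <- Set.toList pending ]
  where
    pending = foldl step Set.empty history
    step acc (Invoke pid _) = Set.insert pid acc
    step acc (Return pid _) = Set.delete pid acc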

What's still not clear to me is: can we find bugs without completing, e.g. with partitions and some (simpler) way of accounting for this non-determinism in the model? Or by controlling the fault injection in a more precise way, so that we know exactly whether the timeout happened on the way to or from the node, and thus whether the database was updated or not?

stevana avatar Jul 25 '19 09:07 stevana

@kderme is currently working on adding an example which uses rqlite (a distributed version of sqlite that uses Raft for consensus), hopefully this can serve as a test bed for experimenting with distributed systems and fault injection. See the following work-in-progress branch.

As a first experiment, the idea is to try to trigger a stale read in the weak read consistency mode by either stopping and restarting nodes or by causing partitions (perhaps using blockade).

stevana avatar Jul 25 '19 09:07 stevana

What's still not clear to me is: can we find bugs without completing, e.g. with partitions and some (simpler) way of accounting for this non-determinism in the model? [...]

I think the answer is yes and that there's a trade-off here:

  1. Simple model that doesn't account for faults, needs completion, will be slow
  2. Complex model that accounts for faults, doesn't need completion, will be fast(er)

I've also learned why Jepsen doesn't have a completion function: if an operation crashes, Jepsen advances the model purely from the request. This isn't possible in our case, as our transition function also involves the response.

Let's make things a bit more concrete with an example. Consider a simple counter that starts at 0 and can be incremented:

thread 1, request: increment
thread 2, request: increment
thread 1, response: ok
thread 2, response: timeout

At this point the value of the counter could be 1 or 2, depending on whether the request on the second thread timed out on its way to the server (the counter didn't get updated) or the response timed out on its way back to the client (the counter got updated). Let's continue the execution:

thread 1, request: read
thread 3, request: read
thread 1, response: read -> 1
thread 3, response: read -> 2

This seems weird: how can read return two different values without an increment happening in between? Remember that crashed operations (e.g. the timed-out increment on the second thread) are concurrent with all operations after them, so the effect of the increment could take place between the responses of the two reads, making the history linearise.
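To spell that out, here is a toy check of the example (it ignores real-time ordering constraints, which happen not to matter here because the completed increment and the two reads are all mutually concurrent): starting from a counter value of 1 (thread 1's increment has taken effect), some ordering of the remaining operations is consistent with the observed reads.

import Data.List (permutations)

data Op = Increment | Read Int  -- Read carries the value the client observed
  deriving Show

-- Run one candidate sequential order against the counter model, checking
-- that every Read saw the model's value at that point.
consistent :: Int -> [Op] -> Bool
consistent _ []                = True
consistent n (Increment : ops) = consistent (n + 1) ops
consistent n (Read v    : ops) = v == n && consistent n ops

-- One consistent order exists, namely [Read 1, Increment, Read 2],
-- so the history linearises.
linearisable :: Bool
linearisable = any (consistent 1) (permutations [Increment, Read 1, Read 2])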

Also note that the second thread cannot be reused, as that would break the "single-threaded constraint: processes only do one thing at a time" mentioned in the comment above.

stevana avatar Jan 26 '20 10:01 stevana