Add comprehensive tests, especially for fail recovery

Open foodhype opened this issue 9 years ago • 1 comments

First, I suggest breaking up the current basic_test into many tests.

Second, I suggest having a test that (1) completely kills the process on which a node is running; (2) restarts a completely new process (with no knowledge from other processes) to replace the dead process; (3) blocks issuing any new commands until some stabilization period has passed; and (4) asserts that the new node is able to get back up to speed with the old state on its own. (Leader recovery and follower recovery are separate cases, obviously.)

Running the nodes on separate processes will guarantee that no state is shared between nodes except through asynchronous message passing. Killing the process completely and abruptly will allow testing for edge cases that occur during real machine failures, such as reusing old sockets, cleaning up resources, recovering state from scratch (for the recovering process), and failure detection/handling (by neighbor processes).
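A rough sketch of the kill-and-replace step, using only the Rust standard library. The names `spawn_node`, `kill_abruptly`, and `crash_and_replace` are hypothetical, and `program`/`args` stand in for however a node binary would actually be launched; nothing here is taken from the existing test code.

```rust
use std::io;
use std::process::{Child, Command, ExitStatus};

/// Start one node as its own OS process.
/// `program` and `args` are placeholders for the real node binary and its flags.
fn spawn_node(program: &str, args: &[&str]) -> io::Result<Child> {
    Command::new(program).args(args).spawn()
}

/// Kill the process without giving it any chance to shut down cleanly,
/// then reap it so the harness does not leave a zombie behind.
fn kill_abruptly(node: &mut Child) -> io::Result<ExitStatus> {
    node.kill()?; // SIGKILL on Unix, TerminateProcess on Windows
    node.wait()
}

/// Steps (1) and (2) of the proposed test: kill the old process, then start
/// a brand-new one in its place. Returns the exit status of the dead process
/// and a handle to the replacement.
fn crash_and_replace(program: &str, args: &[&str]) -> io::Result<(ExitStatus, Child)> {
    let mut old = spawn_node(program, args)?;
    let status = kill_abruptly(&mut old)?;
    let replacement = spawn_node(program, args)?;
    Ok((status, replacement))
}
```

The stabilization wait and the state assertions (steps 3 and 4) would follow the call to `crash_and_replace`; they depend on how the cluster exposes its state, so they are left out here.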

Common methods of testing include having a master spawn processes, issue a sequence of commands, stabilize, and then kill the process completely. Other methods include having a time bomb mechanism whereby the master server commands processes to immediately crash themselves after performing a certain number of commands, which allows more fine-grained scenario testing.
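The time bomb idea could look roughly like the following. `TimeBomb` and `run_until_crash` are hypothetical names, commands are modeled as plain strings, and the crash action is passed in as a closure so the counting logic can be checked without actually killing anything.

```rust
/// Counts commands and reports when the configured limit has been reached.
struct TimeBomb {
    limit: u64,
    seen: u64,
}

impl TimeBomb {
    /// `limit` is the number of commands to perform before crashing.
    fn new(limit: u64) -> Self {
        TimeBomb { limit, seen: 0 }
    }

    /// Record one performed command; returns true once the limit is reached.
    fn tick(&mut self) -> bool {
        self.seen += 1;
        self.seen >= self.limit
    }
}

/// Apply commands in order and invoke `crash` right after the command that
/// exhausts the bomb. Returns how many commands were applied.
fn run_until_crash(commands: &[&str], bomb: &mut TimeBomb, crash: impl Fn()) -> usize {
    let mut applied = 0;
    for _command in commands {
        // A real node would apply `_command` to its state machine here.
        applied += 1;
        if bomb.tick() {
            crash();
            return applied;
        }
    }
    applied
}
```

In a real node process the closure would be something like `|| std::process::abort()`, which terminates the process immediately without unwinding or running destructors.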

foodhype, Mar 05 '15 23:03

Absolutely! basic_test is bigger than it needs to be. Thanks for these suggestions!

Hoverbear, Mar 06 '15 05:03