redis_failover
Proposal: Real-world test cases
I wanted to document some of the real-world test cases I've been envisioning for a test suite for this library.
The Setup
It seems like it would be pretty easy to set up a local environment to test some of this stuff:
- 3 zookeeper servers
- 2 redis servers
- 2 clients
- 2 node monitors
That gives us a chance to kill or hang each component and make sure everything reacts appropriately.
Scenarios
Here is an incomplete list of tests that I think should be run against a real set of redis servers and clients.
- Kill a redis server with SIGKILL (a kill -9): ensure the failover happens immediately
- Pause a redis server (causing a hang) with SIGSTOP: ensure the monitor process notices the hang and starts a failover
- Kill the master monitor process with SIGKILL: ensure another monitor takes over
- Pause the master monitor process with SIGSTOP and then kill redis with SIGKILL: how long does this take to fail over?
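As a rough sketch of how a test harness might drive these scenarios from Ruby, the standard Process.kill call can send each signal. A throwaway Ruby child process stands in for a redis server or monitor here, and the helper names (spawn_stand_in, hang, resume, kill_hard) are made up for illustration; none of them are part of redis_failover.

```ruby
require "rbconfig"

# Hypothetical helpers for illustration only.

# Starts a child process that just sleeps, standing in for a redis
# server or node monitor in a real test run.
def spawn_stand_in
  Process.spawn(RbConfig.ruby, "-e", "sleep 60")
end

# Pauses the process with SIGSTOP (simulating a hang) and waits until
# the OS reports it as stopped. Returns the Process::Status.
def hang(pid)
  Process.kill("STOP", pid)
  _, status = Process.wait2(pid, Process::WUNTRACED)
  status
end

# Resumes a paused process with SIGCONT.
def resume(pid)
  Process.kill("CONT", pid)
end

# Kills the process with SIGKILL (the kill -9 case) and reaps it.
# Returns the Process::Status describing how it died.
def kill_hard(pid)
  Process.kill("KILL", pid)
  _, status = Process.wait2(pid)
  status
end

pid = spawn_stand_in
puts "stopped: #{hang(pid).stopped?}"
resume(pid)
puts "killed by signal: #{kill_hard(pid).termsig}"
```

In a real run the pids would belong to the redis-server and monitor processes, and the interesting measurement is how long the clients see errors after each signal.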
Monitoring
While running these tests, it would be worthwhile for the redis clients to be constantly running SET commands against redis.
Tracking the average and max times for requests would be helpful in understanding how long failover really takes. Using my metriks library may be helpful in getting those statistics easily.
I envision the redis client processes having an at_exit defined that would output statistics like the number of keys set, the number of errors, and the average and max times per SET. We could easily compare the number of keys they thought they set with the number that the final master has, to see what sort of failures happened.
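Here is a minimal sketch of what that client-side bookkeeping could look like. It uses plain Ruby timing rather than metriks, the SetStats class is a hypothetical name, and a Hash stands in for the redis client so the example runs on its own.

```ruby
# Hypothetical stats collector for a test client; not part of redis_failover.
class SetStats
  attr_reader :successes, :errors, :max_time

  def initialize
    @successes = 0
    @errors = 0
    @total_time = 0.0
    @max_time = 0.0
  end

  # Times one SET attempt (the block) and records it as a success,
  # or as an error if the block raises.
  def record
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    @successes += 1
  rescue StandardError
    @errors += 1
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    @total_time += elapsed
    @max_time = elapsed if elapsed > @max_time
  end

  # Average time per attempt, counting both successes and errors.
  def average_time
    attempts = @successes + @errors
    attempts.zero? ? 0.0 : @total_time / attempts
  end

  def report
    format("keys set: %d, errors: %d, avg: %.6fs, max: %.6fs",
           successes, errors, average_time, max_time)
  end
end

stats = SetStats.new
at_exit { warn stats.report } # print the summary when the client exits

store = {} # stand-in for the redis client; a real test would issue SET here
5.times do |i|
  stats.record do
    raise IOError, "simulated error during failover" if i == 2
    store["key:#{i}"] = i
  end
end

# Compare what the client believes it wrote with what the store holds.
puts "client thinks it set #{stats.successes}, store has #{store.size}"
```

Against a real setup, the final comparison would use the key count reported by whichever redis server ends up as master.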
Nice! Thanks for putting these testing scenarios together. I have been doing similar testing locally with a 5 node Redis cluster and 5 node ZK cluster. I also have 2 node managers. All of my testing has been with SIGKILL, however. I'd love to get your help on setting this up too. You have some great ideas here.
Using SIGSTOP and SIGCONT is a great way to ensure that everything works properly with a hung process instead of just a killed one — both cases are important to handle, but the hung case can be harder.