redis_failover

Proposal: Real-world test cases

Open eric opened this issue 13 years ago • 2 comments

I wanted to document some of the real-world test cases I've been envisioning for a test suite for this library.

The Setup

It seems like it would be pretty easy to set up a local environment to test some of this stuff:

  • 3 zookeeper servers
  • 2 redis servers
  • 2 clients
  • 2 node monitors

That would give us a chance to kill or hang each component and make sure everything reacts appropriately.

Scenarios

Here is an incomplete list of tests that I think should be run against a real set of redis servers and clients.

  • Kill a redis server with SIGKILL (a kill -9) — ensure the failover happens immediately
  • Pause a redis server (causing a hang) with SIGSTOP — ensure the monitor process notices the hang and starts a failover
  • Kill the master monitor process with SIGKILL — ensure another monitor takes over
  • Pause the master monitor process with SIGSTOP and then kill redis with SIGKILL — How long does this take to failover?
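A minimal sketch of how a test harness might drive the scenarios above, using only Ruby's standard `Process` API. The helper names (`with_paused`, `hard_kill`, `seconds_until`) are hypothetical and not part of redis_failover, and `hard_kill` assumes the harness spawned the target process itself, so that it can be reaped with `Process.wait2`.

```ruby
# Hypothetical harness helpers; none of these names come from redis_failover.

# Pause a process with SIGSTOP (simulating a hang), run the block,
# then resume the process with SIGCONT even if the block raises.
def with_paused(pid)
  Process.kill('STOP', pid)
  yield
ensure
  Process.kill('CONT', pid)
end

# Kill a child process with SIGKILL and reap it.
# Returns the Process::Status so the caller can confirm how it died.
# Assumes pid is a child of the current process.
def hard_kill(pid)
  Process.kill('KILL', pid)
  _, status = Process.wait2(pid)
  status
end

# Poll the block until it returns a truthy value.
# Returns the elapsed seconds, or nil if the timeout passes first.
def seconds_until(timeout: 30, interval: 0.05)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  loop do
    done = yield
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    return elapsed if done
    return nil if elapsed >= timeout
    sleep interval
  end
end
```

For the last scenario, the harness could wrap the SIGKILL of redis in `with_paused(monitor_pid) { ... }` and then call `seconds_until` with a block that checks whether a client can write again, which answers the "how long does this take" question with a number.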

Monitoring

While running these tests, it would be worthwhile for the redis clients to be constantly running SET commands against redis.

Tracking the average and max times for requests would be helpful in understanding how long failover really takes. Using my metriks library may be helpful in getting those statistics easily.

I envision the redis client processes having an at_exit defined that would output statistics like the number of keys set, the number of errors, and the average and max times per SET. We could easily compare the number of keys they thought they set with the number that the final master has, to see what sort of failures happened.
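As a rough sketch of that idea in plain Ruby (without metriks, to keep it self-contained), a client process could wrap every SET in a small recorder and print a summary on exit. The `SetStats` class and its method names are hypothetical, and the block passed to `record` stands in for whatever call the client actually makes.

```ruby
# Hypothetical per-client statistics recorder; not part of redis_failover.
class SetStats
  attr_reader :keys_set, :errors, :max

  def initialize
    @keys_set = 0
    @errors = 0
    @total = 0.0
    @max = 0.0
  end

  # Time one SET attempt. The block performs the actual call;
  # any StandardError it raises is counted as an error, not re-raised.
  def record
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    begin
      yield
      @keys_set += 1
    rescue StandardError
      @errors += 1
    ensure
      elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
      @total += elapsed
      @max = elapsed if elapsed > @max
    end
    nil
  end

  # Average seconds per attempt (successes and errors together).
  def average
    attempts = @keys_set + @errors
    attempts.zero? ? 0.0 : @total / attempts
  end

  def report
    format('keys set: %d, errors: %d, avg: %.6fs, max: %.6fs',
           @keys_set, @errors, average, @max)
  end

  # Print the summary when the client process exits.
  def report_at_exit(io = $stderr)
    at_exit { io.puts report }
  end
end
```

A client would call `stats.report_at_exit` once at startup and then loop over something like `stats.record { client.set(key, value) }`, where `client` is whatever redis client object is under test. The `keys_set` count is the number to compare against the key count on the final master.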

eric, Apr 22 '12 10:04

Nice! Thanks for putting these testing scenarios together. I have been doing similar testing locally with a 5 node Redis cluster and 5 node ZK cluster. I also have 2 node managers. All of my testing has been with SIGKILL, however. I'd love to get your help on setting this up too. You have some great ideas here.

ryanlecompte, Apr 22 '12 10:04

Using SIGSTOP and SIGCONT is a great way to ensure that everything works properly with a hung process instead of just a killed one — both cases are important to handle, but the hung case can be harder.

eric, Apr 22 '12 10:04