Make twemproxy work with redis-sentinel
Hi, @manjuraj @idning, I made a patch for twemproxy that lets it work with redis-sentinel. The patch implements the design discussed in issue #297.
In addition, I added keepalive to the sentinel pub/sub connection. Because the sentinel pub/sub connection has no heartbeat, we need keepalive to detect when the connection is dead. The implementation of nc_set_keepalive is copied from redis.
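For reference, a minimal sketch of what that keepalive setup looks like, modeled on redis's anetKeepAlive(); the exact signature and the interval values used by the patch are assumptions here:

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    /* Sketch: enable TCP keepalive on a connected socket so a dead
     * sentinel pub/sub peer is eventually detected. The Linux-only
     * knobs below follow redis's anetKeepAlive(). */
    static int
    nc_set_keepalive(int sd, int interval)
    {
        int val = 1;

        /* turn on periodic keepalive probes for this socket */
        if (setsockopt(sd, SOL_SOCKET, SO_KEEPALIVE, &val, sizeof(val)) < 0) {
            return -1;
        }

    #ifdef __linux__
        /* start probing after `interval` seconds of idleness */
        val = interval;
        if (setsockopt(sd, IPPROTO_TCP, TCP_KEEPIDLE, &val, sizeof(val)) < 0) {
            return -1;
        }

        /* then probe every interval/3 seconds ... */
        val = interval / 3;
        if (val == 0) {
            val = 1;
        }
        if (setsockopt(sd, IPPROTO_TCP, TCP_KEEPINTVL, &val, sizeof(val)) < 0) {
            return -1;
        }

        /* ... and declare the peer dead after 3 unanswered probes */
        val = 3;
        if (setsockopt(sd, IPPROTO_TCP, TCP_KEEPCNT, &val, sizeof(val)) < 0) {
            return -1;
        }
    #endif

        return 0;
    }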
You can configure a server pool like the sigma example in conf/nutcracker.yml to use this feature.
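A rough sketch of what such a pool might look like; the sentinels key name and the addresses here are assumptions, so check the sigma example in conf/nutcracker.yml for the authoritative layout:

    sigma:
      listen: 127.0.0.1:22124
      hash: fnv1a_64
      distribution: ketama
      redis: true
      # sentinel instances the proxy subscribes to for failover events (assumed key name)
      sentinels:
        - 127.0.0.1:26379
        - 127.0.0.1:26380
        - 127.0.0.1:26381
      # each server name must match a master name monitored by sentinel
      servers:
        - 127.0.0.1:6379:1 server1
        - 127.0.0.1:6380:1 server2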
I hope to hear your feedback about this patch.
This is looking really good. I'm going to see about testing this under load on Tagged and simulating some failovers to give it a real-world test at scale.
@mishan Great! I have tested the patch with some cases and used redis-benchmark to test it under load, but it's essential to test it under a real-world workload.
@andyqzb Sorry for the lack of updates from me; I should get the chance to test this within a couple of weeks. I've made this a higher priority.
Question: if we're using sentinel, why have Redis server entries in the config when we should be getting them from Sentinel?
@TheRealBill It's a good idea; we could drop the servers config when using sentinel. However, there are some detail problems to think through, which I mentioned in issue #297, so I didn't implement it in this patch, to keep things simple at the beginning. It's a candidate feature for the future.
@andyqzb I'm having some issues running this with redis-sentinel 2.9.11:
[2015-03-25 16:04:36.736] nc.c:192 run, rabbit run / dig that hole, forget the sun / and when at last the work is done / don't sit down / it's time to dig another one
[2015-03-25 16:04:36.739] nc_sentinel.c:186 unknown server name:7
[2015-03-25 16:04:36.739] nc_sentinel.c:457 sentinel's response error, close sentinel conn.
[2015-03-25 16:04:36.739] nc_mbuf.c:300 mbuf read string failed, split char :
[2015-03-25 16:04:36.739] nc_sentinel.c:457 sentinel's response error, close sentinel conn.
[2015-03-25 16:04:36.739] nc_mbuf.c:300 mbuf read string failed, split char :
[2015-03-25 16:04:36.739] nc_sentinel.c:457 sentinel's response error, close sentinel conn.
Here's the config I'm using: http://pastebin.com/txtRVRj6
Any ideas?
@mishan Maybe the output of sentinel has changed. I have tested the patch with sentinel 2.6 and 2.8. I will check this problem later.
@mishan Maybe you configured a server named 7 in sentinel but not in twemproxy. twemproxy pulls all the server addresses from sentinel and assumes that every server it receives is also configured in twemproxy, so it treats it as an error if sentinel returns a server that isn't in the twemproxy config.
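In other words, assuming the patch matches twemproxy's per-server name against the sentinel master name, the two configs have to line up like this (illustrative addresses):

    # in nutcracker.yml, under servers:
    - 10.0.0.1:6379:1 server1
    # in sentinel.conf, the monitored master must use the same name:
    sentinel monitor server1 10.0.0.1 6379 2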
@andyqzb Oh, my config just has the masters in it, not the slaves. That could explain it. I'll double check.
Yeah, I see my problem now, actually. There was a mismatch between the naming convention in nutcracker.yml and the sentinel configs. Will fix and report back on results of testing. Thanks!
@mishan How is the testing going?
@andyqzb Sorry, I haven't had the chance to try this with a production service yet; lots of work interruptions have come up. I have it running on a development server now and will probably have time to test it on production next week. I'll let you know soon.
Let's get this merged. @andyqzb, can you please add some cases to the test suite:
https://github.com/twitter/twemproxy/tree/master/tests
looking forward to this feature!
@idning I'll add some sentinel test cases in the next week.
I have this operationalized in a development environment; I didn't want to rush it to production at the end of the week. I want to run an experiment next week by putting this on production. It's more than just running your patch, though, since there have been a bunch of nutcracker changes in between. But it seems like it's a go to test on prod!
Def. Don't rush this to prod. Very fragile.
@mishan how did you get on testing this?
@JamieCressey and everyone else, apologies; I'd promised to have it on production earlier. I just bought a house and moved, other things have come up, and I naturally didn't want to rush it. I have it working normally on dev/stage.
I was just playing around with failing over on stage and ran into an issue with my application initially failing to connect to a shard. The problem might be either that the .yml file doesn't get rewritten due to permissions (it's managed by Puppet, so we have to work around that) or that the application I'm using somehow doesn't handle it gracefully. Since twemproxy is a proxy, I don't see how the latter could be the case.
So, tl;dr: I need to do some more stage testing with manual failover, as I haven't been entirely successful so far. Once that gets a green flag, I won't hesitate to push this to a prod tier that is not high risk but has plenty of volume.
I have some limited time to poke at this tomorrow and then I'm out the next week. I really want to see this working, so I may end up spending some time next week or delegating the testing task to a teammate.
More to come.
Any news?
@idning @mishan @maralla and everyone else, sorry for the delay in adding test cases. I just merged the updates from master into my branch; @mishan can use the latest code for testing. I will add the sentinel test cases this weekend.
@andyqzb I confirmed my issue was that the new .yml file couldn't be written. I did some testing in a dev environment with a sudden shutdown of all redis masters (worst-case scenario), and now it works. With the build I made off the latest of this branch, there's about a 10-second window after shutting down all the masters (20 of them) before my requests stop returning errors from nutcracker.
[2015-07-15 20:15:03.733] nc_response.c:118 s 29 active 0 is done
[2015-07-15 20:15:03.757] nc_response.c:118 s 27 active 0 is done
[2015-07-15 20:15:03.776] nc_response.c:118 s 35 active 0 is done
[2015-07-15 20:15:03.800] nc_response.c:118 s 31 active 0 is done
[2015-07-15 20:15:03.876] nc_response.c:118 s 33 active 0 is done
[2015-07-15 20:15:14.958] nc_server.c:719 success switch sr-sdb000 to 10.98.20.105:6379:1
[2015-07-15 20:15:15.366] nc_server.c:719 success switch sr-sdb018 to 10.98.20.105:6397:1
[2015-07-15 20:15:15.980] nc_server.c:719 success switch sr-sdb003 to 10.98.20.105:6382:1
[2015-07-15 20:15:16.059] nc_server.c:719 success switch sr-sdb002 to 10.98.20.105:6381:1
[etc]
Granted, this is on dev, where everything runs on small VMs instead of high-performance hardware. I'll see how it goes on stage tomorrow once I get the new build and init script ready, and I'll probably even end up putting it on production for a work-in-progress tier. Even if there's 10 seconds of outage in the worst-case scenario (all masters die), this is a huge improvement over what I have now!
Thanks!
@mishan Did you configure the sentinels' down-after-milliseconds to 10s? The proxy switches to the new redis address immediately after the sentinels complete the failover.
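For example, you can lower the detection window in sentinel.conf (5 seconds here is just an illustrative value, and sr-sdb000 is one of the master names from your log):

    # consider this master down after 5s of no reply, instead of 10s
    sentinel down-after-milliseconds sr-sdb000 5000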
@andyqzb Hrm, you called it! Not sure why that ended up set so high; 10 seconds seems a little much. I'll have to look into it tomorrow.
Otherwise this is looking great then -- I've got a build ready to test on stage and hopefully production tomorrow; I'm feeling good about this!
Thanks
Well, it's happily running on production, handling requests from an application to a Redis "cluster" (not Redis Cluster) doing 1MB/s. The tests I've done on dev/stage were pretty good. I haven't had a chance to do failover tests under production load yet; that's next.
@idning I have added four test cases in tests/test_system/test_sentinel.py. They may take a little while to run, because they must wait for the slave to sync with the master, for the sentinels to discover the slaves, and for the failover to complete.
@andyqzb I've been running nutcracker from your branch on prod for well over a month without any issues! This is great! I'd like to see this get merged to master.
A future enhancement I might suggest would be a new runtime parameter so that nutcracker can read its config from read-only YML files and then maintain its own writable copies in, say, /var/run/nutcracker. I had to implement a hack to do this in the init.d script, as our YML files are managed by puppet and I can't have nutcracker modify puppet-managed files.
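The hack is roughly along these lines in the init script (paths are illustrative):

    # copy the puppet-managed, read-only config to a writable runtime location,
    # then start nutcracker against the copy so the sentinel code can rewrite it
    mkdir -p /var/run/nutcracker
    cp /etc/nutcracker/nutcracker.yml /var/run/nutcracker/nutcracker.yml
    nutcracker -d -c /var/run/nutcracker/nutcracker.yml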
But I would like to see this merged sooner rather than later.
@mishan I'm glad to hear about your testing in a production environment. @idning is reviewing the PR; let's wait for his response.
About the conf rewrite, I followed redis's example. Maintaining multiple conf files would confuse users, so I don't think that's a good solution. Maybe you can use a script triggered by sentinel to modify the puppet-managed files.
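For example, sentinel can invoke a script on every failover via client-reconfig-script; that script could then update the puppet-managed file (the script path here is hypothetical):

    # the script is called with: <master-name> <role> <state> <from-ip> <from-port> <to-ip> <to-port>
    sentinel client-reconfig-script server1 /usr/local/bin/update-nutcracker-conf.sh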
Any updates?
@andyqzb Hi, is there any update on this feature? Can I use twemproxy as the load-balancing proxy and redis sentinel as the failover solution?