twemproxy icon indicating copy to clipboard operation
twemproxy copied to clipboard

twemproxy work with redis-sentinel

Open andyqzb opened this issue 10 years ago • 48 comments

Hi, @manjuraj @idning , I made a patch for twemproxy to let it work with redis-sentinel. This patch is implemented as the design mentioned in issue 297.

In addition, I add keepalive to the sentinel pub/sub connection. Because sentinel pub/sub connection don't have heartbeat, we must add keepalive to check if the connection is dead. The implemention of nc_set_keepalive is copied from redis.

You can configure a server_pool like sigma in conf/nutcracker.yml to use this feature.

I hope to hear your feedback about this patch.

andyqzb avatar Mar 02 '15 17:03 andyqzb

This is looking really good. I'm going to see about testing this under load on Tagged and simulating some failovers to give it a real world test at scale.

mishan avatar Mar 05 '15 07:03 mishan

@mishan Great! I have tested the patch with some case, and use redis-benchmark to test it under load. But it's essential to test it under real world work load.

andyqzb avatar Mar 05 '15 09:03 andyqzb

@andyqzb Sorry for lack of updates from me, should get the chance to test this within a couple of weeks; made this a higher priority.

mishan avatar Mar 19 '15 06:03 mishan

Question: If we're using sentinel. Why have Redis server entries in the config when we should be getting that from Sentinel?

therealbill avatar Mar 23 '15 15:03 therealbill

@TheRealBill It's a good idea, we can drop the servers config if we use the sentinel. While, we should think about some detail problems which I said in issue #297 . So, I don't realize it in the patch to keep it simple at the beginning. But it's a candidate feature in the future.

andyqzb avatar Mar 24 '15 07:03 andyqzb

@andyqzb I'm having some issue running this with redis-sentinel 2.9.11

[2015-03-25 16:04:36.736] nc.c:192 run, rabbit run / dig that hole, forget the sun / and when at last the work is done / don't sit down / it's time to dig another one [2015-03-25 16:04:36.739] nc_sentinel.c:186 unknown server name:7 [2015-03-25 16:04:36.739] nc_sentinel.c:457 sentinel's response error, close sentinel conn. [2015-03-25 16:04:36.739] nc_mbuf.c:300 mbuf read string failed, split char : [2015-03-25 16:04:36.739] nc_sentinel.c:457 sentinel's response error, close sentinel conn. [2015-03-25 16:04:36.739] nc_mbuf.c:300 mbuf read string failed, split char : [2015-03-25 16:04:36.739] nc_sentinel.c:457 sentinel's response error, close sentinel conn.

Here's the config I'm using: http://pastebin.com/txtRVRj6

Any ideas?

mishan avatar Mar 25 '15 23:03 mishan

@mishan Maybe the output of sentinel has changed. I have tested the patch with sentinel 2.6 and 2.8. I will check this problem later.

andyqzb avatar Mar 26 '15 02:03 andyqzb

@mishan Maybe you config a server named 7, while you don't config it in the twemproxy. The twemproxy will pull all the server addresses in the sentinel and assume that all the server addresses got are configured in the twemproxy. So twemproy think there are errors if it gets some servers which are not configured in the twemproxy.

andyqzb avatar Mar 26 '15 16:03 andyqzb

@andyqzb Oh, my config just has the masters in it, not the slaves. That could explain it. I'll double check.

mishan avatar Mar 26 '15 23:03 mishan

Yeah, I see my problem now, actually. There was a mismatch between the naming convention in nutcracker.yml and the sentinel configs. Will fix and report back on results of testing. Thanks!

mishan avatar Mar 27 '15 01:03 mishan

@mishan How about the testing?

andyqzb avatar Apr 23 '15 02:04 andyqzb

@andyqzb sorry haven't had the chance to try this with a production service yet, have had lots of work interrupts come up. I have it running on a development server now, probably will have time to test it next week on production. I'll let you know soon

mishan avatar Apr 23 '15 06:04 mishan

Let's make this merged, @andyqzb can you please add some cases with the test suite:

https://github.com/twitter/twemproxy/tree/master/tests

idning avatar Apr 26 '15 05:04 idning

looking forward to this feature!

thekiki avatar Apr 26 '15 23:04 thekiki

@idning I'll add some sentinel test cases in the next week.

andyqzb avatar Apr 28 '15 06:04 andyqzb

I have this operationalized in a development environment, didn't want to rush it to production end of week. I want to run an experiment next week by putting this on production. It's more than just running your patch though, since there's been a bunch of nutcracker changes in between. But seems like it's a go to test on prod!

mishan avatar May 30 '15 04:05 mishan

Def. Dont rush this 2 prod. Very fragile. On May 29, 2015 11:22 PM, "Misha Nasledov" [email protected] wrote:

I have this operationalized in a development environment, didn't want to rush it to production end of week. I want to run an experiment next week by putting this on production. It's more than just running your patch though, since there's been a bunch of nutcracker changes in between. But seems like it's a go to test on prod!

— Reply to this email directly or view it on GitHub https://github.com/twitter/twemproxy/pull/324#issuecomment-106986382.

spyderdyne avatar May 30 '15 06:05 spyderdyne

@mishan how did you get on testing this?

JamieCressey avatar Jun 24 '15 21:06 JamieCressey

@JamieCressey and everyone else, apologies I've made promises to have it on production earlier. I just got a house and moved and I've had things come up and naturally didn't want to rush it. I have it working normally on dev/stage.

I was just playing around with failing over on stage and am running into an issue with my application initially failing to connect to a shard, but the problem might either be that the .yml file doesn't get rewritten due to permissions (it's managed by Puppet, so we have to get around that) or that the application I'm using somehow doesn't handle it gracefully. Since twemproxy is a proxy, I don't see how the latter could be the case.

So, tl;dr: I need to do some more stage testing with manual failover as currently I haven't been entirely successful. Once that's green flag, I won't have hesitation pushing this to a prod tier that is not high risk but has plenty of volume

I have some limited time to poke at this tomorrow and then I'm out the next week. I really want to see this working, so I may end up spending some time next week or delegating the testing task to a teammate

More to come.

mishan avatar Jul 02 '15 01:07 mishan

Any news?

maralla avatar Jul 09 '15 11:07 maralla

@idning @mishan @maralla and everyone else, sorry for the delay of adding test cases. I just merged the updates from the master to my branch. @mishan can use the latest code to do the test. I will add the sentinel test cases in this weekend.

andyqzb avatar Jul 12 '15 16:07 andyqzb

@andyqzb I confirmed my issue was with not being able to write the new .yml file -- did some testing in a dev environment with sudden shutdown of all redis masters (worse cast scenario) and now it works. There's about 10 seconds after shutting down all the masters (20 of them) before the build I made off the latest of this branch before my requests stop returning errors talking to nutcracker.

[2015-07-15 20:15:03.733] nc_response.c:118 s 29 active 0 is done
[2015-07-15 20:15:03.757] nc_response.c:118 s 27 active 0 is done
[2015-07-15 20:15:03.776] nc_response.c:118 s 35 active 0 is done
[2015-07-15 20:15:03.800] nc_response.c:118 s 31 active 0 is done
[2015-07-15 20:15:03.876] nc_response.c:118 s 33 active 0 is done
[2015-07-15 20:15:14.958] nc_server.c:719 success switch sr-sdb000 to 10.98.20.105:6379:1
[2015-07-15 20:15:15.366] nc_server.c:719 success switch sr-sdb018 to 10.98.20.105:6397:1
[2015-07-15 20:15:15.980] nc_server.c:719 success switch sr-sdb003 to 10.98.20.105:6382:1
[2015-07-15 20:15:16.059] nc_server.c:719 success switch sr-sdb002 to 10.98.20.105:6381:1
[etc]

Granted it's on dev where everything is on small VMs instead of some high performance hardware. I'll see how it goes on stage tomorrow once I get the new build and init script ready and probably even end up putting on production for a work in progress tier. Even if there's 10 seconds of outage for a worst case scenario (all masters die), this is a huge improvement of what I have now!

Thanks!

mishan avatar Jul 16 '15 03:07 mishan

@mishan Do you config the sentienls's down-after-milliseconds to 10s? The proxy will switch its redis address to the new one immediately after the sentinels do the failover.

andyqzb avatar Jul 16 '15 06:07 andyqzb

@andyqzb Hrm, you called it! Not sure why that got end up set so high, 10 secs seems a little much, I'll have to look into this tomorrow.

Otherwise this is looking great then -- I've got a build ready to test on stage and hopefully production tomorrow; I'm feeling good about this!

Thanks

mishan avatar Jul 16 '15 06:07 mishan

Well it's happily running on production handling requests from an application to a Redis "cluster" (not Redis Cluster) doing 1MB/s -- tests on dev/stage I've done were pretty good. I haven't had a chance to do failover tests under production load yet, that's next.

mishan avatar Jul 17 '15 00:07 mishan

@idning I have added four test cases in the tests/test_system/test_sentinel.py. Maybe the test cases cost a little time when you run them, because they must wait the slave sync with master, the sentinel discover the slaves and process the failover.

andyqzb avatar Jul 20 '15 16:07 andyqzb

@andyqzb I've been using your branch nutcracker on prod for well over a month without any issues! This is great! I'd like to see this get merged to master.

A future enhancement I might suggest would be to add some new kind of runtime parameter to nutcracker that it can read its configs from read-only YML files and then maintain copies in, say, /var/run/nutcracker or something. I had to implement a hack to do this in the init.d script as our YML files are managed by puppet and I can't have nutcracker modify puppet-managed files.

But I would like to move that this gets merged sooner than later.

mishan avatar Sep 10 '15 16:09 mishan

@mishan I'm glad to hear your voice about the test in product environment. @idning is reviewing the pr, let's wait for his response.

About the conf rewrite, I take example by the redis. Maintaining multiple conf will make user confused, so I don't think it's a good solution. Maybe you can use script triggerd by sentinel to modify the puppet files.

andyqzb avatar Sep 12 '15 14:09 andyqzb

Any updates?

kri5 avatar Oct 29 '15 13:10 kri5

@andyqzb , hi , is there any update about this feature? May I use the twemproxy as the loading balance proxy and redis sentinel as the solution of failover ?

BinWu1989 avatar Dec 14 '15 05:12 BinWu1989