Robust hash ring failure retry mechanism

Open charsyam opened this issue 13 years ago • 77 comments

There are two steps to check the connection.

Step 1) Try to connect to the failed server (in server_failure).

If that succeeds, go to step 2.

Step 2) Send a heartbeat command to the failed server.

If that succeeds, the hash ring is updated.

The patch keeps an array of failed servers so that only those servers are reconnected, which prevents an infinite connect loop.

In the heartbeat stage it uses msg_tmo_insert for the timeout, but if no timeout is set in the conf, it will just wait for the response like any other message.
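
For readers who want the flow at a glance, here is a minimal standalone sketch of the two-step check, using plain POSIX sockets and a Redis PING as the heartbeat. This is not the patch itself: the helper names are made up, the connect here is blocking (the real patch arms a timeout via msg_tmo_insert), and the hash-ring update is only a placeholder comment.

/*
 * Minimal sketch of the two-step check described above -- NOT the actual
 * patch. Plain POSIX sockets, reduced error handling, and the hash-ring
 * update is only a placeholder comment.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

/* step 1: try to open a TCP connection to the failed server */
static int
try_connect(const char *ip, int port)
{
    int sd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;

    if (sd < 0) {
        return -1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, ip, &addr.sin_addr);

    if (connect(sd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
        close(sd);
        return -1;
    }
    return sd;
}

/* step 2: send a heartbeat (redis PING) and expect "+PONG" back */
static int
try_heartbeat(int sd)
{
    char buf[16];
    ssize_t n;

    if (send(sd, "PING\r\n", 6, 0) != 6) {
        return -1;
    }
    n = recv(sd, buf, sizeof(buf) - 1, 0);
    if (n < 5 || strncmp(buf, "+PONG", 5) != 0) {
        return -1;
    }
    return 0;
}

int
main(void)
{
    int sd = try_connect("127.0.0.1", 6379);    /* step 1 */

    if (sd >= 0 && try_heartbeat(sd) == 0) {    /* step 2 */
        /* both steps succeeded: put the server back into the hash ring */
        printf("server is alive, restore it\n");
    }
    if (sd >= 0) {
        close(sd);
    }
    return 0;
}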

charsyam avatar Dec 20 '12 19:12 charsyam

https://github.com/twitter/twemproxy/issues/14

charsyam avatar Dec 20 '12 19:12 charsyam

I don't know why travis-ci failed.

charsyam avatar Dec 20 '12 19:12 charsyam

I restarted the build... looks like a transient failure...

caniszczyk avatar Dec 20 '12 19:12 caniszczyk

Yes, fortunately it was a transient failure. Thank you @caniszczyk.

charsyam avatar Dec 20 '12 19:12 charsyam

@charsyam I will take a look at this in the next two days

manjuraj avatar Dec 20 '12 22:12 manjuraj

@manjuraj Thank you. I will wait for your advice.

charsyam avatar Dec 20 '12 22:12 charsyam

Any update on this? It looks like a good solution, and I am also interested in it.

sushantk avatar May 25 '13 21:05 sushantk

Is there any reason the patch cannot work with random distribution? I guess nc_random.c could also be updated with the changes proposed for src/hashkit/nc_ketama.c.

sushantk avatar May 25 '13 22:05 sushantk

@sushantk I will check it as soon as possible.

charsyam avatar May 25 '13 23:05 charsyam

@charsyam - yes, we do have the timeout option in the conf with random distribution mode. It works, but it suffers from the same problem: a client request is used to reconnect on failure. I am assuming that your patch reconnects in the background before putting the server back in rotation.

I would appreciate it if you could update your patch to include the enhancement in nc_random.c; it should be the same mod that you have in nc_ketama.c.
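
For illustration only, here is a tiny sketch of the idea behind that mod: random distribution dispatching only to servers currently marked alive, with the background heartbeat flipping a recovered server back to alive. This is not the actual nc_random.c code, and the struct and function names are made up.

/*
 * Illustrative only -- not twemproxy's nc_random.c. The random distribution
 * skips ejected servers when dispatching; the background heartbeat sets the
 * alive flag again once a server has recovered.
 */
#include <stdlib.h>

struct fake_server {
    const char *name;
    int         alive;              /* cleared on ejection, set by heartbeat */
};

/* pick a random server among the live ones, or NULL if none are live */
static struct fake_server *
random_dispatch(struct fake_server *servers, int nserver)
{
    struct fake_server *live[16];   /* assume a small, fixed-size pool */
    int i, nlive = 0;

    for (i = 0; i < nserver && nlive < 16; i++) {
        if (servers[i].alive) {
            live[nlive++] = &servers[i];
        }
    }
    if (nlive == 0) {
        return NULL;
    }
    return live[rand() % nlive];
}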

sushantk avatar May 25 '13 23:05 sushantk

@sushantk Yes, it's a bug, so I will patch it as soon as possible. :) Thank you for reporting it. :)

charsyam avatar May 26 '13 00:05 charsyam

@sushantk Hi, I patched it. Please test it :) It will work with random and modula distributions.

charsyam avatar May 26 '13 00:05 charsyam

@manjuraj Hey Manju, do you plan to review and merge this patch any time soon? Thanks.

sushantk avatar May 31 '13 17:05 sushantk

Manju, I have tested this out and it seems to work fine. When will this patch be added to the master branch?

hrishimantri avatar May 31 '13 18:05 hrishimantri

@hrishimantri @sushantk I created a new branch -- twemproxy_heartbeat -- that contains @charsyam's heartbeat patch. See: https://github.com/twitter/twemproxy/tree/twemproxy_heartbeat

Please use this branch to serve your needs. If no issues surface after a month of usage (June 30th), I will go ahead and merge this branch into master.

@charsyam if you could also clean up the patch to conform to the code style detailed here: https://github.com/twitter/twemproxy/blob/master/notes/c-styleguide.txt, I would really appreciate it.

manjuraj avatar May 31 '13 18:05 manjuraj

@manjuraj Thank you. I will do it ASAP

charsyam avatar Jun 01 '13 01:06 charsyam

@charsyam @manjuraj this doesn't seem to work when server_failure_limit is set to 2. It keeps showing a "connection refused" error in the log. Setting it to 1 works fine, but when it is set to 2 it just fails to eject the node from the group. Can you check whether you see this behavior on your side too?

hrishimantri avatar Jun 12 '13 22:06 hrishimantri

@hrishimantri I will check ASAP

charsyam avatar Jun 13 '13 00:06 charsyam

@charsyam this is the config I used for testing:

redis_read:
  listen: 0:22122
  hash: murmur
  distribution: random
  auto_eject_hosts: true
  redis: true
  server_retry_timeout: 60000
  server_connections: 2
  server_failure_limit: 2
  backlog: 512
  preconnect: true
  timeout: 5000
  servers:
   - host1:6378:1 slave1
   - host2:6379:1 slave2

hrishimantri avatar Jun 13 '13 02:06 hrishimantri

@charsyam any updates on this?

hrishimantri avatar Jun 14 '13 16:06 hrishimantri

@hrishimantri Could you show me some error logs? In my case, it seems to work well.

[Sat Jun 15 11:03:20 2013] nc_response.c:120 s 6 active 0 is done
[Sat Jun 15 11:03:20 2013] nc_core.c:229 close connection: 6
[Sat Jun 15 11:03:20 2013] nc_core.c:241 close s 6 '127.0.0.1:10000' on event 0001 eof 1 done 1 rb 73 sb 150  
[Sat Jun 15 11:03:20 2013] nc_response.c:120 s 10 active 1 is done
[Sat Jun 15 11:03:20 2013] nc_core.c:229 close connection: 10
[Sat Jun 15 11:03:20 2013] nc_core.c:241 close s 10 '127.0.0.1:10000' on event 0001 eof 1 done 1 rb 75 sb 184  
[Sat Jun 15 11:03:20 2013] nc_core.c:229 close connection: 6
[Sat Jun 15 11:03:20 2013] nc_core.c:241 close s 6 '127.0.0.1:10000' on event 001D eof 0 done 0 rb 0 sb 0: Connection refused <-- this is a connection attempt
[Sat Jun 15 11:04:20 2013] nc_server.c:989 updating pool 0 'leaf',restored server 'server1'
[Sat Jun 15 11:04:20 2013] nc_response.c:179 ok reconnect 43 len 5 on s 6

Thank you. If you can show me some logs for it, I would appreciate it.

charsyam avatar Jun 15 '13 02:06 charsyam

@charsyam these are the logs I see:

[Mon Jun 17 16:55:40 2013] nc.c:177 nutcracker-0.2.4 started on pid 11560
[Mon Jun 17 16:55:40 2013] nc.c:181 run, rabbit run / dig that hole, forget the sun / and when at last the work is done / don't sit down / it's time to dig another one
[Mon Jun 17 16:55:40 2013] nc_stats.c:841 m 3 listening on '0.0.0.0:22222'
[Mon Jun 17 16:55:40 2013] nc_proxy.c:207 p 6 listening on '0:22122' in redis pool 0 'redis_read' with 2 servers
[Mon Jun 17 16:56:00 2013] nc_proxy.c:337 accepted c 7 on p 6 from '127.0.0.1:48350'
[Mon Jun 17 16:56:00 2013] nc_core.c:239 close s 8 'x.y.z.a:6378' on event 0019 eof 0 done 0 rb 0 sb 0: Connection refused
[Mon Jun 17 16:56:00 2013] nc_core.c:239 close s 8 'x.y.z.a:6378' on event 0019 eof 0 done 0 rb 0 sb 0: Connection refused
[Mon Jun 17 16:56:01 2013] nc_core.c:239 close c 7 'unknown' on event 0019 eof 0 done 0 rb 59534 sb 25: Connection reset by peer
[Mon Jun 17 16:56:09 2013] nc_proxy.c:337 accepted c 7 on p 6 from '127.0.0.1:48359'
[Mon Jun 17 16:56:10 2013] nc_core.c:239 close c 7 'unknown' on event 0019 eof 0 done 0 rb 314 sb 54: Connection reset by peer
[Mon Jun 17 16:56:14 2013] nc_proxy.c:337 accepted c 7 on p 6 from '127.0.0.1:48361'
[Mon Jun 17 16:56:14 2013] nc_proxy.c:337 accepted c 10 on p 6 from '127.0.0.1:48363'
[Mon Jun 17 16:56:14 2013] nc_core.c:239 close s 11 'x.y.z.a:6378' on event 0019 eof 0 done 0 rb 0 sb 0: Connection refused
[Mon Jun 17 16:56:14 2013] nc_core.c:239 close c 7 'unknown' on event 0019 eof 0 done 0 rb 36 sb 11: Connection reset by peer
[Mon Jun 17 16:56:14 2013] nc_core.c:239 close c 10 'unknown' on event 0019 eof 0 done 0 rb 36 sb 25: Connection reset by peer
[Mon Jun 17 16:56:18 2013] nc_proxy.c:337 accepted c 7 on p 6 from '127.0.0.1:48368'
[Mon Jun 17 16:56:19 2013] nc_core.c:239 close c 7 'unknown' on event 0019 eof 0 done 0 rb 505 sb 54: Connection reset by peer
[Mon Jun 17 16:56:23 2013] nc_proxy.c:337 accepted c 7 on p 6 from '127.0.0.1:48369'
[Mon Jun 17 16:56:23 2013] nc_core.c:239 close s 10 'x.y.z.a:6378' on event 0019 eof 0 done 0 rb 0 sb 0: Connection refused
[Mon Jun 17 16:56:23 2013] nc_core.c:239 close c 7 'unknown' on event 0019 eof 0 done 0 rb 36 sb 25: Connection reset by peer

My Maven tests keep failing, which means that requests are not being sent to the other host in the group. I am running this on RHEL 6.3 (64-bit).

hrishimantri avatar Jun 17 '13 16:06 hrishimantri

@charsyam to explain a bit more about how I'm testing this:

  1. There are two Redis instances running on two different boxes
  2. Twemproxy is installed on a client box, which talks to these two boxes
  3. The twemproxy configuration is as shown in https://github.com/twitter/twemproxy/pull/29#issuecomment-19367723
  4. I use a Jedis client to talk to Redis via twemproxy

Can you please test these scenarios:

Testcase 1) Both Redis instances are up and running (twemproxy works fine)

Testcase 2) Stop the Redis instance running on one of the boxes and check whether requests go to the other slave where Redis is up and running. If server_failure_limit is set to 2 in the twemproxy conf, the node fails to be ejected. The node does get ejected when server_failure_limit is set to 1, and requests are then routed to the other slave.

Testcase 3) Restart the Redis instance stopped in Testcase 2. The node should be restored and requests should again be distributed randomly across the two slaves.

Testcase 1 and Testcase 3 work fine, but Testcase 2 fails when server_failure_limit is set to 2. Can you please retest this on your side?
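
To make the report concrete, here is a small sketch of the consecutive-failure ejection bookkeeping this option controls. It is illustrative only, not twemproxy's actual nc_server.c code, and the names are made up; the point is that with server_failure_limit set to 1 the very first refused connection ejects the node, while a limit of 2 requires the failure counter to survive until a second failure before ejection happens.

/*
 * Illustrative only -- not twemproxy's actual nc_server.c code; the names
 * are made up. A node is ejected once the number of consecutive failures
 * reaches the configured limit, and only a success should reset the
 * counter. With a limit of 1 the first refused connection ejects the node;
 * with a limit of 2 the counter has to survive until the second failure.
 */
struct srv_state {
    int failure_count;   /* consecutive failures seen so far */
    int ejected;         /* 1 once the server is out of rotation */
};

static void
on_server_failure(struct srv_state *s, int failure_limit)
{
    s->failure_count++;
    if (s->failure_count >= failure_limit) {
        s->ejected = 1;          /* take the node out of the ring */
        s->failure_count = 0;
    }
}

static void
on_server_success(struct srv_state *s)
{
    s->failure_count = 0;        /* only a success clears the count */
}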

hrishimantri avatar Jun 17 '13 17:06 hrishimantri

@charsyam - any updates on my previous comment? Did you get a chance to look into Testcase 2?

hrishimantri avatar Jun 19 '13 16:06 hrishimantri

@hrishimantri :) Thank you, but these days I'm very busy with personal business, sorry. I will check it again as soon as possible, maybe this weekend. Thank you for your kind report. :)

charsyam avatar Jun 20 '13 00:06 charsyam

@hrishimantri Hi. Thank you for your kind report; it helped me find the cause, and I fixed it. This fix was possible thanks to your effort.

Thank you.

charsyam avatar Jun 25 '13 12:06 charsyam

@manjuraj could you apply this patch to the twemproxy_heartbeat branch? Thank you.

charsyam avatar Jun 25 '13 12:06 charsyam

@manjuraj can you please apply this patch to the twemproxy_heartbeat branch?

hrishimantri avatar Jun 26 '13 16:06 hrishimantri

done - https://github.com/twitter/twemproxy/commit/049c34b3de8c23f5739f66aee7ad6924549bed18

manjuraj avatar Jun 26 '13 18:06 manjuraj

@manjuraj Thank you :)

charsyam avatar Jun 26 '13 23:06 charsyam