twemproxy
handle max open file limit reached gracefully
When we reach the max open file limit, nutcracker's accept() loop errors out:
See: https://github.com/twitter/twemproxy/blob/master/src/nc_proxy.c#L291
We should handle this scenario gracefully.
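For illustration, one common pattern for surviving EMFILE/ENFILE in an accept loop is to hold a spare descriptor that can be sacrificed to drain the pending connection instead of treating the error as fatal. This is only a sketch of that generic technique, not twemproxy's actual code; the names (reserve_fd, accept_gracefully) are made up for the example.

/* Sketch: keep the event loop alive when accept() hits EMFILE/ENFILE. */
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

static int reserve_fd = -1;              /* spare fd opened at startup */

void
reserve_fd_init(void)
{
    reserve_fd = open("/dev/null", O_RDONLY);
}

int
accept_gracefully(int listen_sd)
{
    int sd = accept(listen_sd, NULL, NULL);

    if (sd >= 0) {
        return sd;
    }

    if (errno == EMFILE || errno == ENFILE) {
        /* Out of descriptors: free the spare, accept the pending
         * connection, drop it immediately, then re-arm the spare.
         * This drains the backlog instead of erroring out. */
        close(reserve_fd);
        sd = accept(listen_sd, NULL, NULL);
        if (sd >= 0) {
            close(sd);                   /* reject this client */
        }
        reserve_fd = open("/dev/null", O_RDONLY);
        return -1;                       /* caller keeps looping */
    }

    return -1;                           /* other errors handled as before */
}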
from @jokea
Cannot reopen this issue, so I'm commenting here. This issue should be solved on the server side by rejecting any new connections when the max open files limit is reached; existing connections should keep working as usual.
Currently, if the limit is reached, the server becomes completely dead, blocking in an epoll_wait call.
# strace -p 24027
Process 24027 attached - interrupt to quit
epoll_wait(6, ^C
Process 24027 detached
# gdb -p 24027
GNU gdb Fedora (6.8-37.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Attaching to process 24027
Reading symbols from /usr/local/nutcracker/nutcracker...(no debugging symbols found)...done.
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
[New Thread 0x7f705a9886e0 (LWP 24027)]
[New Thread 0x423cf940 (LWP 24028)]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libc.so.6...
(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
(no debugging symbols found)
0x000000321b6d4018 in epoll_wait () from /lib64/libc.so.6
(gdb) bt
#0 0x000000321b6d4018 in epoll_wait () from /lib64/libc.so.6
#1 0x0000000000409427 in event_wait ()
#2 0x00000000004051a3 in core_loop ()
#3 0x000000000040ee6d in main ()
(gdb)
We run into this at Tumblr. Our workaround is less than elegant.
@bmatheny how do you handle this at tumblr?
Embarrassingly :) This isn't quite as simple a fix as 'refuse connections if you run out of file descriptors'. The way we run twemproxy, this condition should be close to impossible in the general case, yet it happens with some regularity (several times a week across about a thousand instances). When we dug into this, tracking down what was actually causing the FD leak became time-consuming, so we opted for a 'simple' fix.
We already use monit to restart twemproxy when the twem config changes, so we just set up monit to also look for this out-of-FD condition and restart in that case as well.
Hi, we are running into this condition. Just wondering if there has been any update since the last conversation? Is restarting the only solution?
@harishd one workaround is to increase the default file descriptor limit. See http://www.cyberciti.biz/faq/linux-increase-the-maximum-number-of-open-files/
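If raising the limit from inside the process is ever an option, a generic sketch using setrlimit() looks roughly like the following. This is an illustration only; the thread does not establish that twemproxy raises its own limit this way.

/* Sketch: raise the soft RLIMIT_NOFILE up to the hard limit at startup. */
#include <stdio.h>
#include <sys/resource.h>

int
raise_nofile_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        return -1;
    }

    rl.rlim_cur = rl.rlim_max;           /* soft limit up to the hard limit */

    if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
        return -1;
    }

    printf("nofile limit: soft=%llu hard=%llu\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);
    return 0;
}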
Thanks Manju! As of now I have already changed my limits.conf; I have the hard/soft limits set to 60000. So this means the server is probably running out of them anyway (>60000). Do you recommend adding more servers at this stage, or are there any recommendations on the max hard/soft limits for twemproxy?
@harishd you probably want twemproxy to be deployed as a local proxy (see slide 4 here: https://speakerdeck.com/justinmares/twemproxy). Deploying it as a local proxy has better fault-tolerance semantics and doesn't run into the file descriptor problem.
Actually, I think this may be a bug in twemproxy. My understanding is that open file descriptors should be on the order of: (pool count * client count) + (pool count * backend count * open connections to each backend). We found occasions where, with ~8 pools, 50 client connections per pool, 15 backends per pool, and 20 connections per backend, we would run out of file descriptors with a limit of 8192. The upper bound on open file descriptors with that config should have been ~2800. Is my assumption about the FD count incorrect?
@bmatheny You are right. The FD count is:
x = (pool_count * client_connections_per_pool) + (pool_count * backends_per_pool * server_connections)
Usually we want to set the file descriptor limit to 2x because of the TIME_WAIT lingering on a close -- https://github.com/twitter/twemproxy/blob/master/notes/socket.txt#L61-L87
But even with the 2x factor, 2800 * 2 = 5600 < 8192, so I'm not sure why we ran out of the 8192 limit. We should be able to debug this scenario easily with a test program, lowering the file descriptor limit to a really low value ($ ulimit -n 32) to hit this case.
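To make the arithmetic above explicit, here is a tiny throwaway program plugging in the numbers from this thread (8 pools, 50 clients/pool, 15 backends/pool, 20 connections per backend); it is only a worked example of the formula, not part of twemproxy.

/* Worked example of the FD estimate above. */
#include <stdio.h>

int
main(void)
{
    int pool_count = 8;
    int client_connections_per_pool = 50;
    int backends_per_pool = 15;
    int server_connections = 20;

    int x = (pool_count * client_connections_per_pool) +
            (pool_count * backends_per_pool * server_connections);

    printf("estimated fds: %d\n", x);            /* 400 + 2400 = 2800 */
    printf("suggested limit (2x): %d\n", 2 * x); /* 5600, well under 8192 */
    return 0;
}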
Just a side note: be aware that /etc/security/limits.conf is used only for interactive user sessions. It does not change the file descriptor limit for daemons. You can verify the actual limits of a running process in /proc/PID/limits.
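The same information can be queried from inside the process, which complements checking /proc/PID/limits from the outside. A minimal generic sketch (not twemproxy code):

/* Print the file descriptor limit the current process actually has,
 * i.e. the "Max open files" row from /proc/PID/limits. */
#include <stdio.h>
#include <sys/resource.h>

int
main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }

    printf("RLIMIT_NOFILE: soft=%llu hard=%llu\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);
    return 0;
}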
twemproxy is great work, but this issue bothers us. We found a simple way to fix it: https://github.com/allenlz/twemproxy/commit/cb7bbf39ff1a700682e4c3cd0f25008b15dd307d
I've updated the fix in PR #232, which maintains a safe maximum number of client connections. Any comments are welcome.
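The idea as described is to keep the number of accepted client connections below a ceiling derived from the descriptor limit. A rough sketch of that approach follows; all names here (curr_client_conns, max_client_conns, proxy_accept_capped) are illustrative and are not the PR's actual code.

/* Sketch: cap client connections so accept() never exhausts descriptors. */
#include <unistd.h>
#include <sys/socket.h>

static unsigned int curr_client_conns;   /* accepted and still open */
static unsigned int max_client_conns;    /* derived at startup from
                                            RLIMIT_NOFILE minus server
                                            connections, epoll fd, etc. */

int
proxy_accept_capped(int listen_sd)
{
    int sd;

    if (curr_client_conns >= max_client_conns) {
        /* At the ceiling: accept and immediately close so the kernel
         * backlog drains, instead of running out of descriptors and
         * wedging the event loop. */
        sd = accept(listen_sd, NULL, NULL);
        if (sd >= 0) {
            close(sd);
        }
        return -1;
    }

    sd = accept(listen_sd, NULL, NULL);
    if (sd < 0) {
        return -1;
    }

    curr_client_conns++;                 /* decremented in the close path */
    return sd;
}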
I'm also encountering this issue. It is reproducible using redis-benchmark -p 22121 -d 100 -c 1500 -r 20 -l -t set,get (http://redis.io/topics/benchmarks). With too many simultaneous connections, Twemproxy gets into a state where it refuses connections permanently, even after the benchmark has ended (redis-cli no longer connects). @allenlz's fix works for me; Twemproxy recovers and continues accepting connections.
It still doesn't accept 1500 connections though, despite changing the ulimit.
Not a maintainer, but it may be possible to close this issue now that the linked PR was merged in 2014 (unless other things were planned)?