twemproxy
handle max open file limit reached gracefully
When we reach the max open file limit, nutcracker's accept() loop errors out:
See: https://github.com/twitter/twemproxy/blob/master/src/nc_proxy.c#L291
We should handle this scenario gracefully.
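For illustration, one common pattern for surviving EMFILE/ENFILE in an accept loop is to hold a spare descriptor that can be sacrificed to drain the pending connection instead of treating the error as fatal. This is only a sketch of that generic technique, not twemproxy's actual code; the names (reserve_fd, accept_gracefully) are made up for the example.

/* Sketch: keep the event loop alive when accept() hits EMFILE/ENFILE. */
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

static int reserve_fd = -1;              /* spare fd opened at startup */

void
reserve_fd_init(void)
{
    reserve_fd = open("/dev/null", O_RDONLY);
}

int
accept_gracefully(int listen_sd)
{
    int sd = accept(listen_sd, NULL, NULL);

    if (sd >= 0) {
        return sd;
    }

    if (errno == EMFILE || errno == ENFILE) {
        /* Out of descriptors: free the spare, accept the pending
         * connection, drop it immediately, then re-arm the spare.
         * This drains the backlog instead of erroring out. */
        close(reserve_fd);
        sd = accept(listen_sd, NULL, NULL);
        if (sd >= 0) {
            close(sd);                   /* reject this client */
        }
        reserve_fd = open("/dev/null", O_RDONLY);
        return -1;                       /* caller keeps looping */
    }

    return -1;                           /* other errors handled as before */
}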
from @jokea
Cannot reopen this issue, so I'm commenting here. This issue should be solved on the server side by rejecting any new connections when the max open files limit is reached; existing connections should keep working as usual.
Currently, if the limit is reached, the server becomes completely dead, blocking in an epoll_wait call.
# strace -p 24027
Process 24027 attached - interrupt to quit
epoll_wait(6, ^C
Process 24027 detached
# gdb -p 24027
GNU gdb Fedora (6.8-37.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Attaching to process 24027
Reading symbols from /usr/local/nutcracker/nutcracker...(no debugging symbols found)...done.
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
[New Thread 0x7f705a9886e0 (LWP 24027)]
[New Thread 0x423cf940 (LWP 24028)]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libc.so.6...
(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
(no debugging symbols found)
0x000000321b6d4018 in epoll_wait () from /lib64/libc.so.6
(gdb) bt
#0 0x000000321b6d4018 in epoll_wait () from /lib64/libc.so.6
#1 0x0000000000409427 in event_wait ()
#2 0x00000000004051a3 in core_loop ()
#3 0x000000000040ee6d in main ()
(gdb)
We run into this at Tumblr. Our workaround is less than elegant.
@bmatheny how do you handle this at tumblr?
Embarrassingly :) This isn't quite as simple a fix as 'refuse connections if you run out of file descriptors'. The way we run twemproxy, this condition should be close to impossible in the general case, yet it happens with some regularity (several times a week across about a thousand instances). When we dug into this, tracking down what was actually causing the FD leak became time-consuming, so we opted for a 'simple' fix.
We already use monit to restart twemproxy when the twem config changes, so we just set up monit to also look for this out-of-FD condition and restart in that case as well.
Hi, we are running into this condition. Just wondering if there has been any update since the last conversation? Is restarting the only solution?
@harishd one workaround is to increase the default file descriptor limit. See http://www.cyberciti.biz/faq/linux-increase-the-maximum-number-of-open-files/
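If raising the limit from inside the process is ever an option, a generic sketch using setrlimit() looks roughly like the following. This is an illustration only; the thread does not establish that twemproxy raises its own limit this way.

/* Sketch: raise the soft RLIMIT_NOFILE up to the hard limit at startup. */
#include <stdio.h>
#include <sys/resource.h>

int
raise_nofile_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        return -1;
    }

    rl.rlim_cur = rl.rlim_max;           /* soft limit up to the hard limit */

    if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
        return -1;
    }

    printf("nofile limit: soft=%llu hard=%llu\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);
    return 0;
}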
Thanks Manju! As of now I have already changed my limits.conf; I have the hard/soft limits set to 60000. So this means the server is probably running out of them anyway (>60000). Do you recommend adding more servers at this stage, or are there any recommendations on the max hard/soft limits for twemproxy?
@harishd you probably want twemproxy to be deployed as a local proxy (see slide 4 here: https://speakerdeck.com/justinmares/twemproxy). Deploying it as a local proxy has better fault-tolerance semantics and doesn't run into the file descriptor problem.
Actually, I think this may be a bug in twemproxy. My understanding is that open file descriptors should be on the order of: (pool count * client count) + (pool count * backend count * open connections to each backend). We found occasions where, with ~8 pools, 50 client connections per pool, 15 backends per pool, and 20 connections per backend, we would run out of file descriptors with a limit of 8192. The upper bound on open file descriptors with that config should have been ~2800. Is my assumption about the FD count incorrect?
@bmatheny You are right. The FD count is:
x = (pool_count * client_connections_per_pool) + (pool_count * backends_per_pool * server_connections)
Usually we want to set the file descriptor limit to 2x because of the TIME_WAIT lingering on a close -- https://github.com/twitter/twemproxy/blob/master/notes/socket.txt#L61-L87
But even with the 2x factor, 2800 * 2 = 5600 < 8192, so I'm not sure why we ran out of the 8192 limit. We should be able to debug this scenario easily with a test program, lowering the file descriptor limit to a really low value ($ ulimit -n 32) to hit this case.
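To make the arithmetic above explicit, here is a tiny throwaway program plugging in the numbers from this thread (8 pools, 50 clients/pool, 15 backends/pool, 20 connections per backend); it is only a worked example of the formula, not part of twemproxy.

/* Worked example of the FD estimate above. */
#include <stdio.h>

int
main(void)
{
    int pool_count = 8;
    int client_connections_per_pool = 50;
    int backends_per_pool = 15;
    int server_connections = 20;

    int x = (pool_count * client_connections_per_pool) +
            (pool_count * backends_per_pool * server_connections);

    printf("estimated fds: %d\n", x);            /* 400 + 2400 = 2800 */
    printf("suggested limit (2x): %d\n", 2 * x); /* 5600, well under 8192 */
    return 0;
}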
Just a side note: be aware that /etc/security/limits.conf is used only for interactive user sessions. It does not change the file descriptor limit for daemons. You can verify the actual limits of a running process in /proc/PID/limits.
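The same information can be queried from inside the process, which complements checking /proc/PID/limits from the outside. A minimal generic sketch (not twemproxy code):

/* Print the file descriptor limit the current process actually has,
 * i.e. the "Max open files" row from /proc/PID/limits. */
#include <stdio.h>
#include <sys/resource.h>

int
main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }

    printf("RLIMIT_NOFILE: soft=%llu hard=%llu\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);
    return 0;
}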
twemproxy is great work, but this issue bothers us. We found a simple way to fix it: https://github.com/allenlz/twemproxy/commit/cb7bbf39ff1a700682e4c3cd0f25008b15dd307d
I've updated the fix in PR #232, which maintains a safe maximum number of client connections. Any comments are welcome.
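The idea as described is to keep the number of accepted client connections below a ceiling derived from the descriptor limit. A rough sketch of that approach follows; all names here (curr_client_conns, max_client_conns, proxy_accept_capped) are illustrative and are not the PR's actual code.

/* Sketch: cap client connections so accept() never exhausts descriptors. */
#include <unistd.h>
#include <sys/socket.h>

static unsigned int curr_client_conns;   /* accepted and still open */
static unsigned int max_client_conns;    /* derived at startup from
                                            RLIMIT_NOFILE minus server
                                            connections, epoll fd, etc. */

int
proxy_accept_capped(int listen_sd)
{
    int sd;

    if (curr_client_conns >= max_client_conns) {
        /* At the ceiling: accept and immediately close so the kernel
         * backlog drains, instead of running out of descriptors and
         * wedging the event loop. */
        sd = accept(listen_sd, NULL, NULL);
        if (sd >= 0) {
            close(sd);
        }
        return -1;
    }

    sd = accept(listen_sd, NULL, NULL);
    if (sd < 0) {
        return -1;
    }

    curr_client_conns++;                 /* decremented in the close path */
    return sd;
}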
I'm also encountering this issue. It is reproducible using redis-benchmark -p 22121 -d 100 -c 1500 -r 20 -l -t set,get (http://redis.io/topics/benchmarks). With too many simultaneous connections, Twemproxy gets into a state where it refuses connections permanently, even after the benchmark has ended (redis-cli no longer connects). @allenlz's fix works for me; Twemproxy recovers and continues accepting connections.
It still doesn't accept 1500 connections though, despite changing the ulimit.
Not a maintainer, but it may be possible to close this issue now that the linked PR was merged in 2014 (unless other things were planned)?