mod_websocket
Segfault issue after connected with 500+ connections
Recently I was testing lighttpd + mod_websocket. However, when I tried to stress-test it by connecting 1000 clients, it threw a segfault at around the 500th connection:
lighttpd[2040]: segfault at 86 ip 00007f3f49b8b836 sp 00007fff79bd1570 error 6 in mod_websocket.so[7f3f49b84000+b000]
Has anyone experienced this issue before?
Hi nicktee89,
I investigated and reproduced this.
(lldb) r -f ../etc/lighttpd.conf -m ../lib/ -D
Process 23055 launched: './lighttpd' (x86_64)
lighttpd(23055,0x7fff7eca9310) malloc: *** error for object 0x10070: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug
Process 23055 stopped
* thread #1: tid = 0x333e21, 0x00007fff928c8866 libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
    frame #0: 0x00007fff928c8866 libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`__pthread_kill + 10:
-> 0x7fff928c8866: jae 0x7fff928c8870 ; __pthread_kill + 20
   0x7fff928c8868: movq %rax, %rdi
   0x7fff928c886b: jmp 0x7fff928c5175 ; cerror_nocancel
   0x7fff928c8870: retq
(lldb) bt
* thread #1: tid = 0x333e21, 0x00007fff928c8866 libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x00007fff928c8866 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff98de535c libsystem_pthread.dylib`pthread_kill + 92
    frame #2: 0x00007fff963d9b1a libsystem_c.dylib`abort + 125
    frame #3: 0x00007fff9683307f libsystem_malloc.dylib`free + 411
    frame #4: 0x00007fff9563f792 libsystem_info.dylib`freeaddrinfo + 33
    frame #5: 0x00000001000e1b1a mod_websocket.so`mod_websocket_connect(host=, service=) + 650 at mod_websocket_socket.c:98
    frame #6: 0x00000001000e6069 mod_websocket.so`mod_websocket_handle_subrequest [inlined] connect_backend + 90 at mod_websocket.c:153
    frame #7: 0x00000001000e600f mod_websocket.so`mod_websocket_handle_subrequest(srv=0x0000000100801200, con=0x00000001002d1bc0, p_d=) + 2991 at mod_websocket.c:581
    frame #8: 0x000000010002081a liblightcomp.dylib`plugins_call_handle_subrequest(srv=0x0000000100801200, con=0x00000001002d1bc0) + 90 at plugin.c:272
    frame #9: 0x0000000100004812 lighttpd`http_response_prepare(srv=0x0000000100801200, con=0x00000001002d1bc0) + 3554 at response.c:765
    frame #10: 0x0000000100005855 lighttpd`connection_state_machine(srv=0x0000000100801200, con=) + 789 at connections.c:1430
    frame #11: 0x00000001000078ee lighttpd`network_server_handle_fdevent(srv=0x0000000100801200, context=0x0000000100115170, revents=) + 78 at network.c:72
    frame #12: 0x0000000100002f02 lighttpd`main(argc=, argv=) + 7298 at server.c:1489
    frame #13: 0x00007fff95a405fd libdyld.dylib`start + 1
    frame #14: 0x00007fff95a405fd libdyld.dylib`start + 1
The code at mod_websocket_socket.c:98 is freeaddrinfo(res), and res is allocated by getaddrinfo at mod_websocket_socket.c:37.
This occurs only when building with the -O2 option. (With -O0 I can't reproduce the issue, which means I can't investigate it in more detail.) I'll continue to investigate, but I have no idea how to fix it right now.
If anyone has any ideas, please let me know.
I found the cause of this issue. Please wait a while. (But this seems to be a bug in libc/glibc. I'll file a bug at https://sourceware.org/bugzilla/ )
On Oct 1, 2014, at 8:12 PM, Norio Kobota wrote:
> I found the cause of this issue.
Could you please share some details about the cause and the libc bug?
Thanks, Phil
According to https://sourceware.org/bugzilla/show_bug.cgi?id=10352
mod_websocket uses 'select' when connecting to a backend server, and uses 'FD_SET' to check the connect timeout at mod_websocket_socket.c:63. When the fd is FD_SETSIZE (1024 by default) or above, FD_SET causes a buffer overflow. Certainly, I can fix this issue by checking the fd against FD_SETSIZE, but I think this is a bug in FD_SET.
For example, below are Mac OS's FD_SET macro and the typedef of the fd_set structure.
/usr/include/sys/_types/_fd_def.h
typedef struct fd_set {
__int32_t fds_bits[__DARWIN_howmany(__DARWIN_FD_SETSIZE, __DARWIN_NFDBITS)];
} fd_set;
#define __DARWIN_FD_SET(n, p) do { int __fd = (n); ((p)->fds_bits[(unsigned long)__fd/__DARWIN_NFDBITS] |= ((__int32_t)(1<<((unsigned long)__fd % __DARWIN_NFDBITS)))); } while(0)
__DARWIN_howmany(__DARWIN_FD_SETSIZE, __DARWIN_NFDBITS) returns 32 by default, so the valid indices into fds_bits are 0 to 31. But (unsigned long)__fd/__DARWIN_NFDBITS is 32 or more when fd >= 1024, which is past the end of the array.
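The arithmetic above can be checked with a small sketch. The two constants are assumptions taken from the defaults mentioned in this thread (1024 descriptors, 32-bit words), and the helper names are hypothetical, not part of any system header.

```c
#include <stddef.h>

/* Assumed defaults from the discussion above, not read from system headers. */
#define SETSIZE_ASSUMED 1024  /* default __DARWIN_FD_SETSIZE */
#define NFDBITS_ASSUMED 32    /* bits in one __int32_t word of fds_bits */

/* Number of words in fds_bits: SETSIZE / NFDBITS, rounded up. */
static size_t fds_words(void)
{
    return (SETSIZE_ASSUMED + NFDBITS_ASSUMED - 1) / NFDBITS_ASSUMED;
}

/* Index of the word that FD_SET would write to for a non-negative fd. */
static size_t fds_word_index(int fd)
{
    return (size_t)fd / NFDBITS_ASSUMED;
}

/* Nonzero if FD_SET(fd, ...) would stay inside fds_bits. */
static int fd_in_bounds(int fd)
{
    return fd >= 0 && fds_word_index(fd) < fds_words();
}
```

With these assumptions, fd 1023 maps to word 31 (the last valid one), while fd 1024 maps to word 32, one past the end.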
The FD_SET macro on Linux (CentOS etc.) is implemented the same way as Mac OS's. (But I checked only CentOS 6.5.)
I'm a bit busy now and my English is not good, so I'd be glad if someone could report this.
Ah, sorry. I didn't know about the following.
man 2 select ( http://linux.die.net/man/2/select ) An fd_set is a fixed size buffer. Executing FD_CLR() or FD_SET() with a value of fd that is negative or is equal to or larger than FD_SETSIZE will result in undefined behavior. Moreover, POSIX requires fd to be a valid file descriptor.
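Given that rule, a caller that still uses select has to check the descriptor before calling FD_SET. A minimal sketch of such a guard follows; `fd_set_checked` is a hypothetical name used only for illustration and is not part of mod_websocket.

```c
#include <sys/select.h>

/* Hypothetical helper: add fd to set only if it fits in an fd_set.
 * Returns 0 on success, -1 if fd is negative or >= FD_SETSIZE. */
static int fd_set_checked(int fd, fd_set *set)
{
    if (fd < 0 || fd >= FD_SETSIZE) {
        return -1;  /* calling FD_SET here would be undefined behavior */
    }
    FD_SET(fd, set);
    return 0;
}
```

A guard like this only turns the crash into an error return. Descriptors at or above FD_SETSIZE still cannot be watched with select.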
Although I still think this is a bug in libc (e.g. Windows's FD_SET is implemented differently), I'll fix this issue by not using 'select'.
I was able to fix this by using the modulus operator.
http://stackoverflow.com/questions/7976388/increasing-limit-of-fd-setsize-and-select
P.S. 'FD_SET' and 'select' on Windows differ from Linux/BSD, so this fix cannot be applied to Windows systems.
On Oct 2, 2014, at 8:24 AM, Norio Kobota @nori0428 wrote:
Closed #40 via 62d07f0.
Using modulus avoids the invalid memory reference, but doesn't address the underlying problem. When the program wants to select socket #1025, you will be telling select to look at socket #1. When input or output appears for socket #1025, select() will not care or return.
To access more than FD_SETSIZE sockets, you'll need to move to using poll(2) (or kevent(2)).
Thanks, Phil
Thank you as always, Phil,
> Using modulus avoids the invalid memory reference, but doesn't address the underlying problem.
Ah ... OK. I understand this. But in this case, I thought there would be no problem, because mod_websocket uses 'select' only to check whether the socket is connected or not.
But when socket #1 is not connected and #1025 is connected, this does cause a problem.
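The collision described here can be shown with a tiny sketch. It assumes the workaround maps a descriptor to `fd % FD_SETSIZE` (the actual commit is not shown in this thread), and both helper names are hypothetical.

```c
#include <sys/select.h>

/* Slot that a modulus-based workaround would use for fd (assumed mapping). */
static int aliased_slot(int fd)
{
    return fd % FD_SETSIZE;
}

/* Nonzero if two distinct descriptors would share the same fd_set bit. */
static int slots_collide(int fd_a, int fd_b)
{
    return fd_a != fd_b && aliased_slot(fd_a) == aliased_slot(fd_b);
}
```

With FD_SETSIZE at 1024, descriptors 1 and 1025 collide, so select would report the state of #1 when the caller meant #1025.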
> To access more than FD_SETSIZE sockets, you'll need to move to using poll(2) (or kevent(2)).
Thanks... But I don't know whether old systems support poll or kevent, and I don't want to use #ifdef to switch between OSes.
Anyway, I'll fix this by using poll and keep this issue open. Please contact me again if there is a problem.
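For reference, a minimal sketch of what a poll-based connect check could look like. It assumes the socket is non-blocking and connect() has already been started on it. `wait_connected` is a hypothetical name, and this is not the actual mod_websocket change.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

/* Hypothetical helper: wait up to timeout_ms for a non-blocking connect()
 * on fd to finish. Returns 0 if connected, -1 on timeout or error.
 * poll takes the descriptor by value, so FD_SETSIZE does not apply. */
static int wait_connected(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int rc;
    int err = 0;
    socklen_t len = sizeof(err);

    pfd.fd = fd;
    pfd.events = POLLOUT;   /* the socket becomes writable when connect ends */
    pfd.revents = 0;

    do {
        /* Note: a retry after EINTR restarts the full timeout. */
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc <= 0) {
        return -1;          /* timeout (0) or poll error (< 0) */
    }
    /* Writable does not mean success: ask the socket for the final result. */
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0 || err != 0) {
        return -1;
    }
    return 0;
}
```

poll(2) is specified by POSIX, so a single code path like this may avoid per-OS #ifdef on Linux and BSD/Mac OS. Windows would still need separate handling.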