mod_websocket

Segfault issue after connected with 500+ connections

Open nicktee89 opened this issue 10 years ago • 9 comments

Recently I was testing lighttpd + mod_websocket. However, when I stress-tested it by connecting 1000 clients, it threw a segfault somewhere past the 500th connection: lighttpd[2040]: segfault at 86 ip 00007f3f49b8b836 sp 00007fff79bd1570 error 6 in mod_websocket.so[7f3f49b84000+b000]

Has anyone experienced this issue before?

nicktee89 avatar Sep 17 '14 07:09 nicktee89

Hi nicktee89,

I investigated and reproduced this.

(lldb) r -f ../etc/lighttpd.conf -m ../lib/ -D
Process 23055 launched: './lighttpd' (x86_64)
lighttpd(23055,0x7fff7eca9310) malloc: *** error for object 0x10070: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug
Process 23055 stopped
* thread #1: tid = 0x333e21, 0x00007fff928c8866 libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
    frame #0: 0x00007fff928c8866 libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`__pthread_kill + 10:
-> 0x7fff928c8866:  jae    0x7fff928c8870            ; __pthread_kill + 20
   0x7fff928c8868:  movq   %rax, %rdi
   0x7fff928c886b:  jmp    0x7fff928c5175            ; cerror_nocancel
   0x7fff928c8870:  retq   
(lldb) bt
* thread #1: tid = 0x333e21, 0x00007fff928c8866 libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x00007fff928c8866 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff98de535c libsystem_pthread.dylib`pthread_kill + 92
    frame #2: 0x00007fff963d9b1a libsystem_c.dylib`abort + 125
    frame #3: 0x00007fff9683307f libsystem_malloc.dylib`free + 411
    frame #4: 0x00007fff9563f792 libsystem_info.dylib`freeaddrinfo + 33
    frame #5: 0x00000001000e1b1a mod_websocket.so`mod_websocket_connect(host=, service=) + 650 at mod_websocket_socket.c:98
    frame #6: 0x00000001000e6069 mod_websocket.so`mod_websocket_handle_subrequest [inlined] connect_backend + 90 at mod_websocket.c:153
    frame #7: 0x00000001000e600f mod_websocket.so`mod_websocket_handle_subrequest(srv=0x0000000100801200, con=0x00000001002d1bc0, p_d=) + 2991 at mod_websocket.c:581
    frame #8: 0x000000010002081a liblightcomp.dylib`plugins_call_handle_subrequest(srv=0x0000000100801200, con=0x00000001002d1bc0) + 90 at plugin.c:272
    frame #9: 0x0000000100004812 lighttpd`http_response_prepare(srv=0x0000000100801200, con=0x00000001002d1bc0) + 3554 at response.c:765
    frame #10: 0x0000000100005855 lighttpd`connection_state_machine(srv=0x0000000100801200, con=) + 789 at connections.c:1430
    frame #11: 0x00000001000078ee lighttpd`network_server_handle_fdevent(srv=0x0000000100801200, context=0x0000000100115170, revents=) + 78 at network.c:72
    frame #12: 0x0000000100002f02 lighttpd`main(argc=, argv=) + 7298 at server.c:1489
    frame #13: 0x00007fff95a405fd libdyld.dylib`start + 1
    frame #14: 0x00007fff95a405fd libdyld.dylib`start + 1

The code at mod_websocket_socket.c:98 is freeaddrinfo(res), and res is allocated by getaddrinfo at mod_websocket_socket.c:37.

This occurs only when building with the -O2 option. (When I tried with -O0, I couldn't reproduce the issue, which means I can't investigate it in more detail.) I'll continue to investigate, but I have no idea how to fix this right now.

If anyone has any idea, please let me know.

nori0428 avatar Oct 01 '14 03:10 nori0428

I found the cause of this issue. Please wait a while. (But this seems to be a bug in libc/glibc. I'll file a bug at https://sourceware.org/bugzilla/ )

nori0428 avatar Oct 02 '14 00:10 nori0428

On Oct 1, 2014, at 8:12 PM, Norio Kobota wrote:

I found the cause of this issue.

Could you please share some details about the cause and the libc bug?

Thanks, Phil

philshafer avatar Oct 02 '14 00:10 philshafer

According to https://sourceware.org/bugzilla/show_bug.cgi?id=10352

mod_websocket uses 'select' to connect to a backend server and uses 'FD_SET' to check for the connect timeout at mod_websocket_socket.c:63. When fd is 1024 or above, FD_SET causes a buffer overflow. Certainly, I can fix this issue by using FD_SETSIZE, but I think that this is FD_SET's bug.

For example, below are Mac OS's FD_SET macro and the typedef of the fd_set structure.

/usr/include/sys/_types/_fd_def.h

typedef struct fd_set {
    __int32_t   fds_bits[__DARWIN_howmany(__DARWIN_FD_SETSIZE, __DARWIN_NFDBITS)];
} fd_set;

#define __DARWIN_FD_SET(n, p)   do { int __fd = (n); ((p)->fds_bits[(unsigned long)__fd/__DARWIN_NFDBITS] |= ((__int32_t)(1<<((unsigned long)__fd % __DARWIN_NFDBITS)))); } while(0)

__DARWIN_howmany(__DARWIN_FD_SETSIZE, __DARWIN_NFDBITS) returns 32 by default, so the valid indexes of fds_bits are 0 to 31. But (unsigned long)__fd/__DARWIN_NFDBITS returns 32 or more when fd >= 1024, which is past the end of the array.
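The index arithmetic described above can be sketched in a few lines of C. This is only an illustration: the constants ASSUMED_FD_SETSIZE and ASSUMED_NFDBITS and the three helper functions are hypothetical names, and the values 1024 and 32 are taken from the defaults discussed in this comment, not read from any system header.

```c
#include <assert.h>

/* Assumed defaults from the discussion above: a 1024-descriptor fd_set
 * stored in 32-bit words. Real headers may use other values. */
#define ASSUMED_FD_SETSIZE 1024
#define ASSUMED_NFDBITS    32

/* Number of words in fds_bits (what the howmany macro computes). */
int fd_set_words(void)
{
    return (ASSUMED_FD_SETSIZE + ASSUMED_NFDBITS - 1) / ASSUMED_NFDBITS;
}

/* Index of the word that FD_SET would modify for fd. */
int fd_word_index(int fd)
{
    assert(fd >= 0);
    return fd / ASSUMED_NFDBITS;
}

/* Returns 1 when the word for fd lies inside fds_bits, 0 otherwise. */
int fd_index_in_bounds(int fd)
{
    return fd >= 0 && fd_word_index(fd) < fd_set_words();
}
```

With these assumed values, fd 1023 maps to word 31 (the last valid one), while fd 1024 maps to word 32, one past the end of the array.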

Linux's FD_SET macro (CentOS etc.) is the same as the Mac OS implementation. (But I checked only CentOS 6.5.)

I'm a bit busy now and I'm not good at English, so I'd be glad if someone else reported it.

nori0428 avatar Oct 02 '14 01:10 nori0428

Ah, sorry. I didn't know the following.

man 2 select ( http://linux.die.net/man/2/select ): "An fd_set is a fixed size buffer. Executing FD_CLR() or FD_SET() with a value of fd that is negative or is equal to or larger than FD_SETSIZE will result in undefined behavior. Moreover, POSIX requires fd to be a valid file descriptor."
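Given that rule from the man page, a caller that keeps using select has to reject out-of-range descriptors before calling FD_SET. A minimal sketch follows; safe_fd_set is a hypothetical helper name and is not part of mod_websocket.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>

/* Hypothetical helper: add fd to set only when the man page allows it.
 * Returns 0 on success, -1 when fd cannot be stored in an fd_set. */
int safe_fd_set(int fd, fd_set *set)
{
    assert(set != NULL);
    if (fd < 0 || fd >= FD_SETSIZE) {
        return -1;  /* FD_SET would be undefined behavior here */
    }
    FD_SET(fd, set);
    return 0;
}
```

This guard only prevents the memory corruption. A descriptor at or above FD_SETSIZE still cannot be watched with select at all.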

Although I still think this is a libc bug (e.g. Windows's FD_SET has a different implementation), I'll fix this issue by not using 'select'.

nori0428 avatar Oct 02 '14 08:10 nori0428

I was able to fix this by using the modulus operator.

http://stackoverflow.com/questions/7976388/increasing-limit-of-fd-setsize-and-select

nori0428 avatar Oct 02 '14 12:10 nori0428

p.s. 'FD_SET' and 'select' on Windows differ from Linux/BSD, so this fix cannot be applied to Windows systems.

nori0428 avatar Oct 02 '14 12:10 nori0428

On Oct 2, 2014, at 8:24 AM, Norio Kobota @nori0428 wrote:

Closed #40 via 62d07f0.

Using modulus avoids the invalid memory reference, but doesn't address the underlying problem. When the program wants to select socket #1025, you will be telling select to look at socket #1. When input or output appears for socket #1025, select() will not care or return.
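The aliasing can be shown with a short sketch. It assumes the modulus workaround reduces the descriptor modulo FD_SETSIZE, which is one reading of the fix and not the code from the commit; wrapped_slot and slots_collide are hypothetical names.

```c
#include <assert.h>
#include <sys/select.h>

/* Slot that a modulus-based workaround would mark for fd.
 * Assumption: the workaround computes fd % FD_SETSIZE. */
int wrapped_slot(int fd)
{
    assert(fd >= 0);
    return fd % FD_SETSIZE;
}

/* Returns 1 when two different descriptors end up in the same slot,
 * so select cannot tell them apart. */
int slots_collide(int a, int b)
{
    return a != b && wrapped_slot(a) == wrapped_slot(b);
}
```

With FD_SETSIZE at 1024, wrapped_slot(1025) is 1, so descriptors 1 and 1025 collide exactly as described.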

To access more than FD_SETSIZE sockets, you'll need to move to using poll(2) (or kevent(2)).
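A rough sketch of what a poll(2)-based connect check could look like is below. It assumes the socket was put in non-blocking mode and connect() has already been started; wait_for_connect is a hypothetical helper, not the function mod_websocket ended up with.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

/* Hypothetical helper: wait until a non-blocking connect() on fd finishes.
 * Returns 0 when connected, -1 on timeout or error (errno is set).
 * poll has no FD_SETSIZE limit, so large descriptor values are fine. */
int wait_for_connect(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int rc;
    int err = 0;
    socklen_t len = sizeof(err);

    assert(timeout_ms >= 0);  /* this sketch does not support infinite waits */

    pfd.fd = fd;
    pfd.events = POLLOUT;     /* a finished connect makes the socket writable */
    pfd.revents = 0;

    /* Retry when interrupted by a signal. For simplicity the full
     * timeout is reused on each retry. */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc == 0) {
        errno = ETIMEDOUT;
        return -1;
    }
    if (rc < 0) {
        return -1;
    }

    /* poll reported the socket; SO_ERROR tells whether connect succeeded. */
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0) {
        return -1;
    }
    if (err != 0) {
        errno = err;
        return -1;
    }
    return 0;
}
```

Because poll takes an array of descriptors instead of a fixed-size bitmap, the descriptor's numeric value no longer matters, only how many descriptors are passed in.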

Thanks, Phil

philshafer avatar Oct 02 '14 17:10 philshafer

Thank you always Phil,

Using modulus avoids the invalid memory reference, but doesn't address the underlying problem.

Ah, OK. I understand. But in this case I thought there would be no problem, because mod_websocket uses 'select' only to check whether the socket is connected or not.

But when socket #1 is not connected and #1025 is connected, this does cause a problem.

To access more than FD_SETSIZE sockets, you'll need to move to using poll(2) (or kevent(2)).

Thanks. But I don't know whether old systems support poll or kevent, and I don't want to use #ifdef to switch between OSes.

Anyway, I'll fix this by using poll and leave this issue open. Please contact me again if there is a problem.

nori0428 avatar Oct 03 '14 00:10 nori0428