lws_context_destroy() closes the same FD twice, causing the process to crash
I have a WSS gateway service using 'libwebsockets-4.3-stable'. We recently received a crash report from a production system. The abort is raised by glibc, which detects an error while another thread is calling getifaddrs. It reports: "Unexpected error 9 on netlink descriptor 19.\n"
#0 0x00007f64a3651aff in raise () from /lib64/libc.so.6
#1 0x00007f64a3624ea5 in abort () from /lib64/libc.so.6
#2 0x00007f64a3694097 in __libc_message () from /lib64/libc.so.6
#3 0x00007f64a369415a in __libc_fatal () from /lib64/libc.so.6
#4 0x00007f64a374fc44 in __netlink_assert_response () from /lib64/libc.so.6
#5 0x00007f64a374c762 in __netlink_request () from /lib64/libc.so.6
#6 0x00007f64a374c901 in getifaddrs_internal () from /lib64/libc.so.6
#7 0x00007f64a374d608 in getifaddrs () from /lib64/libc.so.6
#8 0x00007f64a47ecdd0 in bsd_localinfo (return_result=0x7f649d12a6b8, hints=0x7f649d12a6f0) at su_localinfo.c:1167
#9 su_getlocalinfo (hints=hints@entry=0x7f649d12a7d0, return_localinfo=return_localinfo@entry=0x7f649d12a7c8) at su_localinfo.c:242
#10 0x00007f64a47ca9ea in soa_init_sdp_connection_with_session (ss=ss@entry=0x7f64880603a0, c=0x7f649d12a940, buffer=buffer@entry=0x7f649d12a9a0 "10.10.50.52"
The scenario that triggers this error is as follows: Thread A closes a file descriptor. Thread B calls getifaddrs, which opens a Netlink socket that happens to receive the same descriptor value. Due to a bug, thread A then closes the same file descriptor number again. Normally a second close would be benign (it just fails with EBADF), but because of the concurrent execution it closes the Netlink socket created by glibc instead. Thread B then attempts to use the Netlink socket descriptor and receives EBADF (error 9).
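The descriptor-reuse hazard described above can be reproduced deterministically without threads, because POSIX `open()` returns the lowest unused descriptor number. The following is a minimal sketch, not code from libwebsockets or glibc: the function name `demo_stale_double_close` is hypothetical, and `/dev/null` stands in for both the socket closed by "thread A" and the Netlink socket opened by "thread B".

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical demo: returns the errno observed by the "victim" after a
 * stale second close() hit its descriptor, 0 if the descriptor number was
 * not reused, or -1 if /dev/null could not be opened.
 */
int demo_stale_double_close(void)
{
    int a = open("/dev/null", O_RDONLY);    /* descriptor owned by "thread A" */
    if (a < 0)
        return -1;

    int stale = a;                          /* A keeps a copy of the number */
    close(a);                               /* first, legitimate close */

    int b = open("/dev/null", O_RDONLY);    /* "thread B" opens its own fd */
    if (b < 0)
        return -1;
    if (b != stale) {                       /* number was not recycled */
        close(b);
        return 0;
    }

    close(stale);                           /* buggy second close: closes B's fd */

    char c;
    ssize_t n = read(b, &c, 1);             /* B now uses a dead descriptor */
    return n < 0 ? errno : 0;
}
```

Calling `demo_stale_double_close()` from a small `main` should return `EBADF`, the same error 9 that glibc reports on its Netlink descriptor before aborting.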
Further analysis showed that lws_context_destroy() closes the same FD twice. The following call stacks show FD=19 being closed twice (lines 1856 and 1936 in context.c):
#0 close (fd=19) at co_hook_sys_call.cpp:336
#1 0x00007ff4290bf320 in __lws_close_free_wsi_final (wsi=0x7ff41010e2d0) at /GIT/unimrcp/3rd-libs/libwebsockets-4.3-stable/lib/core-net/close.c:884
#2 0x00007ff4290bf275 in __lws_close_free_wsi (wsi=0x7ff41010e2d0, reason=LWS_CLOSE_STATUS_NOSTATUS_CONTEXT_DESTROY, caller=0x7ff4290ff42f "ctx destroy") at /GIT/unimrcp/3rd-libs/libwebsockets-4.3-stable/lib/core-net/close.c:870
#3 0x00007ff4290bf6fc in lws_close_free_wsi (wsi=0x7ff41010e2d0, reason=LWS_CLOSE_STATUS_NOSTATUS_CONTEXT_DESTROY, caller=0x7ff4290ff42f "ctx destroy") at /GIT/unimrcp/3rd-libs/libwebsockets-4.3-stable/lib/core-net/close.c:1005
#4 0x00007ff4290acfc5 in lws_context_destroy (context=0x7ff410068250) at /GIT/unimrcp/3rd-libs/libwebsockets-4.3-stable/lib/core/context.c:1856

#0 close (fd=19) at co_hook_sys_call.cpp:336
#1 0x00007ff4290a0e3c in lws_plat_pipe_close (wsi=0x7ff421ffa7f0) at /GIT/unimrcp/3rd-libs/libwebsockets-4.3-stable/lib/plat/unix/unix-pipe.c:88
#2 0x00007ff4290acc82 in lws_pt_destroy (pt=0x7ff4100684d0) at /GIT/unimrcp/3rd-libs/libwebsockets-4.3-stable/lib/core/context.c:1689
#3 0x00007ff4290ad1fb in lws_context_destroy (context=0x7ff410068250) at /GIT/unimrcp/3rd-libs/libwebsockets-4.3-stable/lib/core/context.c:1936
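A common defensive idiom against this kind of double close is to invalidate the stored descriptor as soon as it has been closed, so that a later teardown path sees -1 and does nothing. The sketch below illustrates that idiom only; `close_once` is a hypothetical helper, not a libwebsockets function, and this is not necessarily how the upstream patch works. It also helps only if both teardown paths read the descriptor from the same variable; if two structures each hold their own copy of the number, one of them has to be made the single owner.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: close *fd at most once.
 * Returns 0 if the descriptor was closed by this call, 1 if it had already
 * been closed (nothing done), or -1 if close() itself reported an error.
 */
int close_once(int *fd)
{
    if (*fd < 0)
        return 1;           /* already closed by an earlier teardown path */

    int rc = close(*fd);
    *fd = -1;               /* invalidate so a later path cannot close a reused number */
    return rc;
}
```

With this in place, a second call such as `close_once(&fd)` from a later destroy step becomes a no-op instead of closing whatever descriptor another thread was handed in the meantime.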
Has this problem since been fixed?
I pushed a patch on main + v4.3-stable that should help with this.
There is a memory leak issue with https://github.com/warmcat/libwebsockets/commit/b486c2b545665b3174f7a466b4072b2a60916ed2
How can I reproduce that?
Write a test program with libwebsockets:
1/ Create 2000 threads with 2000 lws clients
2/ Connect to minimal-ws-server-threads
3/ Keep each connection open for 10 seconds, then disconnect and reconnect
4/ Run for 10 minutes in total; by then memory usage has grown excessively.
This test exposes two kinds of memory leak.
lws simply isn't threadsafe, so you will have all kinds of problems if you try to do that.