Segmentation fault when using static binaries on unix.

Open zhayka52 opened this issue 3 years ago • 10 comments

There appears to be a threading issue in IXDNSLookup.cpp when IXWebSocket is linked as a static library.

To recreate the problem, use the main.cpp example code, and change the server URL to 127.0.0.1:8000 or any other address that does not have a running websocket server. The target platform is Ubuntu 20.04.

Using the standard dynamic linked binary, you will see something like

Connecting to ws://127.0.0.1:8000...

Connection error: Unable to connect to 127.0.0.1 on port 8000, error: Connect error: Connection refused

This message will repeat over and over again as expected.

Now, create a statically linked version of the same main.cpp

Connecting to ws://127.0.0.1:8000...

Connection error: Unable to connect to 127.0.0.1 on port 8000, error: Connect error: Connection refused

Segmentation fault (core dumped)

Upon closer examination, the culprit appears to be DNS related. Given what has been reported in ticket 362, there might be something going on with resource allocation that requires more investigation.

This change seems to work:

struct addrinfo* DNSLookup::resolve(std::string& errMsg,
                                    const CancellationRequest& isCancellationRequested,
                                    bool cancellable)
{
    // This does not work:
    // return cancellable ? resolveCancellable(errMsg, isCancellationRequested)
    //                    : resolveUnCancellable(errMsg, isCancellationRequested);

    // This does work:
    return resolveUnCancellable(errMsg, isCancellationRequested);
}

zhayka52 avatar Feb 16 '22 22:02 zhayka52

I can't reproduce that on my Mac with clang or on Ubuntu Linux 21.10 with gcc-11.

I used the ws example. While testing, I found a bug in the retry logic, which I have just fixed.

bsergean avatar Feb 17 '22 17:02 bsergean

Can you dump the coredump, or run this in gdb?

bsergean avatar Feb 17 '22 17:02 bsergean

This bug only shows up when the binary is statically linked. I can probably test on other platforms and see if I get the same results.

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
ixwebsocket/11.3.2 linux ssl/mbedtls 2.16.5 zlib 1.2.11
Connecting to ws://127.0.0.1:8000...
[New Thread 0x7ffff7ffa700 (LWP 26793)]

Connection error: Unable to connect to 127.0.0.1 on port 8000, error: Connect error: Connection refused

Thread 2 "ws://127.0.0.1:" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff7ffa700 (LWP 26793)]
0x0000000000000000 in ?? ()
(gdb) where
#0  0x0000000000000000 in ?? ()
#1  0x000000000040e5b1 in ix::WebSocket::checkConnection(bool) ()
#2  0x000000000040f608 in ix::WebSocket::run() ()
#3  0x0000000000537dc4 in execute_native_thread_routine ()
#4  0x000000000056b139 in start_thread (arg=) at pthread_create.c:477
#5  0x0000000000608ac3 in clone ()

zhayka52 avatar Feb 17 '22 17:02 zhayka52

This is also confirmed on CentOS 7.9.

(gdb) run

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
ixwebsocket/11.4.0 linux nossl
Connecting to ws://echo.websocket.org...
[New Thread 0x7ffff7ffb700 (LWP 3080)]
[New Thread 0x7ffff77f9700 (LWP 3081)]
[Thread 0x7ffff77f9700 (LWP 3081) exited]
Connection error: Expecting status 101 (Switching Protocol), got 200 status connecting to ws://echo.websocket.org, HTTP Status line: HTTP/1.1 200 OK

[New Thread 0x7ffff77f9700 (LWP 3082)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff77f9700 (LWP 3082)]
0x00007ffff6de15ed in internal_getent () from /lib64/libnss_files.so.2
Missing separate debuginfos, use: debuginfo-install glibc-2.17-325.el7_9.x86_64
(gdb)

I checked out the latest build from GitHub and made the following changes:


diff --git a/CMakeLists.txt b/CMakeLists.txt
index e7341bc..7e09a50 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -12,7 +12,7 @@ set (CMAKE_CXX_STANDARD 11)
 set (CXX_STANDARD_REQUIRED ON)
 set (CMAKE_CXX_EXTENSIONS OFF)
 
-option (BUILD_DEMO OFF)
+set (BUILD_DEMO ON)
 
 if (${CMAKE_SYSTEM_NAME} MATCHES "Linux")
   set(CMAKE_POSITION_INDEPENDENT_CODE ON)
@@ -233,6 +233,7 @@ endif()
 
 option(USE_ZLIB "Enable zlib support" TRUE)
 
+set (USE_ZLIB false)
 if (USE_ZLIB)
   # This ZLIB_FOUND check is to help find a cmake manually configured zlib
   if (NOT ZLIB_FOUND)
@@ -315,5 +316,5 @@ endif()
 
 if (BUILD_DEMO) 
   add_executable(demo main.cpp)
-  target_link_libraries(demo ixwebsocket) 
+  target_link_libraries(demo ixwebsocket -static) 
 endif()
diff --git a/main.cpp b/main.cpp
index 8512537..4278c97 100644
--- a/main.cpp
+++ b/main.cpp
@@ -28,7 +28,8 @@ int main()
 
     // Connect to a server with encryption
     // See https://machinezone.github.io/IXWebSocket/usage/#tls-support-and-con
-    std::string url("wss://echo.websocket.org");
+    //std::string url("wss://echo.websocket.org");
+    std::string url("ws://echo.websocket.org");
     webSocket.setUrl(url);
 
     std::cout << ix::userAgent() << std::endl;

zhayka52 avatar Feb 17 '22 21:02 zhayka52

Is the executable a static executable?

What does ldd my_exe say?

bsergean avatar Feb 28 '22 21:02 bsergean

The binary is statically linked.

demo: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, BuildID[sha1]=5f9873590e43f36045261afb005075298df00dbe, for GNU/Linux 3.2.0, with debug_info, not stripped

Can you try to replicate the problem using the git diff output provided in this thread? This issue can be consistently reproduced on at least Ubuntu and CentOS.

While a workaround has been found, it might significantly impact DNS performance. The current design appears to allow more efficient lookups by cancelling duplicated calls. Is that what the code is doing?

zhayka52 avatar Feb 28 '22 22:02 zhayka52

I see. I never tested fully statically linked binaries. I've seen odd behavior with threading (pthread) and static binaries, so this isn't too surprising.

The code makes it possible to 'cancel' a hung DNS query; running the DNS lookup (getaddrinfo) on a background thread isn't really done for performance reasons. On mobile platforms, blocking the main thread is forbidden, which is why we're doing this.
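As an illustration of the pattern described here (not the actual IXDNSLookup.cpp code), a minimal sketch of a cancellable lookup might run getaddrinfo on a detached thread and poll a cancellation callback. The names LookupState and resolveWithCancel are hypothetical, and the 10 ms polling interval is an arbitrary assumption.

```cpp
#include <netdb.h>
#include <sys/socket.h>
#include <sys/types.h>

#include <chrono>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <thread>

// Hypothetical shared state between the caller and the lookup thread.
struct LookupState
{
    std::mutex mutex;
    bool done = false;       // set by the worker once getaddrinfo returned
    bool abandoned = false;  // set by the caller when it stops waiting
    struct addrinfo* result = nullptr;
    int error = 0;
};

// Hypothetical helper. Returns the resolved list (release it with
// freeaddrinfo), or nullptr on failure or cancellation, with errMsg set.
struct addrinfo* resolveWithCancel(const std::string& host,
                                   const std::string& port,
                                   const std::function<bool()>& isCancelled,
                                   std::string& errMsg)
{
    auto state = std::make_shared<LookupState>();

    // The worker owns copies of host/port and a reference to the state,
    // so it stays valid even if the caller has already returned.
    std::thread([state, host, port] {
        struct addrinfo hints{};
        hints.ai_family = AF_UNSPEC;
        hints.ai_socktype = SOCK_STREAM;

        struct addrinfo* res = nullptr;
        int err = getaddrinfo(host.c_str(), port.c_str(), &hints, &res);

        std::lock_guard<std::mutex> lock(state->mutex);
        if (state->abandoned)
        {
            // Nobody is waiting anymore: free the result ourselves.
            if (res != nullptr) freeaddrinfo(res);
            return;
        }
        state->result = res;
        state->error = err;
        state->done = true;
    }).detach();

    while (true)
    {
        {
            std::lock_guard<std::mutex> lock(state->mutex);
            if (state->done)
            {
                if (state->error != 0)
                {
                    errMsg = gai_strerror(state->error);
                    return nullptr;
                }
                return state->result;
            }
            if (isCancelled())
            {
                state->abandoned = true;
                errMsg = "cancelled";
                return nullptr;
            }
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}
```

A caller would pass a callback such as `[] { return false; }` to wait for completion. Cancelling only stops the wait: the underlying getaddrinfo call keeps running on its thread until it returns on its own.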

Maybe a CMake option would be the best way to support this, so that you don't have to use a fork of the library. Or we can try to see if there's a simple modification we can make to make this work.

Sometimes the ordering of the libraries is important; I'm not sure that's the case here. Are you building glibc yourself?

bsergean avatar Mar 24 '22 15:03 bsergean

glibc is the default one from the OS. If resolveUnCancellable can be used safely on all systems other than Android, then maybe an #ifdef (ANDROID) / #else block is the easiest way to handle this condition.
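A compile-time switch along the lines discussed in this thread could look roughly like the sketch below. The macro name IXWEBSOCKET_FORCE_BLOCKING_DNS and the helper names are hypothetical (no such option is confirmed anywhere in this thread); the assumption is that the build system would define the macro for fully static builds.

```cpp
// Hypothetical selection logic; not part of IXWebSocket.
enum class ResolveMode
{
    Cancellable, // lookup on a background thread, can be abandoned
    Blocking     // lookup on the calling thread
};

// Build-wide default, controlled by a hypothetical macro that a CMake
// option could set, e.g. via add_definitions or target_compile_definitions.
constexpr ResolveMode defaultResolveMode()
{
#if defined(IXWEBSOCKET_FORCE_BLOCKING_DNS)
    return ResolveMode::Blocking;
#else
    return ResolveMode::Cancellable;
#endif
}

// Combines the build-wide default with the per-call 'cancellable' flag.
inline ResolveMode chooseResolveMode(bool cancellableRequested)
{
    if (defaultResolveMode() == ResolveMode::Blocking)
    {
        return ResolveMode::Blocking;
    }
    return cancellableRequested ? ResolveMode::Cancellable : ResolveMode::Blocking;
}
```

With the macro undefined, behavior is unchanged and the per-call flag decides; with it defined, every lookup takes the blocking path, which matches the workaround posted at the top of the thread without requiring a fork.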

zhayka52 avatar Mar 25 '22 01:03 zhayka52

This code has been running fine on Android, so it's really a problem when making static binaries.

On Android, native code is typically compiled as a shared library, which is why this problem is avoided there.

bsergean avatar Apr 30 '22 17:04 bsergean

Stale issue message

github-actions[bot] avatar Jun 30 '22 00:06 github-actions[bot]