Critical Application Hang due to Suspected Deadlock in raknet-native Dependency

Open Hmza1112617 opened this issue 4 months ago • 2 comments

Bug Description

Hello bedrock-protocol team,

We are experiencing a critical and persistent application hang when using the library to run a server. After a certain period of uptime and activity, the entire Node.js process becomes completely unresponsive and requires a manual kill signal to terminate. The process does not crash or log any errors; it simply freezes indefinitely. Our analysis strongly suggests this is a native-level deadlock occurring within the raknet-native C++ dependency, which is used for networking.

Steps to Reproduce

We are still working on creating a minimal reproducible example (MRE), but the hang occurs reliably in our production environment under the following conditions:

  • Set up a server using bedrock-protocol.

  • Allow multiple clients to connect and interact with the server.

  • The server will eventually hang after a period of sustained activity (ranging from a few hours to a day). We have not yet pinpointed a single action that triggers it.

We will update this issue as soon as we have a reliable MRE.

Expected Behavior

The server should remain stable and responsive, handling client connections and game logic without freezing.

Actual Behavior

The Node.js process hangs indefinitely, consuming CPU and memory while remaining completely unresponsive to network traffic and commands. We have observed the hang lasting for over 24 hours.

Technical Analysis & Logs

We ran the hung process under Valgrind (PID 12191), and the output reveals that numerous threads spawned by raknet-native are blocked indefinitely in synchronization-related system calls.

Key Findings:

  • Multiple threads (Valgrind threads 80 through 92) are in the VgTs_WaitSys state.

  • The vast majority of these threads are stuck in pthread_cond_timedwait, originating from core RakNet functions like RakClient::RunLoop, RakNet::UpdateNetworkLoop, and RakSleep.

  • The dedicated network I/O threads are blocked in recvfrom, which is expected behavior for a listening thread but confirms they are part of the overall hung system.

  • One thread (Thread 83) is notably stuck deep inside a pthread_mutex_unlock call, suggesting a highly complex lock contention scenario or a fundamental deadlock.

Primary Hypothesis

Our primary hypothesis is a deadlock or a severe condition variable mismanagement issue within the raknet-native C++ codebase. The fact that many threads performing similar tasks are all blocked suggests a systemic problem with a shared resource, a central signaling mechanism, or an inconsistent lock-ordering pattern.

Sample Stack Traces from Valgrind

Here are representative stack traces of the blocked threads:

  • Thread waiting on a condition variable (typical case):

    --12191:101:MHEAP-- VgTs_WaitSys<80>(tid=12284)
    --12191:101:MHEAP--    at 0x49669A2: __pthread_cond_timedwait_common (pthread_cond_timedwait.c:595)
    --12191:101:MHEAP--    by 0x4966B2E: pthread_cond_timedwait@@GLIBC_2.3.2 (pthread_cond_timedwait.c:699)
    --12191:101:MHEAP--    by 0x51E2838: RakSleep(unsigned int) (in /app/node_modules/raknet-native/build/Release/raknet-native.node)
    --12191:101:MHEAP--    by 0x51E2838: RakClient::RunLoop() (in /app/node_modules/raknet-native/build/Release/raknet-native.node)
    --12191:101:MHEAP--    by 0x524459F: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30)
    --12191:101:MHEAP--    by 0x4960AC2: start_thread (pthread_create.c:442)
    --12191:101:MHEAP--    by 0x4A13A03: clone (clone.S:100)

  • Thread waiting for network data:

    --12191:102:MHEAP-- VgTs_WaitSys<81>(tid=12285)
    --12191:102:MHEAP--    at 0x49FBD4F: recvfrom (recvfrom.c:28)
    --12191:102:MHEAP--    by 0x51E919D: RakNet::RNS2_Berkley::RecvFromBlockingIPV4(RakNet::RNS2RecvStruct*) (in /app/node_modules/raknet-native/build/Release/raknet-native.node)
    --12191:102:MHEAP--    by 0x51E919D: RakNet::RNS2_Berkley::RecvFromLoopInt() (in /app/node_modules/raknet-native/build/Release/raknet-native.node)
    --12191:102:MHEAP--    by 0x51E919D: RakNet::RNS2_Berkley::RecvFromLoop(void*) (in /app/node_modules/raknet-native/build/Release/raknet-native.node)
    --12191:102:MHEAP--    by 0x4960AC2: start_thread (pthread_create.c:442)
    --12191:102:MHEAP--    by 0x4A13A03: clone (clone.S:100)
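Since the full dump is long, a quick tally of the innermost frame per thread may help others triage it. The following is only a minimal sketch: `summarizeThreads` and `countByBlockingCall` are hypothetical helper names of ours (not part of Valgrind, raknet-native, or bedrock-protocol), and the parser assumes the exact line layout of the traces above, namely a `VgTs_<State><N>(tid=...)` header followed by one `at` frame and then `by` frames.

```javascript
'use strict';

// Parse a Valgrind-style thread dump and record, per thread, the innermost
// ("at") frame, i.e. the call the thread is currently blocked in.
// Assumption: the layout matches the traces quoted in this issue.
function summarizeThreads(dumpText) {
  const headerRe = /(VgTs_\w+)<(\d+)>\(tid=(\d+)\)/;
  const atRe = /\bat 0x[0-9A-Fa-f]+: (.+?)(?: \(|$)/;
  const threads = [];
  let current = null;

  for (const line of dumpText.split('\n')) {
    const header = headerRe.exec(line);
    if (header) {
      current = {
        state: header[1],
        thread: Number(header[2]),
        tid: Number(header[3]),
        blockedIn: null,
      };
      threads.push(current);
      continue;
    }
    const at = atRe.exec(line);
    if (at && current && current.blockedIn === null) {
      current.blockedIn = at[1];
    }
  }
  return threads;
}

// Group threads by the function they are blocked in.
function countByBlockingCall(threads) {
  const counts = new Map();
  for (const t of threads) {
    const key = t.blockedIn ?? '(unknown)';
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

// Abbreviated sample taken from the two traces above.
const sample = [
  '--12191:101:MHEAP-- VgTs_WaitSys<80>(tid=12284)',
  '--12191:101:MHEAP--    at 0x49669A2: __pthread_cond_timedwait_common (pthread_cond_timedwait.c:595)',
  '--12191:101:MHEAP--    by 0x4966B2E: pthread_cond_timedwait@@GLIBC_2.3.2 (pthread_cond_timedwait.c:699)',
  '--12191:102:MHEAP-- VgTs_WaitSys<81>(tid=12285)',
  '--12191:102:MHEAP--    at 0x49FBD4F: recvfrom (recvfrom.c:28)',
  '--12191:102:MHEAP--    by 0x4960AC2: start_thread (pthread_create.c:442)',
].join('\n');

console.log(Object.fromEntries(countByBlockingCall(summarizeThreads(sample))));
```

Run against the complete dump, a tally like this should make it easier to see whether Thread 83 is the only thread outside the usual pthread_cond_timedwait and recvfrom waits.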

We are actively pursuing further diagnostics using Helgrind and GDB and will post any additional findings. We would be very grateful for any insights or guidance you might have on this issue. We are available to provide more logs or help in any way we can. Thank you for your time and for maintaining this great library.
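Because the process freezes without logging anything, one additional diagnostic we are considering is a watchdog that records when the JavaScript main thread stops responding. The sketch below is an assumption-laden illustration, not something bedrock-protocol or raknet-native provides: `startWatchdog` and `isStalled` are hypothetical names, and the intervals are arbitrary defaults. The main thread bumps a counter in a SharedArrayBuffer; a worker thread polls it and writes to stderr when the counter stops changing.

```javascript
'use strict';
const { Worker } = require('worker_threads');

// Pure helper: has the heartbeat been silent for longer than the threshold?
function isStalled(lastBeatMs, nowMs, thresholdMs) {
  return nowMs - lastBeatMs > thresholdMs;
}

// Source for the watchdog worker. isStalled is inlined via toString() so the
// worker and the main thread share one definition.
const workerSource = `
  const { workerData } = require('worker_threads');
  const fs = require('fs');
  ${isStalled.toString()}
  const beat = new Int32Array(workerData.shared);
  let lastValue = Atomics.load(beat, 0);
  let lastBeatMs = Date.now();
  setInterval(() => {
    const value = Atomics.load(beat, 0);
    const nowMs = Date.now();
    if (value !== lastValue) {
      lastValue = value;
      lastBeatMs = nowMs;
    } else if (isStalled(lastBeatMs, nowMs, workerData.thresholdMs)) {
      // Synchronous write to fd 2, because console output from a worker is
      // relayed through the (possibly blocked) main thread.
      fs.writeSync(2, '[watchdog] main thread silent for ' + (nowMs - lastBeatMs) + ' ms\\n');
    }
  }, workerData.pollMs);
`;

// Start the heartbeat on the main thread and the polling worker.
// All option names and defaults here are illustrative assumptions.
function startWatchdog({ beatMs = 1000, pollMs = 1000, thresholdMs = 10000 } = {}) {
  const shared = new SharedArrayBuffer(4);
  const beat = new Int32Array(shared);

  // Heartbeat: only fires while the main event loop is still turning.
  const timer = setInterval(() => Atomics.add(beat, 0, 1), beatMs);
  timer.unref();

  const worker = new Worker(workerSource, {
    eval: true,
    workerData: { shared, pollMs, thresholdMs },
  });
  worker.unref(); // do not keep the process alive just for the watchdog

  return {
    stop() {
      clearInterval(timer);
      return worker.terminate();
    },
  };
}
```

Calling `startWatchdog()` once at server startup would be enough. If the watchdog reports silence during a hang, the main thread itself is blocked, presumably inside a native call; if it stays quiet, the event loop is still running and only the RakNet threads have stopped making progress.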

Hmza1112617, Jun 11 '25 10:06