librealsense [BUG] Deadlock when using multiple RealSense cameras (librealsense thread sanitizer report)

Hi,

I have been in contact with Yunsheng from Zendesk about this issue. I am now posting it here in the hopes of getting a resolution, as this issue is now bringing our system to its knees after we migrated to ROS2, which obviously changed some internal timing.

We identified a deadlock in librealsense when operating a system with multiple cameras (4x RealSense, 4x RGB cameras) on a custom Jetson AGX Orin platform. This deadlock can also cause the kernel-side I2C to degrade due to orphaned kernel-side locks.

The deadlock is caused by an inconsistent locking order between:

platform::uvc_device mutex
platform::named_mutex used inside invoke_powered()

Root Cause:

invoke_powered() locks named_mutex (M1).
Inside its lambda, code locks uvc_device (M0) afterwards.
In other paths, uvc_device is locked before calling into functions that trigger invoke_powered().
This creates an M1→M0 vs M0→M1 inversion ➔ Deadlock.
There are other instances of inconsistent locking that also appear once the initial trigger is patched, which points to a more fundamental issue with locking in parts of librealsense.

ThreadSanitizer (TSan) confirms the inversion reliably.
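
For readers who want to see the pattern outside the SDK, here is a minimal sketch using two plain mutexes. It is not librealsense code: `device_mutex` is only a stand-in for the uvc_device lock (M0), `named_mutex` is a stand-in for platform::named_mutex (M1), and `path_a`, `path_b` and `run_both_paths` are hypothetical names. Both paths take the locks in the same M0 → M1 order, which is the property that rules out the cycle described above.

```cpp
#include <mutex>
#include <thread>

// Stand-ins for the two locks in the report (illustrative names only).
std::mutex device_mutex;  // M0: plays the role of the uvc_device lock
std::mutex named_mutex;   // M1: plays the role of platform::named_mutex

int shared_counter = 0;   // protected by holding both locks

// Path A: a caller that takes the device lock first, then the named mutex.
void path_a()
{
    std::lock_guard<std::mutex> dev_lock(device_mutex);  // M0
    std::lock_guard<std::mutex> pwr_lock(named_mutex);   // M1
    ++shared_counter;
}

// Path B: the invoke_powered-style path. The deadlock-prone variant would
// take M1 and then M0 here. Using the same M0 -> M1 order as path_a means
// two threads can never each hold one lock while waiting for the other.
void path_b()
{
    std::lock_guard<std::mutex> dev_lock(device_mutex);  // M0
    std::lock_guard<std::mutex> pwr_lock(named_mutex);   // M1
    ++shared_counter;
}

// Runs both paths concurrently and returns the final counter value.
int run_both_paths(int iterations)
{
    shared_counter = 0;
    std::thread a([=] { for (int i = 0; i < iterations; ++i) path_a(); });
    std::thread b([=] { for (int i = 0; i < iterations; ++i) path_b(); });
    a.join();
    b.join();
    return shared_counter;
}
```

Calling `run_both_paths(1000)` completes and returns 2000. If `path_b` took M1 before M0 instead, the same call could hang under the right interleaving. Where both locks can be taken at a single point, C++17 `std::scoped_lock` is another option, since it acquires several mutexes using a deadlock-avoidance algorithm.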

Environment

Item	Value
librealsense Version	2.56.4
OS	Ubuntu 20.04
Kernel	5.10.216 (Cartken modified kernel)
Platform	NVIDIA Jetson AGX Orin
Cameras Connected	4x RealSense D400-series + 4x additional RGB cameras (non-RealSense)
Build Type	CMake + Make with `-fsanitize=thread`

Steps to Reproduce

Connect 4 RealSense cameras + 4 other RGB cameras via FAKRA. For us it is via a MAX9295 + MAX9296 SerDes.
Start simultaneous video streams on all cameras.
Open and initialise all RealSense cameras via the SDK (using rs2::pipeline, device_list, etc.).
Under load (critical), observe deadlock or hanging behaviour.
TSan reports lock inversion between uvc_device and named_mutex.

Minimal Problem Example

// BAD pattern (causing deadlock)
return _uvc.invoke_powered([this](platform::uvc_device & dev) {
    std::lock_guard<platform::uvc_device> lock(dev); // ⚠️ too late
    ...
});

Corrected Safe Pattern:

// GOOD pattern
auto dev = _uvc.get_device();
std::lock_guard<platform::uvc_device> lock(*dev); // 🔒 M0 first

return _uvc.invoke_powered_unlocked([&](platform::uvc_device & dev) {
    // Safe: device lock already held
});

Band-aid Patch we're using (doesn't solve the issue)

From e59f5975b8b22aa181e6b0df1bf471bd9da70bd7 Mon Sep 17 00:00:00 2001
From: Alexander Hoffman <[email protected]>
Date: Wed, 23 Apr 2025 17:06:03 +0200
Subject: [PATCH] [TMP] Remove invoke_powered deadlock

command_transfer_over_xu::send_receive uses
uvc_sensor::invoke_powered with a lambda and inside
invoke_powered named_mutex is taken inside the body
of the function template before the lambda is invoked,
in which the uvc_device lock is taken. Elsewhere in
librealsense the uvc_device lock is always obtained
before named_mutex, so this creates a deadlock
---
 src/uvc-sensor.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/uvc-sensor.h b/src/uvc-sensor.h
index f2370d56b..bab0fab70 100644
--- a/src/uvc-sensor.h
+++ b/src/uvc-sensor.h
@@ -39,6 +39,7 @@ public:
     template< class T >
     auto invoke_powered( T action ) -> decltype( action( *static_cast< platform::uvc_device * >( nullptr ) ) )
     {
+        std::lock_guard<platform::uvc_device> lock(*_device);
         power on( std::dynamic_pointer_cast< uvc_sensor >( shared_from_this() ) );
         return action( *_device );
     }
-- 
2.49.0

TSan Output (summary)

tsan.log

Detected multiple cycles M0 ➔ M1 and M1 ➔ M0.
Locations:
- command_transfer_over_xu::send_receive
- locked_transfer::send_receive
- group_multiple_fw_calls
In each case invoke_powered() locks the named_mutex first.
The lambda it invokes then locks the uvc_device → this causes the lock inversion.

Additional Notes

This issue is more likely to trigger under multi-camera, multi-threaded loads.
Fixing the lock order significantly improves stability and prevents random system hangs.
The locking in this part of librealsense also doesn't make logical sense, sometimes locking a "low level" device before locking a "higher level" data structure relating to the device. I think that a slight overhaul of the locking logic would be very beneficial.
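
One way to make such an overhaul enforceable is a ranked (hierarchical) mutex that rejects out-of-order acquisition at runtime. The sketch below is a generic illustration of that technique, not anything that exists in librealsense: `ranked_mutex`, the rank values, and the `high_level` / `low_level` names are all hypothetical, and the choice of which real lock gets which rank is left open.

```cpp
#include <climits>
#include <mutex>
#include <stdexcept>

// Hypothetical helper: a mutex with a rank. A thread may only lock a mutex
// whose rank is strictly lower than the rank of the one it locked last,
// so higher-level locks must always be taken before lower-level ones.
class ranked_mutex
{
public:
    explicit ranked_mutex(unsigned long rank) : _rank(rank), _previous_rank(0) {}

    void lock()
    {
        if (current_rank <= _rank)
            throw std::logic_error("lock order violation");
        _mutex.lock();
        _previous_rank = current_rank;  // remember what this thread held before
        current_rank = _rank;
    }

    void unlock()
    {
        current_rank = _previous_rank;  // restore the thread's previous rank
        _mutex.unlock();
    }

private:
    std::mutex _mutex;
    const unsigned long _rank;
    unsigned long _previous_rank;
    static thread_local unsigned long current_rank;
};

// A thread that holds nothing may lock a mutex of any rank.
thread_local unsigned long ranked_mutex::current_rank = ULONG_MAX;

ranked_mutex high_level(2000);  // e.g. a higher-level data structure
ranked_mutex low_level(1000);   // e.g. a lower-level resource

// Correct order: high-level first, then low-level.
bool ordered_ok()
{
    std::lock_guard<ranked_mutex> a(high_level);
    std::lock_guard<ranked_mutex> b(low_level);
    return true;
}

// Inverted order: the second acquisition throws instead of risking a deadlock.
bool inverted_is_rejected()
{
    std::lock_guard<ranked_mutex> b(low_level);
    try {
        std::lock_guard<ranked_mutex> a(high_level);
    } catch (const std::logic_error &) {
        return true;
    }
    return false;
}
```

With this in place an inversion fails immediately and deterministically on the offending thread, rather than only showing up as a hang under load. It complements TSan instead of replacing it, because it only checks locks that have been given a rank.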

Apr 28 '25 08:04 alxhoff

@TimStricker

Apr 28 '25 08:04 alxhoff

@alxhoff thanks for reporting this issue. I am not sure I understand whether you proposed a working solution. If so, feel free to submit a PR and we will review it. Otherwise, we will investigate when possible.

Apr 28 '25 18:04 Nir-Az

Hi @Nir-Az,

No, not a complete solution as some refactoring would be needed. For the one particular inversion, I took the one lock before the other, and being recursive, it still retakes the lock, which isn't ideal. But we also delayed the bringup of librealsense until we had less system load, which isn't a real solution, but will allow our realsense to come online until there is a proper solution.
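
To illustrate what "being recursive, it still retakes the lock" means in practice, here is a minimal sketch. It assumes a `std::recursive_mutex` as a stand-in for the uvc_device lock and a plain `std::mutex` as a stand-in for the named mutex; the actual lock types inside librealsense may differ, and `invoke_powered_sketch` / `send_receive_sketch` are hypothetical names.

```cpp
#include <functional>
#include <mutex>

std::recursive_mutex device_lock;  // stand-in for the uvc_device lock (M0)
std::mutex power_lock;             // stand-in for the named mutex (M1)

// Hypothetical analogue of the patched invoke_powered(): take M0 first,
// then M1, then run the action while both are held.
int invoke_powered_sketch(const std::function<int()> & action)
{
    std::lock_guard<std::recursive_mutex> dev(device_lock);  // M0, first acquisition
    std::lock_guard<std::mutex> pwr(power_lock);             // M1
    return action();
}

// Hypothetical analogue of a caller whose lambda locks the device again.
int send_receive_sketch()
{
    return invoke_powered_sketch([] {
        // Second acquisition of M0 on the same thread. This succeeds only
        // because the mutex is recursive; it adds no protection, it just
        // no longer changes the effective lock order.
        std::lock_guard<std::recursive_mutex> dev(device_lock);
        return 42;
    });
}
```

The effective order on this path becomes M0 → M1 → M0 (re-entrant), so the first acquisition is what fixes the ordering and the one inside the lambda is redundant. Removing that inner lock is the kind of refactoring a complete fix would need.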

Apr 29 '25 07:04 alxhoff

Hi @alxhoff,

I am trying to assemble a system that reproduces the bug in order to verify any suggested solution actually solves the issue. Can you please elaborate on the deadlock?

What is your scenario? tsan.log shows a possible deadlock during initialization; however, I assumed that the system under load was already streaming.
What are the issues you observe?
Can the system be recovered by resetting one (or more) of the cameras?

Thank you

Jun 04 '25 08:06 OhadMeir

Hey, thanks for the reply. So to answer your questions:

We have a Jetson AGX Orin with 4x RGB (non-RealSense) and 4x D430 cameras on a D457 backbone, all connected via GMSL to an expansion board where the cameras are connected via Maxim SerDes. And yes, the problem for us is that the race condition becomes apparent when the system is under load. Our temporary in-house solution has been to delay librealsense initialisation until the system load has decreased a bit, and this seems to let us get past the race condition. That is obviously just a band-aid for a large issue. This has become a problem for us since migrating to ROS2, which seems to create larger system loads during init and as such causes this race condition to surface.
We observe IOCTL errors coming from the kernel side, but these are due to devices being unresponsive: userspace I2C calls deadlock and leave the kernel side with dangling I2C-subsystem locks.
I am not sure what you mean by reset. We do not have cyclable power connections to the cameras, and the d4xx kernel driver is not really designed to be hot-swappable. A software reset through librealsense sadly also depends on the kernel-side I2C, which is locked up, so a userspace->kernel software reset is impossible.

Hope this helps. My recommendation would be a TSan analysis of the SDK; there are a few clear inversions there. Personally, if I had the time, I would also reassess the locking logic inside the SDK, as it doesn't always make sense: sometimes it seems to lock at a low (device-near) level and then lock at a high (file-system-near) level, which IMO is the wrong order.

Cheers,

Alex

Jun 06 '25 07:06 alxhoff

Hi @alxhoff

We were unable to reproduce a deadlock state in our labs. However, going over the code, I was able to refactor it and remove some of the locks. There is a pending PR with the changes: https://github.com/IntelRealSense/librealsense/pull/14095. Please try it and check whether it solves your issue.

Thanks Ohad

Jun 25 '25 17:06 OhadMeir