
Ways to debug deadlocks

Open gitmalong opened this issue 2 years ago • 13 comments
Hi!

What is the recommended approach to debug dead locks associated with Dashmap? Is there some tooling for that purpose?

Thanks

gitmalong avatar Dec 29 '22 16:12 gitmalong

Hey, so this can be a bit tricky. Usually I just fire up lmdb and dump the backtraces of every thread when they happen and go through their callstacks. Does that help?

xacrimon avatar Dec 29 '22 17:12 xacrimon

Thanks, I will give that a try. For tokio::sync::RwLock the tokio-console crate can be used. I think my issue is not related to this library, so I'm going to close it.

gitmalong avatar Dec 29 '22 18:12 gitmalong

Could something like this be integrated into Dashmap for debugging purposes? https://lib.rs/crates/no_deadlocks

gitmalong avatar Dec 29 '22 21:12 gitmalong

Hey, so this can be a bit tricky. Usually I just fire up lmdb and dump the backtraces of every thread when they happen and go through their callstacks. Does that help?

Do you have any references or doc for that (lmdb)?

gitmalong avatar Dec 29 '22 22:12 gitmalong

I wanted to give https://github.com/BurtonQin/lockbud a try but it does not work on macOS. Which approach do you normally take to get the backtraces of each thread?

gitmalong avatar Dec 30 '22 09:12 gitmalong

Sorry, I made a typo. I rely on the lldb debugger to do this. I run my program and then do thread apply all bt and sift through the backtraces.

xacrimon avatar Dec 30 '22 09:12 xacrimon

I ran (lldb) thread backtrace all and probably found the deadlocked call that I had also identified through my logs. However, I can't find the other lock that blocks.

thread #8, name = 'tokio-runtime-worker'
    frame #0: 0x000000019209e5e4 libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x00000001920da638 libsystem_pthread.dylib`_pthread_cond_wait + 1232
    frame #2: 0x00000001010708cc butterbrot`dashmap::lock::RawRwLock::lock_exclusive_slow::h3eceab46f26c3724 + 624
    frame #3: 0x0000000100a4e75c butterbrot`butterbrot_rust::core::load_and_store::_$u7b$$u7b$closure$u7d$$u7d$::h07759cdff2954bfc + 3948

Is it correct that there must be another RawRwLock or dashmap::lock frame somewhere in the backtraces to confirm a deadlock (if they relate to the same DashMap)? Unfortunately I have not found another one.

gitmalong avatar Dec 30 '22 10:12 gitmalong

Hi @xacrimon .

In https://github.com/xacrimon/dashmap/issues/79 @notgull says

I think "don't hold a lock across a .await" should be documented.

Might this be the root cause of my issue, since I have something like:

let account = dm.get_mut(&account);

if let Some(mut a) = account {
      a.value_mut().save("update").await; // <-- Holding across await point
}

gitmalong avatar Dec 31 '22 07:12 gitmalong

Reproducer


    use std::sync::Arc;
    use std::time::Duration;

    use dashmap::DashMap;

    #[tokio::test]
    async fn dashmap_async_test() {
        struct CanAsyncSave {}
        impl CanAsyncSave {
            pub async fn save(&mut self) {
                tokio::time::sleep(Duration::from_millis(1)).await;
            }
        }
        let dm: Arc<DashMap<String, CanAsyncSave>> = Default::default();
        // Insert before spawning so the spawned task's unwrap() cannot
        // race ahead of the first insert and panic on a missing key.
        dm.insert("1".into(), CanAsyncSave {});
        let dm_clone = dm.clone();

        tokio::task::spawn(async move {
            for _ in 0..100 {
                // Guard is held across the await point below.
                let mut entry = dm_clone.get_mut("1").unwrap();
                let val = entry.value_mut();
                val.save().await;
            }
        });

        for _ in 0..100 {
            dm.insert("1".into(), CanAsyncSave {});
            // Second guard on the same key, also held across an await:
            // the two tasks can deadlock each other.
            let mut entry = dm.get_mut("1").unwrap();
            let val = entry.value_mut();
            val.save().await;
        }
    }

gitmalong avatar Dec 31 '22 08:12 gitmalong

Hi @xacrimon .

In #79 @notgull says

I think "don't hold a lock across a .await" should be documented.

Might this be the root cause of my issue, since I have something like:

let account = dm.get_mut(&account);

if let Some(mut a) = account {
      a.value_mut().save("update").await; // <-- Holding across await point
}

Sounds pretty similar to using a std mutex, parking_lot, etc., where you want to avoid locking across an await. In cases like that I would either stick with a lock that is await-aware, avoid awaiting while holding it, or restructure the code.

dariusc93 avatar Jan 03 '23 22:01 dariusc93

This very helpful post covers how to keep DashMap from deadlocking in async code: https://draft.ryhl.io/blog/shared-mutable-state/ It also explains why the compiler does not warn about these deadlocks.

The gist is to never await anything while holding a lock. A lock is taken when accessing the DashMap and released when the guard is dropped... if I got that right.

The recommended approach is to never access DashMap directly in async code, but through a convenience wrapper.

matildasmeds avatar Jul 23 '23 18:07 matildasmeds

Ran into this as I was looping over my DashMap's iterator and performing an async operation within it. This deadlocked my app unpredictably.

My solution was to collect the values from the iterator synchronously first, then loop over the collected values to perform my async operation.

httpjamesm avatar Mar 13 '24 19:03 httpjamesm

I was reading Alice's blog that was mentioned in this thread.

Alice points out that the compiler doesn't complain about holding a guard (or reference) across an await because it's Send.

I wonder if the Send impl on RefMut, RefMulti, etc. could be put behind a feature gate? Or maybe no_send could be a feature? I'm sure there's a good reason these types are Send, but if users could turn that off, it could make debugging easier.

leontoeides avatar Mar 22 '24 14:03 leontoeides