
Easy way to create a deadlock in the `retain` function

Open xoac opened this issue 2 months ago • 4 comments

Hi, a small modification of the `retain` callback can lead to a deadlock.

    let interval = Duration::from_secs(30);
    let max_age = Duration::from_secs(60);
    tokio::spawn(async move {
        loop {
            tokio::time::sleep(interval).await;
            pool.retain(|_, metrics| {
                let should_be_removed = pool.status().size > 1 && metrics.last_used() > max_age;
                !should_be_removed
            });
        }
    });

The code above calls `self.inner.slots.lock().unwrap()` twice inside `retain`, because `status()` takes the same lock. The easiest fix would be to make `status()` lock-free. `Status` is already documented as not being consistent, so I think that should be achievable.

xoac · Oct 06 '25 10:10

I've never thought about using pool.status() within the retain callback. That is indeed unfortunate and will cause a deadlock as retain already locks the pool.

In order to make `status()` not lock the pool I would need to add two `AtomicUsize` fields tracking the `size` and `max_size`.

Alternatively, what do you think about passing the `Status` to the retain callback as a parameter instead? It's relatively cheap to construct and shouldn't affect `retain` performance in any meaningful way.

So your code would become:

pool.retain(|_, metrics, status| {
    let should_be_removed = status.size > 1 && metrics.last_used() > max_age;
    !should_be_removed
});

bikeshedder · Oct 10 '25 11:10

@bikeshedder I think your suggested code is incorrect, because you would want `status.size` to be updated on every iteration of the loop. Actually my code with `pool.status().size` is not correct either, because the status is only updated after the retain loop finishes.

Actually, the code I use looks like this:

  let mut pool_size = pool.status().size;
  let pool_size_ref = &mut pool_size;

  // Remove all connections that were not used within a specific timeout
  let retain_result = pool.retain(|_item, metrics| {
      let should_be_removed = *pool_size_ref > 1 && metrics.last_used() > idle_timeout;

      if should_be_removed {
          *pool_size_ref -= 1;
      }

      // `true` means keep the item
      !should_be_removed
  });

But my point with this issue wasn't to get it working for my own purpose. I just wanted to point out the API defect. I think the `AtomicUsize` fix is a great idea that would also let you keep at least one live connection.

xoac · Oct 10 '25 13:10

> @bikeshedder I think your suggested code is incorrect, because you would want `status.size` to be updated on every iteration of the loop. Actually my code with `pool.status().size` is not correct either, because the status is only updated after the retain loop finishes.

The status is passed as an argument to the closure and would be updated for every run of the closure.

Regardless, you're right to point out that a status function that needs to lock the pool internals isn't pretty. I guess you were quite surprised to learn that the hard way. 🙈

bikeshedder · Oct 10 '25 14:10

Not exactly. It was a small PR, but I decided to write a test anyway. I already had a setup that runs Postgres in a container, and when I ran it the test just froze. Once I had ruled out a problem on the test side, I checked the `retain` and `status` implementations and then created this PR.

Actually it was very exciting. I hadn't dealt with a deadlock in at least a year, maybe more. And finding the root cause took less than two hours, thanks to the clear implementation on your side and Rust's safety guarantees.

xoac · Oct 14 '25 10:10