arc-swap icon indicating copy to clipboard operation
arc-swap copied to clipboard

Potential Race Condition in Debt Payment System Leading to Crashes

Open droideck opened this issue 6 months ago • 5 comments

Summary

We're experiencing crashes after upgrading from concread 0.2.21 to 0.5.5 in 389-ds-base (0.2.21 hasn't used arc-swap). And now, the crashes occur in arc-swap's debt payment system during Option<Arc<T>> operations, with both stack traces and Valgrind data races pointing to the same area.

Environment

  • arc-swap: 1.7.1
  • concread: 0.5.5
  • 389-ds-base: latest versions with NDN cache rework
  • Setup: SSSD + IPA with contention on Normalized DN cache

Reproducibility: Difficult to reproduce - requires specific timing conditions under load in production SSSD/IPA setups.

Evidence

Stack Trace

The crash consistently occurs in this call chain:

#0  core::sync::atomic::atomic_add<usize>  ← CRASH HERE
#1  core::sync::atomic::AtomicUsize::fetch_add
#2  alloc::sync::{impl#28}::clone<...> (Arc clone)
#3  core::option::{impl#5}::clone<...> (Option<Arc> clone)
#4  arc_swap::ref_cnt::RefCnt::inc<Option<Arc<...>>>
#5  arc_swap::debt::{impl#2}::pay_all::{closure#0}<...>
#6  arc_swap::debt::list::{impl#3}::with::{closure#0}<...>
#7  std::thread::local::LocalKey<arc_swap::debt::list::LocalNode>::try_with<...>
#8  arc_swap::debt::list::LocalNode::with<...>
#9  arc_swap::debt::Debt::pay_all<...>
#10 arc_swap::strategy::hybrid::{impl#5}::wait_for_readers<...>
#11 arc_swap::{impl#7}::drop<...>
#12 core::ptr::drop_in_place<arc_swap::ArcSwapAny<...>>
#13 core::ptr::drop_in_place<concread::internals::lincowcell::LinCowCellInner<...>>
#14 alloc::sync::Arc<...>::drop_slow<...>

Full Stacktrace

Valgrind Data Races

Helgrind detected multiple concurrent data races in the debt system (I've tried to stress it with a custom test - so I'm not fully sure it's valid, but maybe it'll give a hint?):

Race 1: Node::get access

==18238== Possible data race during read of size 8 at 0x4A8CEF0 by thread #4
==18238== Locks held: none
==18238==    at 0x16FF2E: arc_swap::debt::list::Node::get
==18238==    by 0x129D3B: arc_swap::debt::list::LocalNode::with
==18238==    by 0x124693: std::sys::backtrace::__rust_begin_short_backtrace
==18238==    by 0x123026: core::ops::function::FnOnce::call_once{{vtable.shim}}
==18238==    by 0x19DF5A: std::sys::pal::unix::thread::Thread::new::thread_start
==18238==  Address 0x4a8cef0 is 112 bytes inside a block of size 128 alloc'd
==18238==    at 0x484F062: posix_memalign (vg_replace_malloc.c:2226)
==18238==    by 0x19AE5D: __rustc::__rdl_alloc
==18238==    by 0x16FF50: arc_swap::debt::list::Node::get
==18238==    by 0x129D3B: arc_swap::debt::list::LocalNode::with
==18238==  Block was alloc'd by thread #3

Race 2: pay_all closure

==18238== Possible data race during read of size 8 at 0x4A8D3C0 by thread #4
==18238== Locks held: none
==18238==    at 0x129B3B: arc_swap::debt::Debt::pay_all::{{closure}}
==18238==    by 0x129D5A: arc_swap::debt::list::LocalNode::with
==18238==    by 0x124693: std::sys::backtrace::__rust_begin_short_backtrace
==18238==    by 0x123026: core::ops::function::FnOnce::call_once{{vtable.shim}}
==18238==    by 0x19DF5A: std::sys::pal::unix::thread::Thread::new::thread_start
==18238==  Address 0x4a8d3c0 is 0 bytes inside a block of size 24 alloc'd
==18238==    at 0x48468AF: malloc (vg_replace_malloc.c:446)
==18238==    by 0x1247F3: std::sys::backtrace::__rust_begin_short_backtrace
==18238==    by 0x123026: core::ops::function::FnOnce::call_once{{vtable.shim}}

Analysis and Context

There may be a race condition in arc-swap's debt payment system when handling Option<Arc<T>> values, particularly when rapid Arc reference count changes are involved under high contention.

  1. The crash occurs specifically during Option<Arc<T>>::clone operations within RefCnt::inc
  2. Both crashes and data races are concentrated in the debt payment system
  3. The issue manifests during the wait_for_readerspay_all sequence
  4. High contention and right timing scenarios trigger the problem

This issue originally surfaced in the concread repository (issue #137). A proposed fix in concread (using load_full() instead of load()) was tested but did not resolve the crashes, suggesting the root cause lies deeper in arc-swap's debt management system.

Request

Could you please investigate/suggest whether:

  1. There's a known race condition in the debt payment system for Option<Arc<T>> types?
  2. The analysis pointing to arc-swap is correct, or if this represents a misuse pattern in concread?
  3. Are there recommended workarounds or fixes for this issue?

This is affecting NDN cache functionality (it becomes unusable for some systems), so any guidance would be greatly appreciated.

Additional Context

The related concread code - lincowcell_async module could serve as a starting point for understanding how concread interacts with arc-swap.

Note: I'm relatively new to this area, so please verify this analysis. Happy to provide additional debugging information if needed (but no reproducer for now, though).

droideck avatar Jun 10 '25 04:06 droideck

Hello

I'm hunting something that may be related or the same thing, for several months now, in #156. So far, I have both several sources that tell me it's wrong, and a reasoning / semi-proof that it's correct. Your stack traces from valgrind might give another data point to the search.

As for workarounds:

  • I have strong suspicion this only happens in scenarios of concurrent writers (eg. store / swap or similar).
  • Maybe you could try using older versions of arc-swap, older versions used different systems for managing the memory.

vorner avatar Jun 14 '25 19:06 vorner

In concread this is not a heavy write case actually. Concread is many readers with a singular writer, and the single write is protected by a separate mutex. What we are seeing is this in a heavy read case.

Firstyear avatar Jun 17 '25 01:06 Firstyear

I have a suspicion what I've overlooked here. Could you try with this commit, to confirm the suspicion? https://github.com/vorner/arc-swap/commit/c521229c16fdc85a3a09d9bf888a9a3dacd1565b (or, it's the experiment branch in the repo).

(in case you're not familiar, there's a way to override a dependency: https://doc.rust-lang.org/cargo/reference/overriding-dependencies.html)

vorner avatar Jun 20 '25 08:06 vorner

@droideck As you have the reproducer, would you be able to test this at some point?

Firstyear avatar Jun 21 '25 00:06 Firstyear

Unfortunately, we don't have a direct reproducer for this issue. We've a customer with a testing environment who has been willing to test, but we've already provided them with a couple of test builds, and the next one scheduled is with arc-swap removed from concread entirely.

So unfortunately, we can't test the commit at this time. I'm sorry we can't help verify your suspicion right now...

If we manage to find an exact reproducer, I'll definitely let you know, and we can test it then!

droideck avatar Jun 21 '25 03:06 droideck