Potential Race Condition in Debt Payment System Leading to Crashes
Summary
We're experiencing crashes after upgrading from concread 0.2.21 to 0.5.5 in 389-ds-base (0.2.21 hasn't used arc-swap).
And now, the crashes occur in arc-swap's debt payment system during Option<Arc<T>> operations, with both stack traces and Valgrind data races pointing to the same area.
Environment
- arc-swap: 1.7.1
- concread: 0.5.5
- 389-ds-base: latest versions with NDN cache rework
- Setup: SSSD + IPA with contention on Normalized DN cache
Reproducibility: Difficult to reproduce - requires specific timing conditions under load in production SSSD/IPA setups.
Evidence
Stack Trace
The crash consistently occurs in this call chain:
#0 core::sync::atomic::atomic_add<usize> ← CRASH HERE
#1 core::sync::atomic::AtomicUsize::fetch_add
#2 alloc::sync::{impl#28}::clone<...> (Arc clone)
#3 core::option::{impl#5}::clone<...> (Option<Arc> clone)
#4 arc_swap::ref_cnt::RefCnt::inc<Option<Arc<...>>>
#5 arc_swap::debt::{impl#2}::pay_all::{closure#0}<...>
#6 arc_swap::debt::list::{impl#3}::with::{closure#0}<...>
#7 std::thread::local::LocalKey<arc_swap::debt::list::LocalNode>::try_with<...>
#8 arc_swap::debt::list::LocalNode::with<...>
#9 arc_swap::debt::Debt::pay_all<...>
#10 arc_swap::strategy::hybrid::{impl#5}::wait_for_readers<...>
#11 arc_swap::{impl#7}::drop<...>
#12 core::ptr::drop_in_place<arc_swap::ArcSwapAny<...>>
#13 core::ptr::drop_in_place<concread::internals::lincowcell::LinCowCellInner<...>>
#14 alloc::sync::Arc<...>::drop_slow<...>
Valgrind Data Races
Helgrind detected multiple concurrent data races in the debt system (I've tried to stress it with a custom test - so I'm not fully sure it's valid, but maybe it'll give a hint?):
Race 1: Node::get access
==18238== Possible data race during read of size 8 at 0x4A8CEF0 by thread #4
==18238== Locks held: none
==18238== at 0x16FF2E: arc_swap::debt::list::Node::get
==18238== by 0x129D3B: arc_swap::debt::list::LocalNode::with
==18238== by 0x124693: std::sys::backtrace::__rust_begin_short_backtrace
==18238== by 0x123026: core::ops::function::FnOnce::call_once{{vtable.shim}}
==18238== by 0x19DF5A: std::sys::pal::unix::thread::Thread::new::thread_start
==18238== Address 0x4a8cef0 is 112 bytes inside a block of size 128 alloc'd
==18238== at 0x484F062: posix_memalign (vg_replace_malloc.c:2226)
==18238== by 0x19AE5D: __rustc::__rdl_alloc
==18238== by 0x16FF50: arc_swap::debt::list::Node::get
==18238== by 0x129D3B: arc_swap::debt::list::LocalNode::with
==18238== Block was alloc'd by thread #3
Race 2: pay_all closure
==18238== Possible data race during read of size 8 at 0x4A8D3C0 by thread #4
==18238== Locks held: none
==18238== at 0x129B3B: arc_swap::debt::Debt::pay_all::{{closure}}
==18238== by 0x129D5A: arc_swap::debt::list::LocalNode::with
==18238== by 0x124693: std::sys::backtrace::__rust_begin_short_backtrace
==18238== by 0x123026: core::ops::function::FnOnce::call_once{{vtable.shim}}
==18238== by 0x19DF5A: std::sys::pal::unix::thread::Thread::new::thread_start
==18238== Address 0x4a8d3c0 is 0 bytes inside a block of size 24 alloc'd
==18238== at 0x48468AF: malloc (vg_replace_malloc.c:446)
==18238== by 0x1247F3: std::sys::backtrace::__rust_begin_short_backtrace
==18238== by 0x123026: core::ops::function::FnOnce::call_once{{vtable.shim}}
Analysis and Context
There may be a race condition in arc-swap's debt payment system when handling Option<Arc<T>> values, particularly when rapid Arc reference count changes are involved under high contention.
- The crash occurs specifically during
Option<Arc<T>>::cloneoperations withinRefCnt::inc - Both crashes and data races are concentrated in the debt payment system
- The issue manifests during the
wait_for_readers→pay_allsequence - High contention and right timing scenarios trigger the problem
This issue originally surfaced in the concread repository (issue #137). A proposed fix in concread (using load_full() instead of load()) was tested but did not resolve the crashes, suggesting the root cause lies deeper in arc-swap's debt management system.
Request
Could you please investigate/suggest whether:
- There's a known race condition in the debt payment system for
Option<Arc<T>>types? - The analysis pointing to arc-swap is correct, or if this represents a misuse pattern in concread?
- Are there recommended workarounds or fixes for this issue?
This is affecting NDN cache functionality (it becomes unusable for some systems), so any guidance would be greatly appreciated.
Additional Context
The related concread code - lincowcell_async module could serve as a starting point for understanding how concread interacts with arc-swap.
Note: I'm relatively new to this area, so please verify this analysis. Happy to provide additional debugging information if needed (but no reproducer for now, though).
Hello
I'm hunting something that may be related or the same thing, for several months now, in #156. So far, I have both several sources that tell me it's wrong, and a reasoning / semi-proof that it's correct. Your stack traces from valgrind might give another data point to the search.
As for workarounds:
- I have strong suspicion this only happens in scenarios of concurrent writers (eg. store / swap or similar).
- Maybe you could try using older versions of arc-swap, older versions used different systems for managing the memory.
In concread this is not a heavy write case actually. Concread is many readers with a singular writer, and the single write is protected by a separate mutex. What we are seeing is this in a heavy read case.
I have a suspicion what I've overlooked here. Could you try with this commit, to confirm the suspicion? https://github.com/vorner/arc-swap/commit/c521229c16fdc85a3a09d9bf888a9a3dacd1565b (or, it's the experiment branch in the repo).
(in case you're not familiar, there's a way to override a dependency: https://doc.rust-lang.org/cargo/reference/overriding-dependencies.html)
@droideck As you have the reproducer, would you be able to test this at some point?
Unfortunately, we don't have a direct reproducer for this issue. We've a customer with a testing environment who has been willing to test, but we've already provided them with a couple of test builds, and the next one scheduled is with arc-swap removed from concread entirely.
So unfortunately, we can't test the commit at this time. I'm sorry we can't help verify your suspicion right now...
If we manage to find an exact reproducer, I'll definitely let you know, and we can test it then!