crossbeam
crossbeam copied to clipboard
crossbeam-channel uses pure userspace spinlocks, which perform very poorly in some situations
crossbeam-channel appears to use pure spinlocks (which never block in the kernel), from userspace, which is a bad idea. Looking at the behavior with strace, it appears to be sched_yielding in a loop.
On my Linux system (Debian 5.15 kernel, standard configuration) with a simple little benchmark I wrote, I see this:
brian[13116] rico (master) ~/Desktop/crossbeam
$ cargo run --manifest-path crossbeam-channel/benchmarks/Cargo.toml --bin crossbeam-channel
[warnings because I commented stuff out elided]
warning: `benchmarks` (bin "crossbeam-channel") generated 7 warnings
Finished dev [unoptimized + debuginfo] target(s) in 0.06s
Running `target/debug/crossbeam-channel`
unbounded_spsc_sync Rust crossbeam-channel 4.473 sec
brian[13117] rico (master) ~/Desktop/crossbeam
$ taskset 1 cargo run --manifest-path crossbeam-channel/benchmarks/Cargo.toml --bin crossbeam-channel
[warnings because I commented stuff out elided]
warning: `benchmarks` (bin "crossbeam-channel") generated 7 warnings
Finished dev [unoptimized + debuginfo] target(s) in 0.06s
Running `target/debug/crossbeam-channel`
unbounded_spsc_sync Rust crossbeam-channel 106.810 sec
It takes over 20x longer with both threads pinned to the same CPU, because each thread spins until its scheduler tick expires repeatedly. In my full application, I'm using SCHED_RR in one thread, which makes this a full-on deadlock because it will never yield and let the other thread run.
I'm running https://github.com/bsilver8192/crossbeam/tree/crossbeam-channel-spin-benchmark, which is recent master (450d237d1546ac135d21f75b61fdf5a944e8421c) + my benchmark.
This seems related to #366 and #447.
https://matklad.github.io/2020/01/02/spinlocks-considered-harmful.html goes into more detail about why this is not a good implementation. It links to https://matklad.github.io/2020/01/04/mutexes-are-faster-than-spinlocks.html which has some performance numbers indicating that this can be fixed with practically no cost in any case, and greatly increased speeds in some cases. It also links to https://probablydance.com/2019/12/30/measuring-mutexes-spinlocks-and-how-bad-the-linux-scheduler-really-is/ which shows some examples of just how bad this can be.
Spinlock was replaced with Mutex from std in #835, but that somehow has not improved things at all. Old results on my laptop (from https://github.com/bsilver8192/crossbeam/tree/crossbeam-channel-spin-benchmark):
$ cargo run --manifest-path crossbeam-channel/benchmarks/Cargo.toml --bin crossbeam-channel
unbounded_spsc_sync Rust crossbeam-channel 4.685 sec
$ taskset 1 cargo run --manifest-path crossbeam-channel/benchmarks/Cargo.toml --bin crossbeam-channel
unbounded_spsc_sync Rust crossbeam-channel 66.005 sec
New results (same benchmark, but rebased on latest master at the moment (70700182c4c92c393fc76209c7acad7af69dca21)):
$ cargo run --manifest-path crossbeam-channel/benchmarks/Cargo.toml --bin crossbeam-channel
unbounded_spsc_sync Rust crossbeam-channel 4.562 sec
$ taskset 1 cargo run --manifest-path crossbeam-channel/benchmarks/Cargo.toml --bin crossbeam-channel
unbounded_spsc_sync Rust crossbeam-channel 65.679 sec
Inspected with strace, both versions look mostly the same (a lot of sched_yield() with some futex(...) calls mixed in).
I am using rustc 1.64.0-beta.3 (82bf34178 2022-08-18).
Running the benchmarks (both old and new) using taskset 1 with gdb and setting a breakpoint on sched_yield shows that most calls to sched_yield() are from Backoff::snooze(). Backoff is used in some of channel's internal data structures. This somewhat explains why there is almost no improvement from replacing the lock.
I can confirm this causes severe cpu overhead when using channel as a low throughput spsc channel, the spin locks eat away the cpu.
If you look at the channel ::recv source code you can tell it uses Backoff itself and Context::wait_until uses it again.
Furthermore Backoff seems to spin a bit too much before yielding to the scheduler? For comparison parking lot spin exponent is 3 vs crossbeam 6.
This is a profiling of a real application

I think this stems from (not ideal) micro benchmarking and benchmark marketing. By reducing spinning most channel benchmarks regress, because they are often running non-contented, with #actors <= #cpus available, and/or with no work between send/recv.
But if you make it contented (e.g. using taskset) or have significant work outside of recv/send calls the situation often reverses.
According to https://github.com/rust-lang/rust/pull/93563 by @ibraheemdev, is the real problem here not the unbounded spinning in unbounded queue (i.e., wait_write pointed by #366), but the amount of spin_loop_hint and yield_now and the fact that it is being called even where it is not needed?
With the changes from std applied, which removes all extra spinning on recv, these are the new results on my machine:
before:
unbounded_spsc_sync Rust crossbeam-channel 5.869 sec
before (taskset 1):
unbounded_spsc_sync Rust crossbeam-channel 52.105 sec
after:
unbounded_spsc_sync Rust crossbeam-channel 30.612 sec
after (taskset 1):
unbounded_spsc_sync Rust crossbeam-channel 31.899 sec
Removing the spinning seems to affect throughput in this benchmark by a lot (which makes sense considering the benchmark), but as you can see taskset 1 had almost no effect. So it does look like the issue here is the extra spinning, not the blocking queue implementation.
I can confirm that std::sync::mpsc in Rust nightly 1.67.0-nightly (c5d82ed7a 2022-11-19) does not have this problem, and crossbeam-channel 0.5.6 does still have the worse performance. Also, crossbeam-channel 0.5.6 has already fixed the deadlock I originally posted.
My timing numbers (running on a different machine than before, still amd64 Linux, only one timed run of each without any care to reduce outside interference so don't trust small differences):
| unpinned | pinned | pinned equal realtime | pinned unequal realtime | |
|---|---|---|---|---|
| crossbeam-channel 450d237d1546ac135d21f75b61fdf5a944e8421c | 2.505 | 63.061 | 66.994 | deadlock |
| crossbeam-channel 0.5.6 | 2.454 | 63.629 | 66.208 | 102.530 |
| Rust 1.65.0 | 17.778 | 11.566 | 11.292 | 8.133 |
| Rust 1.67.0-nightly (c5d82ed7a 2022-11-19) | 20.636 | 16.873 | 16.151 | 34.160 |
Some cautions about my test code, before I post it:
- Do NOT run it on a single core machine, it WILL wedge your terminal.
- It has some chance of breaking device drivers, with a slim chance of that leading to hardware damage. Rebooting to get certain devices working again is a very real possibility.
- Running it on a CONFIG_PREEMPT_RT kernel is probably going to create problems.
- Don't do it on a core that handles important interrupts, in my experience core 0 is usually the only one with dedicated interrupts but it depends.
I posted my (somewhat hacky) test code at https://github.com/bsilver8192/mpsc-spin-test if anybody else wants to try it. There are several commented-out sections to switch between the various tests.
I really do not want to pile on, but currently the loop at the top of recv which boils down to
while !backoff.is_completed() { backoff.snooze(); }
takes ~70-90 microseconds on recent Intel CPU, which is 20-50 greater than the cost of context switch. This really seems a bit too much for polling an empty queue.
This is from a profile from tantivy-cli. In the profile it seems most time is spent in the spin:
for _ in 0..1 << self.step.get() {
// TODO(taiki-e): once we bump the minimum required Rust version to 1.49+,
// use [`core::hint::spin_loop`] instead.
#[allow(deprecated)]
atomic::spin_loop_hint();
}
The send flow is using mpmc channels with 100 capacity. mpmc, since the user can configure the number of threads, which in this case is set to 1. Indexing is slower, so senders have to wait typically.
stdio lines -> 1 thread parse JSON -> 1 thread index in tantivy
perf record --call-graph dwarf
perf report --stdio
46.78% 0.00% tantivy libc.so.6 [.] 0x00007f98910037cb
|
---0x7f98910037cb
0x7f9890f7f9ea
std::sys::unix::thread::Thread::new::thread_start
|
|--24.35%--core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h1f6f2d69b5f287ba
| std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::h73769fc82e80e4bb (inlined)
| std::panic::catch_unwind (inlined)
| std::panicking::try (inlined)
| std::panicking::try::do_call (inlined)
| <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once (inlined)
| std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h93815b136742a096 (inlined)
| std::sys_common::backtrace::__rust_begin_short_backtrace (inlined)
| tantivy::commands::index::run_index::_$u7b$$u7b$closure$u7d$$u7d$::h3facae76dbfd38bf (inlined)
| |
| |--16.98%--crossbeam_channel::channel::Sender<T>::send
| | |
| | --16.93%--crossbeam_channel::flavors::array::Channel<T>::send
| | |
| | |--12.35%--crossbeam_utils::backoff::Backoff::snooze (inlined)
| | | |
| | | |--10.68%--core::iter::range::<impl core::iter::traits::iterator::Iterator for core::ops::range::Range<A>>::next (inlined)
| | | | <core::ops::range::Range<T> as core::iter::range::RangeIteratorImpl>::spec_next (inlined)
| | | | |
| | | | --10.05%--core::cmp::impls::<impl core::cmp::PartialOrd for i32>::lt (inlined)
| | | |
| | | --1.64%--__sched_yield
| | | |
| | | --1.42%--entry_SYSCALL_64_after_hwframe
| | | |
| | | --1.40%--do_syscall_64
| | | |
| | | --1.24%--__x64_sys_sched_yield
| | | |
| | | --1.01%--schedule
| | | |
| | | --0.95%--__schedule
| | | |
| | | --0.54%--pick_next_task_fair
| | |
| | |--3.15%--crossbeam_channel::context::Context::with (inlined)
| | | |
| | | --3.15%--std::thread::local::LocalKey<T>::try_with (inlined)
| | | |
| | | --3.14%--crossbeam_channel::context::Context::with::_$u7b$$u7b$closure$u7d$$u7d$::hd64289cd2a2278bd (inlined)
| | | |
| | | --3.12%--crossbeam_channel::context::Context::with::_$u7b$$u7b$closure$u7d$$u7d$::h0b76ad33a3783422
| | | |
| | | --3.12%--crossbeam_channel::flavors::array::Channel$LT$T$GT$::send::_$u7b$$u7b$closure$u7d$$u7d$::h00a9cec38e2e6332 (inlined)
| | | |
| | | --3.00%--crossbeam_channel::context::Context::wait_until
| | | |
| | | --2.42%--crossbeam_utils::backoff::Backoff::snooze (inlined)
| | | |
| | | --2.24%--core::sync::atomic::spin_loop_hint (inlined)
| | | core::hint::spin_loop (inlined)
| | | core::core_arch::x86::sse2::_mm_pause (inlined)
| | |
| | --1.32%--crossbeam_channel::flavors::array::Channel<T>::start_send (inlined)
| |
| |--6.45%--tantivy::index_document::IndexDocument::parse_json
| | |
| | --6.03%--tantivy::index_document::to_value (inlined)
| | |
| | |--4.22%--simd_json::Deserializer::from_slice
| | | |
| | | --3.68%--simd_json::Deserializer::from_slice_with_buffers
| | | |
| | | |--2.45%--simd_json::Deserializer::find_structural_bits (inlined)
| | | | |
| | | | --2.45%--simd_json::Deserializer::find_structural_bits_avx
| | | | |
| | | | --2.44%--simd_json::Deserializer::_find_structural_bits (inlined)
| | | | |
| | | | --1.81%--<simd_json::avx2::stage1::SimdInputAVX as simd_json::Stage1Parse<core::core_arch::x86::__m256i>>::flatten_bits >
| | | | |
| | | | --1.51%--alloc::vec::Vec<T,A>::reserve (inlined)
| | | | alloc::raw_vec::RawVec<T,A>::reserve (inlined)
| | | | |
| | | | --1.44%--alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle
| | | | |
| | | | --1.44%--alloc::raw_vec::RawVec<T,A>::grow_amortized (inlined)
| | | | |
| | | | --1.42%--alloc::raw_vec::finish_grow (inlined)
| | | | |
| | | | --1.41%--<alloc::alloc::Global as core::alloc::Allocator>::grow (>
| | | | alloc::alloc::Global::grow_impl (inlined)
| | | | alloc::alloc::realloc (inlined)
| | | | |
| | | | --1.40%--realloc
| | | |
| | | --1.08%--simd_json::stage2::<impl simd_json::Deserializer>::build_tape (inlined)
| | |
| | --1.65%--tantivy::index_document::BorrowDeserializer::into_flattened (inlined)
| | |
| | --1.57%--tantivy::index_document::BorrowDeserializer::parse
| | |
| | --1.56%--tantivy::index_document::BorrowDeserializer::parse_map (inlined)
| | |
| | | | --1.40%--realloc
| | | |
| | | --1.08%--simd_json::stage2::<impl simd_json::Deserializer>::build_tape (inlined)
| | |
| | --1.65%--tantivy::index_document::BorrowDeserializer::into_flattened (inlined)
| | |
| | --1.57%--tantivy::index_document::BorrowDeserializer::parse
| | |
| | --1.56%--tantivy::index_document::BorrowDeserializer::parse_map (inlined)
| | |
| | |--0.75%--tantivy::index_document::BorrowDeserializer::parse
| | | |
| | | --0.71%--tantivy::index_document::BorrowDeserializer::parse::_$u7b$$u7b$closure$u7d$$u7d$::he8db8863125cdbbb
| | | |
| | | --0.55%--alloc::vec::Vec<T,A>::push (inlined)
| | | |
| | | --0.53%--alloc::raw_vec::RawVec<T,A>::reserve_for_push
| | | |
| | | --0.53%--alloc::raw_vec::RawVec<T,A>::grow_amortized (inlined)
| | | |
| | | --0.52%--alloc::raw_vec::finish_grow (inlined)
| | | |
| | | --0.50%--malloc
| | |
| | --0.56%--tantivy::schema::schema::Schema::get_field
| |
| --0.63%--<crossbeam_channel::channel::IntoIter<T> as core::iter::traits::iterator::Iterator>::next
| |
| --0.62%--crossbeam_channel::channel::Receiver<T>::recv (inlined)
| |
| --0.61%--crossbeam_channel::flavors::array::Channel<T>::recv
|
--22.43%--core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h989fd13cc428f62c
std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::h38d9a4e86260e253 (inlined)
|
--22.43%--std::panic::catch_unwind (inlined)
std::panicking::try (inlined)
std::panicking::try::do_call (inlined)
<core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once (inlined)
std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h7e3f4df267158938 (inlined)
std::sys_common::backtrace::__rust_begin_short_backtrace (inlined)
tantivy::commands::index::run_index::_$u7b$$u7b$closure$u7d$$u7d$::h6a107a0b11f61b63 (inlined)
|
|--20.09%--crossbeam_channel::channel::Sender<T>::send
| |
| |
| --0.61%--crossbeam_channel::flavors::array::Channel<T>::recv
|
--22.43%--core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h989fd13cc428f62c
std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::h38d9a4e86260e253 (inlined)
|
--22.43%--std::panic::catch_unwind (inlined)
std::panicking::try (inlined)
std::panicking::try::do_call (inlined)
<core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once (inlined)
std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h7e3f4df267158938 (inlined)
std::sys_common::backtrace::__rust_begin_short_backtrace (inlined)
tantivy::commands::index::run_index::_$u7b$$u7b$closure$u7d$$u7d$::h6a107a0b11f61b63 (inlined)
|
|--20.09%--crossbeam_channel::channel::Sender<T>::send
| |
| --20.04%--crossbeam_channel::flavors::array::Channel<T>::send
| |
| |--13.61%--crossbeam_utils::backoff::Backoff::snooze (inlined)
| | |
| | |--11.13%--core::iter::range::<impl core::iter::traits::iterator::Iterator for core::ops::range::Range<A>>::next (inlined)
| | | <core::ops::range::Range<T> as core::iter::range::RangeIteratorImpl>::spec_next (inlined)
tantivy::commands::index::run_index::_$u7b$$u7b$closure$u7d$$u7d$::h6a107a0b11f61b63 (inlined)
|
|--20.09%--crossbeam_channel::channel::Sender<T>::send
| |
| --20.04%--crossbeam_channel::flavors::array::Channel<T>::send
| |
| |--13.61%--crossbeam_utils::backoff::Backoff::snooze (inlined)
| | |
| | |--11.13%--core::iter::range::<impl core::iter::traits::iterator::Iterator for core::ops::range::Range<A>>::next (inlined)
| | | <core::ops::range::Range<T> as core::iter::range::RangeIteratorImpl>::spec_next (inlined)
| | | |
| | | --10.48%--core::cmp::impls::<impl core::cmp::PartialOrd for i32>::lt (inlined)
| | |
| | --2.43%--__sched_yield