crossbeam crossbeam-channel uses pure userspace spinlocks, which perform very poorly in some situations

crossbeam-channel appears to use pure spinlocks (which never block in the kernel), from userspace, which is a bad idea. Looking at the behavior with strace, it appears to be sched_yielding in a loop.

On my Linux system (Debian 5.15 kernel, standard configuration) with a simple little benchmark I wrote, I see this:

brian[13116] rico (master) ~/Desktop/crossbeam
$ cargo run --manifest-path crossbeam-channel/benchmarks/Cargo.toml --bin crossbeam-channel
[warnings because I commented stuff out elided]
warning: `benchmarks` (bin "crossbeam-channel") generated 7 warnings
    Finished dev [unoptimized + debuginfo] target(s) in 0.06s
     Running `target/debug/crossbeam-channel`
unbounded_spsc_sync       Rust crossbeam-channel   4.473 sec
brian[13117] rico (master) ~/Desktop/crossbeam
$ taskset 1 cargo run --manifest-path crossbeam-channel/benchmarks/Cargo.toml --bin crossbeam-channel
[warnings because I commented stuff out elided]
warning: `benchmarks` (bin "crossbeam-channel") generated 7 warnings
    Finished dev [unoptimized + debuginfo] target(s) in 0.06s
     Running `target/debug/crossbeam-channel`
unbounded_spsc_sync       Rust crossbeam-channel 106.810 sec

It takes over 20x longer with both threads pinned to the same CPU, because each thread spins until its scheduler tick expires repeatedly. In my full application, I'm using SCHED_RR in one thread, which makes this a full-on deadlock because it will never yield and let the other thread run.

I'm running https://github.com/bsilver8192/crossbeam/tree/crossbeam-channel-spin-benchmark, which is recent master (450d237d1546ac135d21f75b61fdf5a944e8421c) + my benchmark.

This seems related to #366 and #447.

https://matklad.github.io/2020/01/02/spinlocks-considered-harmful.html goes into more detail about why this is not a good implementation. It links to https://matklad.github.io/2020/01/04/mutexes-are-faster-than-spinlocks.html which has some performance numbers indicating that this can be fixed with practically no cost in any case, and greatly increased speeds in some cases. It also links to https://probablydance.com/2019/12/30/measuring-mutexes-spinlocks-and-how-bad-the-linux-scheduler-really-is/ which shows some examples of just how bad this can be.

May 04 '22 02:05 bsilver8192

Spinlock was replaced with Mutex from std in #835, but that somehow has not improved things at all. Old results on my laptop (from https://github.com/bsilver8192/crossbeam/tree/crossbeam-channel-spin-benchmark):

$ cargo run --manifest-path crossbeam-channel/benchmarks/Cargo.toml --bin crossbeam-channel
unbounded_spsc_sync       Rust crossbeam-channel   4.685 sec
$ taskset 1 cargo run --manifest-path crossbeam-channel/benchmarks/Cargo.toml --bin crossbeam-channel
unbounded_spsc_sync       Rust crossbeam-channel  66.005 sec

New results (same benchmark, but rebased on latest master at the moment (70700182c4c92c393fc76209c7acad7af69dca21)):

$ cargo run --manifest-path crossbeam-channel/benchmarks/Cargo.toml --bin crossbeam-channel
unbounded_spsc_sync       Rust crossbeam-channel   4.562 sec
$ taskset 1 cargo run --manifest-path crossbeam-channel/benchmarks/Cargo.toml --bin crossbeam-channel
unbounded_spsc_sync       Rust crossbeam-channel  65.679 sec

Inspected with strace, both versions look mostly the same (a lot of sched_yield() with some futex(...) calls mixed in). I am using rustc 1.64.0-beta.3 (82bf34178 2022-08-18).

Aug 31 '22 13:08 yjhn

Running the benchmarks (both old and new) using taskset 1 with gdb and setting a breakpoint on sched_yield shows that most calls to sched_yield() are from Backoff::snooze(). Backoff is used in some of channel's internal data structures. This somewhat explains why there is almost no improvement from replacing the lock.

Aug 31 '22 14:08 yjhn

I can confirm this causes severe cpu overhead when using channel as a low throughput spsc channel, the spin locks eat away the cpu.

If you look at the channel ::recv source code you can tell it uses Backoff itself and Context::wait_until uses it again.

Furthermore Backoff seems to spin a bit too much before yielding to the scheduler? For comparison parking lot spin exponent is 3 vs crossbeam 6.

This is a profiling of a real application

Nov 10 '22 10:11 arthurprs

I think this stems from (not ideal) micro benchmarking and benchmark marketing. By reducing spinning most channel benchmarks regress, because they are often running non-contented, with #actors <= #cpus available, and/or with no work between send/recv.

But if you make it contented (e.g. using taskset) or have significant work outside of recv/send calls the situation often reverses.

Nov 11 '22 10:11 arthurprs

According to https://github.com/rust-lang/rust/pull/93563 by @ibraheemdev, is the real problem here not the unbounded spinning in unbounded queue (i.e., wait_write pointed by #366), but the amount of spin_loop_hint and yield_now and the fact that it is being called even where it is not needed?

Nov 20 '22 08:11 taiki-e

With the changes from std applied, which removes all extra spinning on recv, these are the new results on my machine:

before:
unbounded_spsc_sync       Rust crossbeam-channel   5.869 sec

before (taskset 1):
unbounded_spsc_sync       Rust crossbeam-channel  52.105 sec

after:
unbounded_spsc_sync       Rust crossbeam-channel  30.612 sec

after (taskset 1):
unbounded_spsc_sync       Rust crossbeam-channel  31.899 sec

Removing the spinning seems to affect throughput in this benchmark by a lot (which makes sense considering the benchmark), but as you can see taskset 1 had almost no effect. So it does look like the issue here is the extra spinning, not the blocking queue implementation.

Nov 20 '22 18:11 ibraheemdev

I can confirm that std::sync::mpsc in Rust nightly 1.67.0-nightly (c5d82ed7a 2022-11-19) does not have this problem, and crossbeam-channel 0.5.6 does still have the worse performance. Also, crossbeam-channel 0.5.6 has already fixed the deadlock I originally posted.

My timing numbers (running on a different machine than before, still amd64 Linux, only one timed run of each without any care to reduce outside interference so don't trust small differences):

	unpinned	pinned	pinned equal realtime	pinned unequal realtime
crossbeam-channel 450d237d1546ac135d21f75b61fdf5a944e8421c	2.505	63.061	66.994	deadlock
crossbeam-channel 0.5.6	2.454	63.629	66.208	102.530
Rust 1.65.0	17.778	11.566	11.292	8.133
Rust 1.67.0-nightly (c5d82ed7a 2022-11-19)	20.636	16.873	16.151	34.160

Some cautions about my test code, before I post it:

Do NOT run it on a single core machine, it WILL wedge your terminal.
It has some chance of breaking device drivers, with a slim chance of that leading to hardware damage. Rebooting to get certain devices working again is a very real possibility.
Running it on a CONFIG_PREEMPT_RT kernel is probably going to create problems.
Don't do it on a core that handles important interrupts, in my experience core 0 is usually the only one with dedicated interrupts but it depends.

I posted my (somewhat hacky) test code at https://github.com/bsilver8192/mpsc-spin-test if anybody else wants to try it. There are several commented-out sections to switch between the various tests.

Nov 20 '22 22:11 bsilver8192

I really do not want to pile on, but currently the loop at the top of recv which boils down to while !backoff.is_completed() { backoff.snooze(); } takes ~70-90 microseconds on recent Intel CPU, which is 20-50 greater than the cost of context switch. This really seems a bit too much for polling an empty queue.

Jan 12 '23 22:01 OlegOAndreev

This is from a profile from tantivy-cli. In the profile it seems most time is spent in the spin:

            for _ in 0..1 << self.step.get() {
                // TODO(taiki-e): once we bump the minimum required Rust version to 1.49+,
                // use [`core::hint::spin_loop`] instead.
                #[allow(deprecated)]
                atomic::spin_loop_hint();
            }

The send flow is using mpmc channels with 100 capacity. mpmc, since the user can configure the number of threads, which in this case is set to 1. Indexing is slower, so senders have to wait typically.

stdio lines -> 1 thread parse JSON -> 1 thread index in tantivy

perf record --call-graph dwarf 
perf report --stdio

    46.78%     0.00%  tantivy          libc.so.6             [.] 0x00007f98910037cb
            |
            ---0x7f98910037cb
               0x7f9890f7f9ea
               std::sys::unix::thread::Thread::new::thread_start
               |          
               |--24.35%--core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h1f6f2d69b5f287ba
               |          std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::h73769fc82e80e4bb (inlined)
               |          std::panic::catch_unwind (inlined)
               |          std::panicking::try (inlined)
               |          std::panicking::try::do_call (inlined)
               |          <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once (inlined)
               |          std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h93815b136742a096 (inlined)
               |          std::sys_common::backtrace::__rust_begin_short_backtrace (inlined)
               |          tantivy::commands::index::run_index::_$u7b$$u7b$closure$u7d$$u7d$::h3facae76dbfd38bf (inlined)
               |          |          
               |          |--16.98%--crossbeam_channel::channel::Sender<T>::send
               |          |          |          
               |          |           --16.93%--crossbeam_channel::flavors::array::Channel<T>::send
               |          |                     |          
               |          |                     |--12.35%--crossbeam_utils::backoff::Backoff::snooze (inlined)
               |          |                     |          |          
               |          |                     |          |--10.68%--core::iter::range::<impl core::iter::traits::iterator::Iterator for core::ops::range::Range<A>>::next (inlined)
               |          |                     |          |          <core::ops::range::Range<T> as core::iter::range::RangeIteratorImpl>::spec_next (inlined)
               |          |                     |          |          |          
               |          |                     |          |           --10.05%--core::cmp::impls::<impl core::cmp::PartialOrd for i32>::lt (inlined)
               |          |                     |          |          
               |          |                     |           --1.64%--__sched_yield
               |          |                     |                     |          
               |          |                     |                      --1.42%--entry_SYSCALL_64_after_hwframe
               |          |                     |                                |          
               |          |                     |                                 --1.40%--do_syscall_64
               |          |                     |                                           |          
               |          |                     |                                            --1.24%--__x64_sys_sched_yield
               |          |                     |                                                      |          
               |          |                     |                                                       --1.01%--schedule
               |          |                     |                                                                 |          
               |          |                     |                                                                  --0.95%--__schedule
               |          |                     |                                                                            |          
               |          |                     |                                                                             --0.54%--pick_next_task_fair
               |          |                     |          
               |          |                     |--3.15%--crossbeam_channel::context::Context::with (inlined)
               |          |                     |          |          
               |          |                     |           --3.15%--std::thread::local::LocalKey<T>::try_with (inlined)
               |          |                     |                     |          
               |          |                     |                      --3.14%--crossbeam_channel::context::Context::with::_$u7b$$u7b$closure$u7d$$u7d$::hd64289cd2a2278bd (inlined)
               |          |                     |                                |          
               |          |                     |                                 --3.12%--crossbeam_channel::context::Context::with::_$u7b$$u7b$closure$u7d$$u7d$::h0b76ad33a3783422
               |          |                     |                                           |          
               |          |                     |                                            --3.12%--crossbeam_channel::flavors::array::Channel$LT$T$GT$::send::_$u7b$$u7b$closure$u7d$$u7d$::h00a9cec38e2e6332 (inlined)
               |          |                     |                                                      |          
               |          |                     |                                                       --3.00%--crossbeam_channel::context::Context::wait_until
               |          |                     |                                                                 |          
               |          |                     |                                                                  --2.42%--crossbeam_utils::backoff::Backoff::snooze (inlined)
               |          |                     |                                                                            |          
               |          |                     |                                                                             --2.24%--core::sync::atomic::spin_loop_hint (inlined)
               |          |                     |                                                                                       core::hint::spin_loop (inlined)
               |          |                     |                                                                                       core::core_arch::x86::sse2::_mm_pause (inlined)
               |          |                     |          
               |          |                      --1.32%--crossbeam_channel::flavors::array::Channel<T>::start_send (inlined)
               |          |          
               |          |--6.45%--tantivy::index_document::IndexDocument::parse_json
               |          |          |          
               |          |           --6.03%--tantivy::index_document::to_value (inlined)
               |          |                     |          
               |          |                     |--4.22%--simd_json::Deserializer::from_slice
               |          |                     |          |          
               |          |                     |           --3.68%--simd_json::Deserializer::from_slice_with_buffers
               |          |                     |                     |          
               |          |                     |                     |--2.45%--simd_json::Deserializer::find_structural_bits (inlined)
               |          |                     |                     |          |          
               |          |                     |                     |           --2.45%--simd_json::Deserializer::find_structural_bits_avx
               |          |                     |                     |                     |          
               |          |                     |                     |                      --2.44%--simd_json::Deserializer::_find_structural_bits (inlined)
               |          |                     |                     |                                |          
               |          |                     |                     |                                 --1.81%--<simd_json::avx2::stage1::SimdInputAVX as simd_json::Stage1Parse<core::core_arch::x86::__m256i>>::flatten_bits >
               |          |                     |                     |                                           |          
               |          |                     |                     |                                            --1.51%--alloc::vec::Vec<T,A>::reserve (inlined)
               |          |                     |                     |                                                      alloc::raw_vec::RawVec<T,A>::reserve (inlined)
               |          |                     |                     |                                                      |          
               |          |                     |                     |                                                       --1.44%--alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle
               |          |                     |                     |                                                                 |          
               |          |                     |                     |                                                                  --1.44%--alloc::raw_vec::RawVec<T,A>::grow_amortized (inlined)
               |          |                     |                     |                                                                            |          
               |          |                     |                     |                                                                             --1.42%--alloc::raw_vec::finish_grow (inlined)
               |          |                     |                     |                                                                                       |          
               |          |                     |                     |                                                                                        --1.41%--<alloc::alloc::Global as core::alloc::Allocator>::grow (>
               |          |                     |                     |                                                                                                  alloc::alloc::Global::grow_impl (inlined)
               |          |                     |                     |                                                                                                  alloc::alloc::realloc (inlined)
               |          |                     |                     |                                                                                                  |          
               |          |                     |                     |                                                                                                   --1.40%--realloc
               |          |                     |                     |          
               |          |                     |                      --1.08%--simd_json::stage2::<impl simd_json::Deserializer>::build_tape (inlined)
               |          |                     |          
               |          |                      --1.65%--tantivy::index_document::BorrowDeserializer::into_flattened (inlined)
               |          |                                |          
               |          |                                 --1.57%--tantivy::index_document::BorrowDeserializer::parse
               |          |                                           |          
               |          |                                            --1.56%--tantivy::index_document::BorrowDeserializer::parse_map (inlined)
               |          |                                                      |          
               |          |                     |                     |                                                                                                   --1.40%--realloc
               |          |                     |                     |          
               |          |                     |                      --1.08%--simd_json::stage2::<impl simd_json::Deserializer>::build_tape (inlined)
               |          |                     |          
               |          |                      --1.65%--tantivy::index_document::BorrowDeserializer::into_flattened (inlined)
               |          |                                |          
               |          |                                 --1.57%--tantivy::index_document::BorrowDeserializer::parse
               |          |                                           |          
               |          |                                            --1.56%--tantivy::index_document::BorrowDeserializer::parse_map (inlined)
               |          |                                                      |          
               |          |                                                      |--0.75%--tantivy::index_document::BorrowDeserializer::parse
               |          |                                                      |          |          
               |          |                                                      |           --0.71%--tantivy::index_document::BorrowDeserializer::parse::_$u7b$$u7b$closure$u7d$$u7d$::he8db8863125cdbbb
               |          |                                                      |                     |          
               |          |                                                      |                      --0.55%--alloc::vec::Vec<T,A>::push (inlined)
               |          |                                                      |                                |          
               |          |                                                      |                                 --0.53%--alloc::raw_vec::RawVec<T,A>::reserve_for_push
               |          |                                                      |                                           |          
               |          |                                                      |                                            --0.53%--alloc::raw_vec::RawVec<T,A>::grow_amortized (inlined)
               |          |                                                      |                                                      |          
               |          |                                                      |                                                       --0.52%--alloc::raw_vec::finish_grow (inlined)
               |          |                                                      |                                                                 |          
               |          |                                                      |                                                                  --0.50%--malloc
               |          |                                                      |          
               |          |                                                       --0.56%--tantivy::schema::schema::Schema::get_field
               |          |          
               |           --0.63%--<crossbeam_channel::channel::IntoIter<T> as core::iter::traits::iterator::Iterator>::next
               |                     |          
               |                      --0.62%--crossbeam_channel::channel::Receiver<T>::recv (inlined)
               |                                |          
               |                                 --0.61%--crossbeam_channel::flavors::array::Channel<T>::recv
               |          
                --22.43%--core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h989fd13cc428f62c
                          std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::h38d9a4e86260e253 (inlined)
                          |          
                           --22.43%--std::panic::catch_unwind (inlined)
                                     std::panicking::try (inlined)
                                     std::panicking::try::do_call (inlined)
                                     <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once (inlined)
                                     std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h7e3f4df267158938 (inlined)
                                     std::sys_common::backtrace::__rust_begin_short_backtrace (inlined)
                                     tantivy::commands::index::run_index::_$u7b$$u7b$closure$u7d$$u7d$::h6a107a0b11f61b63 (inlined)
                                     |          
                                     |--20.09%--crossbeam_channel::channel::Sender<T>::send
                                     |          |          
               |                                |          
               |                                 --0.61%--crossbeam_channel::flavors::array::Channel<T>::recv
               |          
                --22.43%--core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h989fd13cc428f62c
                          std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::h38d9a4e86260e253 (inlined)
                          |          
                           --22.43%--std::panic::catch_unwind (inlined)
                                     std::panicking::try (inlined)
                                     std::panicking::try::do_call (inlined)
                                     <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once (inlined)
                                     std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h7e3f4df267158938 (inlined)
                                     std::sys_common::backtrace::__rust_begin_short_backtrace (inlined)
                                     tantivy::commands::index::run_index::_$u7b$$u7b$closure$u7d$$u7d$::h6a107a0b11f61b63 (inlined)
                                     |          
                                     |--20.09%--crossbeam_channel::channel::Sender<T>::send
                                     |          |          
                                     |           --20.04%--crossbeam_channel::flavors::array::Channel<T>::send
                                     |                     |          
                                     |                     |--13.61%--crossbeam_utils::backoff::Backoff::snooze (inlined)
                                     |                     |          |          
                                     |                     |          |--11.13%--core::iter::range::<impl core::iter::traits::iterator::Iterator for core::ops::range::Range<A>>::next (inlined)
                                     |                     |          |          <core::ops::range::Range<T> as core::iter::range::RangeIteratorImpl>::spec_next (inlined)
                                     tantivy::commands::index::run_index::_$u7b$$u7b$closure$u7d$$u7d$::h6a107a0b11f61b63 (inlined)
                                     |          
                                     |--20.09%--crossbeam_channel::channel::Sender<T>::send
                                     |          |          
                                     |           --20.04%--crossbeam_channel::flavors::array::Channel<T>::send
                                     |                     |          
                                     |                     |--13.61%--crossbeam_utils::backoff::Backoff::snooze (inlined)
                                     |                     |          |          
                                     |                     |          |--11.13%--core::iter::range::<impl core::iter::traits::iterator::Iterator for core::ops::range::Range<A>>::next (inlined)
                                     |                     |          |          <core::ops::range::Range<T> as core::iter::range::RangeIteratorImpl>::spec_next (inlined)
                                     |                     |          |          |          
                                     |                     |          |           --10.48%--core::cmp::impls::<impl core::cmp::PartialOrd for i32>::lt (inlined)
                                     |                     |          |          
                                     |                     |           --2.43%--__sched_yield

Oct 24 '23 04:10 PSeitz

crossbeam crossbeam copied to clipboard

crossbeam-channel uses pure userspace spinlocks, which perform very poorly in some situations

crossbeam
crossbeam copied to clipboard