futures-rs icon indicating copy to clipboard operation
futures-rs copied to clipboard

`AtomicWaker::take` performance issue

Open wyfo opened this issue 11 months ago • 1 comments

Currently, AtomicWaker::take does a lot of work for nothing when there is no registered waker. It causes performance issue in a contended environment where the function is called a lot.

I've designed a new algorithm to optimize this function, and compared the performance. A simple micro-benchmarks give a huge x100 improvement:

  • futures::task::AtomicWaker 6.585227ms
  • crate::AtomicWaker 65.72µs

Benchmark code:

let waker = AtomicWaker::new(); // no registered waker
let start = std::Instant::now();
(0..100000).into_par_iter().for_each(|_| waker.wake()); // use rayon for contention
println!("{:?}", start.elapsed());

These results can easily be explained: no waker registered means a single AtomicU8::load with the new algorithm. The function is less than 20 assembly instructions (on my computer), so I've added #[inline], but even without inlining, it's still around 100µs, way faster than the current implementation.

For the context, I'm developping an MPSC queue, where each enqueuing operation notify the consumer. AtomicWaker seems to fit for the consumer, but, as stated above, the performance impact is huge. In fact, I can't even use AtomicWaker because my code also support synchronous blocking operations, see my proposal https://internals.rust-lang.org/t/thread-park-waker-a-waker-calling-thread-unpark/19114. But, as I've implemented this new algorithm (generic in my case, to handle Thread::unpark), I thought it would be a good idea to propose it for AtomicWaker optimization.

Disclaimer: This algorithm may be bugged, memory order may be improvable, I'm not an expert, and I have not take the time for now to test it with loom. However, I've already tested in "real" situation, in my queue benchmarks, and it seems to work well.

EDIT: The original implementation is here

wyfo avatar Jul 09 '23 10:07 wyfo