
Static Executors

james7132 opened this pull request 10 months ago • 13 comments

Resolves #111. Creates a StaticExecutor type under a feature flag and allows constructing it from an Executor via Executor::leak. Unlike the executor it came from, it's a wrapper around the internal State and omits all updates to active, the bookkeeping of spawned tasks.

Note, unlike the API proposed in #111, this PR also includes an unsafe StaticExecutor::spawn_scoped for spawning non-'static tasks, where the caller is responsible for ensuring that the task doesn't outlive the borrowed state. This would be required for Bevy to migrate to this type: we're currently using lifetime transmutation on Executor to enable Thread::scope-like APIs for working with borrowed state, and since StaticExecutor has no external lifetime parameter, that approach is infeasible without such an API.
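
For a sense of how this would be used, here is a rough sketch based only on the description above; the exact shapes of leak, spawn, and spawn_scoped are assumptions and may differ from what finally lands:

use async_executor::Executor;
use futures_lite::future;

fn main() {
    // Intentionally leak a regular Executor to get an executor that lives
    // for the rest of the program and skips the `active` bookkeeping.
    let executor = Executor::new().leak();

    // Ordinary spawning of a 'static task works as before.
    let task = executor.spawn(async { 1 + 2 });
    assert_eq!(future::block_on(executor.run(task)), 3);

    // The unsafe scoped variant spawns a non-'static future; the caller must
    // guarantee the task finishes before the borrowed data goes away, which
    // this sketch does by driving it to completion immediately.
    let mut counter = 0;
    let scoped = unsafe { executor.spawn_scoped(async { counter += 1 }) };
    future::block_on(executor.run(scoped));
    assert_eq!(counter, 1);
}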

The performance gains while using the type are substantial:

single_thread/executor::spawn_one
                        time:   [1.6157 µs 1.6238 µs 1.6362 µs]
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
single_thread/executor::spawn_batch
                        time:   [28.169 µs 29.650 µs 32.196 µs]
Found 19 outliers among 100 measurements (19.00%)
  10 (10.00%) low severe
  3 (3.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe
single_thread/executor::spawn_many_local
                        time:   [6.1952 ms 6.2230 ms 6.2578 ms]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe
single_thread/executor::spawn_recursively
                        time:   [50.202 ms 50.479 ms 50.774 ms]
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe
single_thread/executor::yield_now
                        time:   [5.8795 ms 5.8883 ms 5.8977 ms]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

multi_thread/executor::spawn_one
                        time:   [1.2565 µs 1.2979 µs 1.3470 µs]
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe
multi_thread/executor::spawn_batch
                        time:   [38.009 µs 43.693 µs 52.882 µs]
Found 22 outliers among 100 measurements (22.00%)
  21 (21.00%) high mild
  1 (1.00%) high severe
Benchmarking multi_thread/executor::spawn_many_local: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 386.6s, or reduce sample count to 10.
multi_thread/executor::spawn_many_local
                        time:   [27.492 ms 27.652 ms 27.814 ms]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
Benchmarking multi_thread/executor::spawn_recursively: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 16.6s, or reduce sample count to 30.
multi_thread/executor::spawn_recursively
                        time:   [165.82 ms 166.04 ms 166.26 ms]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
multi_thread/executor::yield_now
                        time:   [22.469 ms 22.649 ms 22.798 ms]
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) low severe
  3 (3.00%) low mild

single_thread/leaked_executor::spawn_one
                        time:   [1.4717 µs 1.4778 µs 1.4832 µs]
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) low severe
  2 (2.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe
single_thread/leaked_executor::spawn_many_local
                        time:   [4.2622 ms 4.3065 ms 4.3489 ms]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) low mild
single_thread/leaked_executor::spawn_recursively
                        time:   [26.566 ms 26.899 ms 27.228 ms]
single_thread/leaked_executor::yield_now
                        time:   [5.7200 ms 5.7270 ms 5.7342 ms]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

multi_thread/leaked_executor::spawn_one
                        time:   [1.3755 µs 1.4321 µs 1.4892 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
multi_thread/leaked_executor::spawn_many_local
                        time:   [4.1838 ms 4.2394 ms 4.2989 ms]
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild
multi_thread/leaked_executor::spawn_recursively
                        time:   [43.074 ms 43.159 ms 43.241 ms]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild
multi_thread/leaked_executor::yield_now
                        time:   [23.210 ms 23.257 ms 23.302 ms]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild

james7132 commented Apr 11 '24 07:04

Hmmm, miri clearly does not like this even though the leak is intentional. Not sure how to tackle this without wholesale disabling leak detection.

james7132 commented Apr 11 '24 07:04

Another alternative is to follow through with the repr(transparent) use detailed in #111, and instead allow LeakedExecutor to be const-constructible (would require ConcurrentQueue::unbounded to be const). This way it could be used in a static.

james7132 commented Apr 11 '24 08:04

It took me a second to wrap my head around what this does conceptually, but now that I get it I think this is an interesting idea as an optimization. I have no opinion on the actual impl, so I'll defer to someone else to review.

As perhaps a tiny bikeshed point to raise: when I saw LeakedExecutor it initially threw me a little, as leaking is generally undesirable, and I wasn't sure why we would want to leak an executor?

However, I think we could reframe what this does as: "it enables an executor to exist for the duration of the entire program". This is something we often call "static" (e.g. 'static, static, lazy_static!), so I wonder if maybe the name StaticExecutor might work? It's a genuine question; I'm not yet convinced this is better per se, but I wanted to raise the option for discussion.

yoshuawuyts commented Apr 12 '24 11:04

Can we ensure/check that the LeakedExecutor methods are correctly inlined (which isn't directly clear when they contain await and such)?

fogti commented Apr 12 '24 13:04

Can we ensure/check that the LeakedExecutor methods are correctly inlined (which isn't directly clear when they contain await and such)?

I don't know how this could be automatically checked, or if it really matters, given that the benchmarks seem to be faster than normal anyway.

notgull commented Apr 12 '24 23:04

Another alternative is to follow through with the repr(transparent) use detailed in #111, and instead allow LeakedExecutor to be const-constructible (would require ConcurrentQueue::unbounded to be const). This way it could be used in a static.

Any thoughts on this construction instead? It would let a StaticExecutor live directly in a static variable, or a StaticLocalExecutor in a thread_local!, without the overhead of a OnceLock.
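
For comparison, this is roughly what the const-construction alternative would enable. A const StaticExecutor::new (and a StaticLocalExecutor counterpart) is the thing being proposed here, so treat the sketch as an assumption rather than existing API:

use async_executor::StaticExecutor;
use futures_lite::future;

// With a const constructor the executor can live directly in a `static`
// (or a StaticLocalExecutor in a `thread_local!`), with no OnceLock needed.
static EXECUTOR: StaticExecutor = StaticExecutor::new();

fn main() {
    let task = EXECUTOR.spawn(async { 21 * 2 });
    assert_eq!(future::block_on(EXECUTOR.run(task)), 42);
}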

james7132 commented Apr 13 '24 06:04

Re my comment: I don't think it could be automatically checked, but maybe we should still inspect the generated assembly for obvious unnecessary indirections (just because it is already much faster doesn't mean there aren't other easy steps to make it even better...).

fogti commented Apr 13 '24 10:04

@fogti we could desugar the user-facing versions of it and force a #[inline(always)] annotation on the State implementations for the functions. It definitely looks worse in terms of user-facing documentation, though I don't think it's a breaking change to do so:

// before
pub async fn run<T>(&self, fut: impl Future<Output = T>) -> T;
// after
pub fn run<'r, T: 'r>(&'r self, future: impl Future<Output = T> + 'r) -> impl Future<Output = T> + 'r;
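
To make that concrete, here is a self-contained sketch of the desugaring pattern; Inner stands in for the crate's private State type, and none of this is the PR's actual code:

use std::future::Future;
use futures_lite::future;

// Stand-in for async-executor's private `State`; the real implementation
// would carry `#[inline(always)]` on methods like this one.
struct Inner;

impl Inner {
    #[inline(always)]
    async fn run<T>(&self, future: impl Future<Output = T>) -> T {
        future.await
    }
}

pub struct Facade {
    inner: Inner,
}

impl Facade {
    // The public method is no longer an `async fn`; it only forwards the
    // future, leaving a trivial body for `#[inline(always)]` to act on.
    #[inline(always)]
    pub fn run<'r, T: 'r>(
        &'r self,
        future: impl Future<Output = T> + 'r,
    ) -> impl Future<Output = T> + 'r {
        self.inner.run(future)
    }
}

fn main() {
    let facade = Facade { inner: Inner };
    assert_eq!(future::block_on(facade.run(async { 21 * 2 })), 42);
}

The user-visible signature gets noisier, as noted above, but the observable behavior is unchanged.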

james7132 commented Apr 15 '24 17:04

@james7132 of course we can (and in the past I would've suggested exactly that), but it would be a good idea to at least check whether the async functions get properly inlined by current stable Rust when marked with #[inline(always)], because just because it was a problem in the past, it might not be anymore (which was suggested elsewhere a few months ago).

fogti commented Apr 15 '24 20:04

cargo bench --all --all-features currently fails. This is a hard blocker.

fogti commented Apr 15 '24 20:04

Hmm, are benchmarks not built in CI? Odd.

notgull commented Apr 16 '24 02:04

Some of the benchmarks appear to be a bit too small, and some have a very large variance, making them unsuitable for properly measuring performance in larger systems...

multi_thread/executor::spawn_batch
                        time:   [31.896 µs 35.826 µs 39.825 µs]
                        change: [-74.713% +27.283% +332.75%] (p = 0.84 > 0.05)
                        No change in performance detected.
perf report, top overhead
# Overhead  Command          Shared Object                    Symbol                                                                                                                                                       
# ........  ...............  ...............................  ....................................................
#
    26.19%  executor-6b7a70  executor-6b7a7073e5281258        [.] concurrent_queue::unbounded::Unbounded<T>::pop
     9.43%  executor-6b7a70  executor-6b7a7073e5281258        [.] concurrent_queue::unbounded::Unbounded<T>::push
     7.15%  executor-6b7a70  executor-6b7a7073e5281258        [.] criterion::stats::univariate::kde::Kde<A,K>::map
     5.37%  executor-6b7a70  executor-6b7a7073e5281258        [.] async_executor::Runner::runnable::{{closure}}
     4.09%  executor-6b7a70  executor-6b7a7073e5281258        [.] async_executor::State::notify
     3.73%  executor-6b7a70  executor-6b7a7073e5281258        [.] concurrent_queue::bounded::Bounded<T>::pop
     2.57%  executor-6b7a70  executor-6b7a7073e5281258        [.] std::sys::unix::locks::futex_mutex::Mutex::lock_contended
     2.33%  executor-6b7a70  executor-6b7a7073e5281258        [.] concurrent_queue::bounded::Bounded<T>::push_or_else
     2.19%  executor-6b7a70  executor-6b7a7073e5281258        [.] async_task::raw::RawTask<F,T,S,M>::run
     2.08%  executor-6b7a70  executor-6b7a7073e5281258        [.] <async_task::task::Task<T,M> as core::ops::drop::Drop>::drop
     1.90%  executor-6b7a70  executor-6b7a7073e5281258        [.] async_task::raw::RawTask<F,T,S,M>::run

fogti commented Apr 16 '24 11:04

I also noted the variance with spawn_batch, but that particular benchmark shouldn't be affected that strongly by this PR, beyond moving the try_tick/tick/run implementations to State. It's interesting to see the unbounded variants of the queue at the top, which corroborates potential gains from the previous attempt to directly enqueue onto local queues.

For a more realistic workload, I tested this against Bevy's many_foxes stress test and saw a 66% reduction in time spent spawning tasks in the system executor:

[screenshot: profiler comparison of task-spawning time in many_foxes]

james7132 commented Apr 16 '24 17:04

Once this is rebased it can be merged.

notgull commented May 12 '24 21:05