async-executor
Static Executors
Resolves #111. Creates a `StaticExecutor` type under a feature flag and allows constructing it from an `Executor` via `Executor::leak`. Unlike the executor it came from, it's a wrapper around a `State` and omits all changes to `active`.
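Conceptually, `Executor::leak` builds on the standard `Box::leak` pattern: give up the ability to ever free the state in exchange for a `'static` reference to it. A minimal sketch of that mechanism, using illustrative names (`State`, `StaticHandle`, and `leak` here are toy stand-ins, not the real async-executor items):

```rust
// Toy stand-in for the executor's internal state.
struct State {
    active: Vec<u32>, // placeholder for task bookkeeping
}

// Toy stand-in for a StaticExecutor-like wrapper around &'static State.
struct StaticHandle {
    state: &'static State,
}

fn leak(state: State) -> StaticHandle {
    // Box::leak intentionally never frees the allocation, so the
    // resulting reference is valid for the rest of the program.
    StaticHandle {
        state: Box::leak(Box::new(state)),
    }
}

fn main() {
    let handle = leak(State {
        active: vec![1, 2, 3],
    });
    // The handle can be freely copied around without any lifetime
    // parameter, which is the point of the static variant.
    assert_eq!(handle.state.active.len(), 3);
    println!("{}", handle.state.active.len());
}
```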
Note: unlike the API proposed in #111, this PR also includes an unsafe `StaticExecutor::spawn_scoped` for spawning non-`'static` tasks, where the caller is responsible for ensuring that the task doesn't outlive the borrowed state. This would be required for Bevy to migrate to this type; we're currently using lifetime transmutation on `Executor` to enable `Thread::scope`-like APIs for working with borrowed state. `StaticExecutor` does not have an external lifetime parameter, so this approach is infeasible without such an API.
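The safety contract behind such a scoped-spawn API boils down to an unchecked lifetime extension: the caller promises the borrow outlives the task. A hedged sketch of just that contract (`extend` below is a toy helper for illustration, not part of this PR):

```rust
// Illustrative only: extend a borrow to 'static. This is the kind of
// lifetime transmutation the PR description refers to. It is sound
// only if the caller guarantees the referent outlives every use of
// the returned reference -- exactly the spawn_scoped safety contract.
unsafe fn extend<'a, T>(r: &'a T) -> &'static T {
    std::mem::transmute(r)
}

fn main() {
    let data = vec![10, 20, 30];
    // SAFETY: `data` lives until the end of main, and every use of
    // `long` (the "task") completes before `data` is dropped.
    let long: &'static Vec<i32> = unsafe { extend(&data) };
    assert_eq!(long.iter().sum::<i32>(), 60);
    println!("{}", long.iter().sum::<i32>());
}
```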
The performance gains while using the type are substantial:
```
single_thread/executor::spawn_one
                        time:   [1.6157 µs 1.6238 µs 1.6362 µs]
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

single_thread/executor::spawn_batch
                        time:   [28.169 µs 29.650 µs 32.196 µs]
Found 19 outliers among 100 measurements (19.00%)
  10 (10.00%) low severe
  3 (3.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe

single_thread/executor::spawn_many_local
                        time:   [6.1952 ms 6.2230 ms 6.2578 ms]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

single_thread/executor::spawn_recursively
                        time:   [50.202 ms 50.479 ms 50.774 ms]
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

single_thread/executor::yield_now
                        time:   [5.8795 ms 5.8883 ms 5.8977 ms]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

multi_thread/executor::spawn_one
                        time:   [1.2565 µs 1.2979 µs 1.3470 µs]
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe

multi_thread/executor::spawn_batch
                        time:   [38.009 µs 43.693 µs 52.882 µs]
Found 22 outliers among 100 measurements (22.00%)
  21 (21.00%) high mild
  1 (1.00%) high severe

Benchmarking multi_thread/executor::spawn_many_local: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 386.6s, or reduce sample count to 10.
multi_thread/executor::spawn_many_local
                        time:   [27.492 ms 27.652 ms 27.814 ms]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild

Benchmarking multi_thread/executor::spawn_recursively: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 16.6s, or reduce sample count to 30.
multi_thread/executor::spawn_recursively
                        time:   [165.82 ms 166.04 ms 166.26 ms]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

multi_thread/executor::yield_now
                        time:   [22.469 ms 22.649 ms 22.798 ms]
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) low severe
  3 (3.00%) low mild

single_thread/leaked_executor::spawn_one
                        time:   [1.4717 µs 1.4778 µs 1.4832 µs]
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) low severe
  2 (2.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

single_thread/leaked_executor::spawn_many_local
                        time:   [4.2622 ms 4.3065 ms 4.3489 ms]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) low mild

single_thread/leaked_executor::spawn_recursively
                        time:   [26.566 ms 26.899 ms 27.228 ms]

single_thread/leaked_executor::yield_now
                        time:   [5.7200 ms 5.7270 ms 5.7342 ms]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

multi_thread/leaked_executor::spawn_one
                        time:   [1.3755 µs 1.4321 µs 1.4892 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

multi_thread/leaked_executor::spawn_many_local
                        time:   [4.1838 ms 4.2394 ms 4.2989 ms]
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild

multi_thread/leaked_executor::spawn_recursively
                        time:   [43.074 ms 43.159 ms 43.241 ms]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild

multi_thread/leaked_executor::yield_now
                        time:   [23.210 ms 23.257 ms 23.302 ms]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild
```
Hmmm, miri clearly does not like this even though the leak is intentional. Not sure how to tackle this without wholesale disabling leak detection.
Another alternative is to follow through with the `repr(transparent)` use detailed in #111, and instead allow `LeakedExecutor` to be const-constructible (would require `ConcurrentQueue::unbounded` to be const). This way it could be used in a `static`.
It took me a second to wrap my head around what this does conceptually, but now that I get it, I think this is an interesting idea as an optimization. I have no opinion on the actual impl, so I'll defer to someone else to review.
As perhaps a tiny bikeshed point to raise: when I saw `LeakedExecutor` it initially threw me a little, as leaking is generally undesirable, and I wasn't sure why we would want to leak an executor.

However, I think we could reframe what this does as: "it enables an executor to exist for the duration of the entire program". This is something we often call "static" (e.g. `'static`, `static`, `lazy_static!`), and I wonder if maybe the name `StaticExecutor` might work? It's a genuine question; I'm not yet convinced this is better per se, but I wanted to raise the option to discuss.
Can we ensure/check that the `LeakedExecutor` methods are correctly inlined? (This isn't directly clear when they contain `await` and such.)
> Can we ensure/check that the `LeakedExecutor` methods are correctly inlined? (This isn't directly clear when they contain `await` and such.)

I don't know how this would be automatically checked, or if it really matters, given that the benchmarks seem to be faster than normal anyway.
> Another alternative is to follow through with the `repr(transparent)` use detailed in #111, and instead allow `LeakedExecutor` to be const-constructible (would require `ConcurrentQueue::unbounded` to be const). This way it could be used in a `static`.
Any thoughts on this construction instead? This would avoid needing to wrap a `StaticExecutor` in a static variable, or a `StaticLocalExecutor` in a `thread_local!`, without the overhead of a `OnceLock`.
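To illustrate what const-constructibility buys: a type whose constructor is a `const fn` can be placed directly in a `static`, initialized at compile time, with no `OnceLock` or lazy-init indirection on access. A hedged sketch, where `Queue` is a toy stand-in for `ConcurrentQueue` rather than the real type:

```rust
use std::sync::Mutex;

// Toy stand-in for ConcurrentQueue; a Mutex<Vec<_>> is enough to show
// the const-construction pattern, not the lock-free implementation.
struct Queue {
    items: Mutex<Vec<u32>>,
}

impl Queue {
    // The key requirement from the discussion: an unbounded
    // constructor usable in const context.
    const fn unbounded() -> Self {
        Queue {
            items: Mutex::new(Vec::new()),
        }
    }
}

// No OnceLock, no lazy init: the static is fully built at compile time.
static QUEUE: Queue = Queue::unbounded();

fn main() {
    QUEUE.items.lock().unwrap().push(7);
    assert!(!QUEUE.items.lock().unwrap().is_empty());
    println!("{}", QUEUE.items.lock().unwrap().len());
}
```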
Re my comment: I don't think it could be automatically checked, but maybe we should still check the assembly that gets generated for obvious unnecessary indirections. Just because it is already much faster doesn't mean there aren't other easy steps possible to make it even better.
@fogti we could desugar the user-facing versions of it and force an `#[inline(always)]` annotation on the `State` implementations for the functions. It definitely looks worse in terms of user-facing documentation, though I don't think it's a breaking change to do so:

```rust
// before
pub async fn run<T>(&self, fut: impl Future<Output = T>) -> T;

// after
pub fn run<'r, T: 'r>(&'r self, future: impl Future<Output = T> + 'r) -> impl Future<Output = T> + 'r;
```
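A minimal sketch of why the desugared form helps: `#[inline(always)]` on a plain `fn` returning `impl Future` applies straightforwardly to the function that constructs the future, whereas its effect on an `async fn` body is less direct. `Exec` and the busy-polling `block_on` below are toy stand-ins for illustration, not the real async-executor types:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

struct Exec;

impl Exec {
    // Desugared form: the attribute sits on an ordinary fn, so the
    // future-constructing call can be inlined at the call site.
    #[inline(always)]
    fn run<'r, T: 'r>(&'r self, future: impl Future<Output = T> + 'r) -> impl Future<Output = T> + 'r {
        async move { future.await }
    }
}

// Tiny poll-to-completion helper using a no-op waker; it busy-polls,
// so it is only suitable for futures that never actually park.
fn block_on<F: Future>(mut fut: F) -> F::Output {
    fn noop(_: *const ()) {}
    fn clone(p: *const ()) -> RawWaker {
        RawWaker::new(p, &VTABLE)
    }
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    let waker = unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) };
    let mut cx = Context::from_waker(&waker);
    // SAFETY: `fut` is a local that is never moved after being pinned.
    let mut fut = unsafe { Pin::new_unchecked(&mut fut) };
    loop {
        if let Poll::Ready(v) = fut.as_mut().poll(&mut cx) {
            return v;
        }
    }
}

fn main() {
    let out = block_on(Exec.run(async { 41 + 1 }));
    assert_eq!(out, 42);
    println!("{out}");
}
```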
@james7132 of course we can (and in the past I would've suggested exactly that), but it would be a good idea to at least check whether the `async` functions get properly inlined with current stable Rust when marked with `#[inline(always)]`; just because it was a problem in the past doesn't mean it still is (as was suggested elsewhere a few months ago).
`cargo bench --all --all-features` currently fails. This is a hard blocker.
Hmm, are benchmarks not built in CI? Odd.
Some of the benchmarks appear to be a bit too small, and some have a very large variance, making them unsuitable for properly measuring their performance in larger systems:

```
multi_thread/executor::spawn_batch
                        time:   [31.896 µs 35.826 µs 39.825 µs]
                        change: [-74.713% +27.283% +332.75%] (p = 0.84 > 0.05)
                        No change in performance detected.
```
`perf report`, top overhead:

```
# Overhead  Command          Shared Object              Symbol
# ........  ...............  .........................  ....................................................
#
    26.19%  executor-6b7a70  executor-6b7a7073e5281258  [.] concurrent_queue::unbounded::Unbounded<T>::pop
     9.43%  executor-6b7a70  executor-6b7a7073e5281258  [.] concurrent_queue::unbounded::Unbounded<T>::push
     7.15%  executor-6b7a70  executor-6b7a7073e5281258  [.] criterion::stats::univariate::kde::Kde<A,K>::map
     5.37%  executor-6b7a70  executor-6b7a7073e5281258  [.] async_executor::Runner::runnable::{{closure}}
     4.09%  executor-6b7a70  executor-6b7a7073e5281258  [.] async_executor::State::notify
     3.73%  executor-6b7a70  executor-6b7a7073e5281258  [.] concurrent_queue::bounded::Bounded<T>::pop
     2.57%  executor-6b7a70  executor-6b7a7073e5281258  [.] std::sys::unix::locks::futex_mutex::Mutex::lock_contended
     2.33%  executor-6b7a70  executor-6b7a7073e5281258  [.] concurrent_queue::bounded::Bounded<T>::push_or_else
     2.19%  executor-6b7a70  executor-6b7a7073e5281258  [.] async_task::raw::RawTask<F,T,S,M>::run
     2.08%  executor-6b7a70  executor-6b7a7073e5281258  [.] <async_task::task::Task<T,M> as core::ops::drop::Drop>::drop
     1.90%  executor-6b7a70  executor-6b7a7073e5281258  [.] async_task::raw::RawTask<F,T,S,M>::run
```
I also noted the variance with `spawn_batch`, but that particular benchmark shouldn't be affected that strongly by this PR, beyond moving the `try_tick`/`tick`/`run` implementations to `State`. It's interesting to see the unbounded variants of the queue at the top, which corroborates potential gains from the previous attempt to directly enqueue onto local queues.
For a more realistic workload, I tested this against Bevy's `many_foxes` stress test and saw a 66% reduction in time spent spawning tasks in the system executor.
Once this is rebased it can be merged.