miri
Data race detector causes massive slowdown
Steps to reproduce:
Check out https://github.com/shepmaster/sxd-string-slab/commit/8bf847eff2a8a1609c818cae68d227156f5f3dc1, then run Miri:
time MIRIFLAGS='-Zmiri-disable-isolation -Zmiri-disable-stacked-borrows -Zmiri-disable-data-race-detector' \
CARGO_TARGET_DIR=target/miri \
cargo miri test -- --nocapture
# Reports ~23 seconds on my machine
time MIRIFLAGS='-Zmiri-disable-isolation -Zmiri-disable-data-race-detector' \
CARGO_TARGET_DIR=target/miri \
cargo miri test -- --nocapture
# Reports ~16.5 minutes on my machine
time MIRIFLAGS='-Zmiri-disable-isolation -Zmiri-disable-stacked-borrows' \
CARGO_TARGET_DIR=target/miri \
cargo miri test -- --nocapture
# Reports ~4.25 minutes on my machine
Meta

- macOS
- rustc --version --verbose:
  rustc 1.51.0-nightly (1d0d76f8d 2021-01-24)
  binary: rustc
  commit-hash: 1d0d76f8dd4f5f6ecbeab575b87edaf1c9f56bb8
  commit-date: 2021-01-24
  host: x86_64-apple-darwin
  release: 1.51.0-nightly
  LLVM version: 11.0.1
- cargo miri --version --verbose:
  miri 0.1.0 (1cf1a2e4 2021-01-15)
Cc @JCTyblaidd any idea what could be happening here? This is a 10x slowdown due to the data race detector.
OTOH, it's also a 40x slowdown due to Stacked Borrows, which is much more than what I observed elsewhere.
I will look into this properly later:
My guess would be that criterion is spawning a large set of threads for which the data-race detector is enabled, potentially a large set with delayed synchronization - so from the perspective of the data-race detector there is a very large set of live concurrent threads to detect races for (probably crossing the smallvec threshold). I need to try to track thread usage for a proper analysis.
that criterion is spawning a large set of threads
The code in question is not using criterion (the benchmarks are in a separate child crate). This code is using proptest. When I run the tests outside of Miri, I see that the CPU is pegged at 100%, indicating that this specific test isn't making heavy use of multiple threads (it may still be creating threads that are just chilling, however).
I also asked the test runner to use only one thread (--test-threads 1), with a small but not huge change:
time MIRIFLAGS='-Zmiri-disable-isolation -Zmiri-disable-stacked-borrows' \
CARGO_TARGET_DIR=target/miri \
cargo miri test -- --nocapture --test-threads 1
# Reports ~3.5 minutes on my machine
Additionally, it doesn't look like there's much use of multithreading from a CPU usage perspective outside of Miri:
time cargo test -- --nocapture
# real 7.088 7088012us
# user 6.970 6970495us
# sys 0.258 257838us
# cpu 101%
I also asked the test runner to use only one thread (--test-threads 1), with a small but not huge change:
--test-threads 1 should be the default under Miri anyway (when asked for how many CPUs the system has, Miri always returns 1).
But proptest itself might spawn threads. However, Miri will print "thread support is experimental" when the first non-main thread is spawned; do you see that message?
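As a reference point, a minimal standalone program (not from the project under test) that spawns one extra thread and should therefore trigger that message:

use std::thread;

// Spawning any second thread should make Miri print its "thread support is
// experimental" warning when this is run under `cargo miri run`.
fn main() {
    let handle = thread::spawn(|| println!("hello from a second thread"));
    handle.join().unwrap();
}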
do you see that message?
I do not
MIRIFLAGS='-Zmiri-disable-isolation -Zmiri-disable-stacked-borrows -Zmiri-disable-data-race-detector' \
CARGO_TARGET_DIR=target/miri \
cargo miri test -- --nocapture
Compiling sxd-string-slab v0.1.0 (/Users/shep/Projects/sxd/string-slab)
Finished test [unoptimized + debuginfo] target(s) in 0.24s
Running target/miri/x86_64-apple-darwin/debug/deps/sxd_string_slab-d5887c0cf1c6b66d
running 5 tests
test test::all_interned_keys_equal_each_other ... ok
test test::interning_has_equal_pointer ... ok
test test::interning_preserves_string_values ... ok
test test::interning_twice_creates_equal_values ... ok
test test::interning_two_values_creates_non_equal_values ... ok
test result: ok. 5 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Doc-tests sxd-string-slab
Running doctests is not currently supported by Miri.
If there is always exactly 1 thread, then the main data-race detection code should never run.
The other possibility, then, is that there is a large difference, maybe due to lack of inlining, between the first two data-race detection skip checks (globally disabled vs. only 1 active thread). Might want to extract:
if self.global.multi_threaded.get() {
from read and write/unique_access into an #[inline(always)] function, or mark read, write, and unique_access as #[inline].
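A minimal, self-contained sketch of that suggested shape (illustrative names only, not Miri's actual types):

use std::cell::Cell;

// Illustrative sketch only -- these are not Miri's real types. The idea is to
// hoist the `multi_threaded` check into an #[inline(always)] helper so the
// single-threaded fast path is trivially inlined into read/write/unique_access.
struct GlobalState {
    multi_threaded: Cell<bool>,
}

struct MemoryCellClocks {
    global: GlobalState,
}

impl MemoryCellClocks {
    #[inline(always)]
    fn race_detecting(&self) -> bool {
        self.global.multi_threaded.get()
    }

    #[inline]
    fn read(&self) {
        if !self.race_detecting() {
            return; // only one thread so far: skip all vector-clock bookkeeping
        }
        // ... full data-race detection logic would go here ...
    }
}

fn main() {
    let clocks = MemoryCellClocks {
        global: GlobalState { multi_threaded: Cell::new(false) },
    };
    clocks.read(); // takes the fast path
}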
I can't repro performance differences between with/without -Zmiri-disable-data-race-detector on x86_64-apple-darwin on a build of miri master (39a7bd09f16c6556747fe1ebe0e167e1c3080e58).
A quick profile with both the race detector and stacked borrows on gives the following (inverted) call tree:
33.29 s   69.4%   33.29 s  <core::iter::adapters::rev::Rev<I> as core::iter::traits::iterator::Iterator>::try_fold
33.29 s   69.3%       0 s    miri::stacked_borrows::Stack::find_granting
18.59 s   38.7%       0 s      miri::stacked_borrows::Stack::grant
18.54 s   38.6%       0 s        miri::stacked_borrows::Stacks::for_each
18.54 s   38.6%       0 s          miri::helpers::EvalContextExt::visit_freeze_sensitive
18.54 s   38.6%       0 s            miri::stacked_borrows::EvalContextPrivExt::reborrow
18.54 s   38.6%       0 s              miri::stacked_borrows::EvalContextPrivExt::retag_reference
18.54 s   38.6%       0 s                miri::stacked_borrows::EvalContextExt::retag
                                           ... <snip>
14.70 s   30.6%       0 s      miri::stacked_borrows::Stack::access
14.69 s   30.6%       0 s        miri::stacked_borrows::Stack::grant
14.64 s   30.5%       0 s          miri::stacked_borrows::Stacks::for_each
14.64 s   30.5%       0 s            miri::helpers::EvalContextExt::visit_freeze_sensitive
14.64 s   30.5%       0 s              miri::stacked_borrows::EvalContextPrivExt::reborrow
14.64 s   30.5%       0 s                miri::stacked_borrows::EvalContextPrivExt::retag_reference
14.64 s   30.5%       0 s                  miri::stacked_borrows::EvalContextExt::retag
14.64 s   30.5%       0 s                    rustc_mir::interpret::step::<impl rustc_mir::interpret::eval_context::InterpCx<M>>::statement
Which points at https://github.com/rust-lang/miri/blob/master/src/stacked_borrows.rs#L248 (Stack::find_granting) via https://github.com/rust-lang/miri/blob/master/src/stacked_borrows.rs#L522 (EvalContextPrivExt::reborrow). That said, I imagine it's not surprising that this is a slow part of Stacked Borrows evaluation.
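The Rev::try_fold frame fits the shape of find_granting, which is essentially a top-down linear scan of the borrow stack on every access/retag. A simplified, self-contained sketch of that shape (not Miri's actual code):

// Simplified sketch (not Miri's actual implementation): the borrow stack is
// scanned from the top down looking for an item that grants the required
// permission, so each access/retag costs O(stack depth).
#[derive(Clone, Copy, PartialEq)]
enum Permission {
    Unique,
    SharedReadWrite,
    SharedReadOnly,
}

#[derive(Clone, Copy)]
struct Item {
    perm: Permission,
    tag: u64,
}

struct Stack {
    borrows: Vec<Item>,
}

impl Stack {
    fn find_granting(&self, is_write: bool, tag: u64) -> Option<usize> {
        self.borrows
            .iter()
            .enumerate()
            .rev() // walk from the top of the stack
            .find(|(_, item)| {
                item.tag == tag
                    && (!is_write || item.perm != Permission::SharedReadOnly)
            })
            .map(|(idx, _)| idx)
    }
}

fn main() {
    let stack = Stack {
        borrows: vec![
            Item { perm: Permission::Unique, tag: 1 },
            Item { perm: Permission::SharedReadWrite, tag: 2 },
            Item { perm: Permission::SharedReadOnly, tag: 3 },
        ],
    };
    assert_eq!(stack.find_granting(true, 1), Some(0));
}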
Perhaps it's been fixed in Miri already? I still see it with Miri 1cf1a2e4:
Run log
/tmp % git clone https://github.com/shepmaster/sxd-string-slab/
Cloning into 'sxd-string-slab'...
remote: Enumerating objects: 67, done.
remote: Counting objects: 100% (67/67), done.
remote: Compressing objects: 100% (27/27), done.
remote: Total 67 (delta 24), reused 65 (delta 23), pack-reused 0
Receiving objects: 100% (67/67), 18.75 KiB | 6.25 MiB/s, done.
Resolving deltas: 100% (24/24), done.
/tmp % cd sxd-string-slab
sxd-string-slab % git checkout 8bf847eff2a8a1609c818cae68d227156f5f3dc1
Note: switching to '8bf847eff2a8a1609c818cae68d227156f5f3dc1'.
# Blah blah git
HEAD is now at 8bf847e Add basic performance and memory benchmarks
sxd-string-slab % time MIRIFLAGS='-Zmiri-disable-isolation -Zmiri-disable-stacked-borrows -Zmiri-disable-data-race-detector' \
CARGO_TARGET_DIR=target/miri \
cargo +nightly miri test -- --nocapture
# Initial build, I killed it to get a clean timing
sxd-string-slab % time MIRIFLAGS='-Zmiri-disable-isolation -Zmiri-disable-stacked-borrows -Zmiri-disable-data-race-detector' \
CARGO_TARGET_DIR=target/miri \
cargo +nightly miri test -- --nocapture
Compiling sxd-string-slab v0.1.0 (/private/tmp/sxd-string-slab)
Finished test [unoptimized + debuginfo] target(s) in 0.05s
Running target/miri/x86_64-apple-darwin/debug/deps/sxd_string_slab-9c3b8d0214db5420
running 5 tests
test test::all_interned_keys_equal_each_other ... ok
test test::interning_has_equal_pointer ... ok
test test::interning_preserves_string_values ... ok
test test::interning_twice_creates_equal_values ... ok
test test::interning_two_values_creates_non_equal_values ... ok
test result: ok. 5 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Doc-tests sxd-string-slab
Running doctests is not currently supported by Miri.
real  1:28.89   88886665us
user  1:24.27   84273102us
sys   2.344      2344129us
cpu   97%
mem   570664 KiB
sxd-string-slab % time MIRIFLAGS='-Zmiri-disable-isolation -Zmiri-disable-stacked-borrows' \
CARGO_TARGET_DIR=target/miri \
cargo +nightly miri test -- --nocapture
Finished test [unoptimized + debuginfo] target(s) in 0.15s
Running target/miri/x86_64-apple-darwin/debug/deps/sxd_string_slab-9c3b8d0214db5420
running 5 tests
test test::all_interned_keys_equal_each_other ... ok
test test::interning_has_equal_pointer ... ok
test test::interning_preserves_string_values ... ok
test test::interning_twice_creates_equal_values ... ok
test test::interning_two_values_creates_non_equal_values ... ok
test result: ok. 5 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Doc-tests sxd-string-slab
Running doctests is not currently supported by Miri.
real  2:15.65  135650094us
user  2:08.57  128569996us
sys   3.488      3488009us
cpu   97%
mem   969060 KiB
sxd-string-slab % cargo +nightly miri --version
miri 0.1.0 (1cf1a2e4 2021-01-15)
sxd-string-slab % cargo +nightly --version
cargo 1.51.0-nightly (783bc43c6 2021-01-20)
Nothing major changed in Miri that would explain such a fix...
For clarity, what I meant by my inability to repro performance differences is: it still takes about 5 minutes (on a decently fast but not blistering 2019 macbook) to run with both -Zmiri-disable-stacked-borrows and -Zmiri-disable-data-race-detector.
Which is to say, I don't see the 23s. So, the opposite of "it got fixed on master" might be possible: perhaps whatever was making that fast in your case has regressed on master.
Or it could just be weird compiler shenanigans (this is actually somewhat plausible to me if the inlining theory here https://github.com/rust-lang/miri/issues/1689#issuecomment-769173377 is true, although this would be a crazy amount of perf to get from extra inlining, tbh)
Unfortunately, I'm not 100% sure how to go back to rustup-managed miri/cargo-miri after running ./miri install so I can't easily check this.
Removing ~/.cargo/bin/miri and ~/.cargo/bin/cargo-miri followed by rustup self update would work I think.
Removing ~/.cargo/bin/miri and ~/.cargo/bin/cargo-miri followed by rustup self update would work I think.
I just tried this and, erm, don't recommend it (even with rustup update and with manually reinstalling the nightly toolchain) since I think I may have somewhat messed up my setup trying to make it work. sigh
This is off topic for this discussion though.
I mean rustup self update, not rustup update. rustup self update will re-install all rustup shims in ~/.cargo/bin.
The runtime of this example is extremely erratic. I just ran the tests a few times with SB and the data race detector off, and I'm seeing runtimes between 8 and 80 seconds. When I profile Miri with the data race detector on, all I can find is ~2% of CPU cycles in total associated with the concurrency module.
Under normal execution the run-to-run variance is 15% (from perf stat -r10 cargo test). With --jobs=1 -- --test-threads=1, the run-to-run variance is 1%, which is what I would expect. If I pass those flags to Miri, the run-to-run variance is reduced but still huge.
So I suspect that somehow the interpreter is amplifying some randomization in proptest. If we're randomly generating test inputs, there are probably some inputs which happen to hit code paths in the interpreter that are very slow.
In any case, this slowdown would be a good thing to work on... If I could reproduce it reliably (I can't), and if I could diagnose it with a profile (I can't).
The serde2 benchmark was meant to capture some concurrency, but it doesn't help show this slowdown at all. That is possibly because there is only read-only state being shared across threads, no atomic accesses or anything like that. But I also don't know what a better concurrency benchmark would be.
I was trying to minimize a massive slowdown which is observed in our code and got the following snippet:
#![feature(new_uninit)]
use std::sync::atomic::{AtomicU32, Ordering::*};
const PAGE_SIZE: usize = 1 << 12;
const PAGES: usize = 16;
const ITEMS: usize = PAGE_SIZE * PAGES;
type AtomicPage = [AtomicU32; PAGE_SIZE];
#[test]
fn init_atomic() {
let pages: [*mut AtomicPage; PAGES] = [(); PAGES]
.map(|()| unsafe { Box::<AtomicPage>::new_zeroed().assume_init() })
.map(Box::into_raw);
for i in 0..ITEMS {
let page_idx = i / PAGE_SIZE;
let inpage_idx = i % PAGE_SIZE;
let page: &AtomicPage = unsafe { &*(pages[page_idx]) };
page[inpage_idx].store(u32::MAX, Release);
}
for page_ptr in pages {
drop(unsafe { Box::from_raw(page_ptr) });
}
}
On my laptop I get the following results (more than 50x slowdown):
$ time MIRIFLAGS='-Zmiri-disable-stacked-borrows' cargo miri test
real 0m5,049s
user 0m4,786s
sys 0m0,265s
$ time cargo miri test
real 4m19,256s
user 4m18,831s
sys 0m0,334s
Replacing [*mut AtomicPage; PAGES] with [Box<AtomicPage>; PAGES] reduces slowdown to less than 2x.
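For comparison, a sketch of that Box-based variant, inferred from the description above rather than copied from the issue:

#![feature(new_uninit)]
use std::sync::atomic::{AtomicU32, Ordering::*};
const PAGE_SIZE: usize = 1 << 12;
const PAGES: usize = 16;
const ITEMS: usize = PAGE_SIZE * PAGES;
type AtomicPage = [AtomicU32; PAGE_SIZE];
// Box-owned variant (a sketch, not copied from the issue): the pages stay
// owned by Boxes, so there is no raw-pointer dereference and no manual
// Box::from_raw cleanup.
#[test]
fn init_atomic_boxed() {
    let pages: [Box<AtomicPage>; PAGES] =
        [(); PAGES].map(|()| unsafe { Box::<AtomicPage>::new_zeroed().assume_init() });
    for i in 0..ITEMS {
        let page_idx = i / PAGE_SIZE;
        let inpage_idx = i % PAGE_SIZE;
        pages[page_idx][inpage_idx].store(u32::MAX, Release);
    }
}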
@newpavlov You are reporting a slowdown with Stacked Borrows, not the data race detector, which is the topic of this issue. Your example does not have a significant slowdown due to the data race detector.
Since nobody seems to have been able to reproduce this... how sure are we that the slowdown is even real?
I tried two stress tests that have many (albeit short-lived) threads and got around a 20% slowdown in both cases, which I would not qualify as "massive". This was comparing -Zmiri-disable-stacked-borrows with -Zmiri-disable-stacked-borrows -Zmiri-disable-data-race-detector, so given the much larger slowdown of Stacked Borrows, this will be hardly noticeable when comparing the default with -Zmiri-disable-data-race-detector. (My stress tests were tests/pass/0weak_memory_consistency.rs from the repo and the scope_join_race test from the standard library.)
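For reference, the rough shape of such a stress test, sketched as a standalone program (this is not the actual scope_join_race test): many short-lived threads touching shared atomic state, so the data-race detector has real work to do.

use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// Many short-lived threads hammering a shared atomic; a rough stand-in for
// the kind of stress test mentioned above, not the actual std test.
static COUNTER: AtomicUsize = AtomicUsize::new(0);

fn main() {
    for _ in 0..20 {
        thread::scope(|s| {
            for _ in 0..4 {
                s.spawn(|| {
                    for _ in 0..100 {
                        COUNTER.fetch_add(1, Ordering::Relaxed);
                    }
                });
            }
        });
    }
    assert_eq!(COUNTER.load(Ordering::Relaxed), 20 * 4 * 100);
}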
how sure are we that the slowdown is even real?
Considering that nobody has managed to reproduce such a slowdown on an issue that's over 3 years old I'm just going to close this. If anyone finds a program that spends more than half its runtime in the data race detector I would rather see a fresh issue filed about that program.