
Data race detector causes massive slowdown

Open shepmaster opened this issue 4 years ago • 15 comments

Steps to reproduce:

Check out https://github.com/shepmaster/sxd-string-slab/commit/8bf847eff2a8a1609c818cae68d227156f5f3dc1, then run Miri:

time MIRIFLAGS='-Zmiri-disable-isolation -Zmiri-disable-stacked-borrows -Zmiri-disable-data-race-detector' \
  CARGO_TARGET_DIR=target/miri \
  cargo miri test -- --nocapture
# Reports ~23 seconds on my machine

time MIRIFLAGS='-Zmiri-disable-isolation -Zmiri-disable-data-race-detector' \
  CARGO_TARGET_DIR=target/miri \
  cargo miri test -- --nocapture
# Reports ~16.5 minutes on my machine

time MIRIFLAGS='-Zmiri-disable-isolation -Zmiri-disable-stacked-borrows' \
  CARGO_TARGET_DIR=target/miri \
  cargo miri test -- --nocapture
# Reports ~4.25 minutes on my machine

Meta

  • macOS

  • rustc --version --verbose

    rustc 1.51.0-nightly (1d0d76f8d 2021-01-24)
    binary: rustc
    commit-hash: 1d0d76f8dd4f5f6ecbeab575b87edaf1c9f56bb8
    commit-date: 2021-01-24
    host: x86_64-apple-darwin
    release: 1.51.0-nightly
    LLVM version: 11.0.1
    
  • cargo miri --version --verbose

    miri 0.1.0 (1cf1a2e4 2021-01-15)
    

shepmaster avatar Jan 25 '21 15:01 shepmaster

Cc @JCTyblaidd any idea what could be happening here? This is a 10x slowdown due to the data race detector.

OTOH, it's also a 40x slowdown due to Stacked Borrows, which is much more than what I observed elsewhere.

RalfJung avatar Jan 25 '21 18:01 RalfJung

I will look into this properly later:

My guess would be that criterion is spawning a large set of threads for which the data-race detector is enabled, potentially with delayed synchronization, so from the perspective of the data-race detector there is a very large set of live concurrent threads to detect races for (probably crossing the smallvec threshold). I need to try to track thread usage for a proper analysis.
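For reference, here is a rough sketch of why that would hurt, assuming a vector-clock based detector (illustrative only, not Miri's actual implementation): every tracked access has to join/compare clocks whose length is the number of threads the detector has seen, so the per-access work grows with the live-thread count.

#[derive(Clone, Default)]
struct VClock(Vec<u64>); // index = thread id, value = logical timestamp

impl VClock {
    // Happens-before join: O(number of tracked threads) per call.
    fn join(&mut self, other: &VClock) {
        if other.0.len() > self.0.len() {
            self.0.resize(other.0.len(), 0);
        }
        for (mine, theirs) in self.0.iter_mut().zip(&other.0) {
            *mine = (*mine).max(*theirs);
        }
    }

    // An access is race-free only if it happens-after all prior conflicting
    // accesses; checking that is again linear in the thread count.
    fn dominates(&self, other: &VClock) -> bool {
        other.0.iter().enumerate().all(|(tid, &ts)| {
            self.0.get(tid).copied().unwrap_or(0) >= ts
        })
    }
}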

JCTyblaidd avatar Jan 27 '21 11:01 JCTyblaidd

that criterion is spawning a large set of threads

The code in question is not using criterion (the benchmarks are in a separate child crate). This code is using proptest. When I run the tests outside of Miri, I see that the CPU is pegged at 100% (i.e. roughly a single core), indicating that this specific test isn't making heavy use of multiple threads (it may still be creating threads that are just chilling, however).

I also asked the test runner to use only one thread (--test-threads 1), with a small but not huge change:

time MIRIFLAGS='-Zmiri-disable-isolation -Zmiri-disable-stacked-borrows' \
  CARGO_TARGET_DIR=target/miri \
  cargo miri test -- --nocapture --test-threads 1
# Reports ~3.5 minutes on my machine

Additionally, it doesn't look like there's much use of multithreading from a CPU usage perspective outside of Miri:

time cargo test -- --nocapture

# real   7.088s  (7088012 us)
# user   6.970s  (6970495 us)
# sys    0.258s  (257838 us)
# cpu    101%

shepmaster avatar Jan 27 '21 15:01 shepmaster

I also asked the test runner to use only one thread (--test-threads 1), with a small but not huge change:

--test-threads 1 should be the default under Miri anyway (when asked for how many CPUs the system has, Miri always returns 1). But proptest itself might spawn threads. However, Miri will print "thread support is experimental" when the first non-main thread is spawned; do you see that message?
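(As a sanity check, a minimal program like the following, run under cargo miri run, should print that warning on the first spawn. This is just a sketch, not code from the project under test.)

fn main() {
    // Under Miri, the first non-main thread spawned should trigger the
    // "thread support is experimental" warning on stderr.
    std::thread::spawn(|| println!("hello from a second thread"))
        .join()
        .unwrap();
}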

RalfJung avatar Jan 27 '21 17:01 RalfJung

do you see that message?

I do not:
MIRIFLAGS='-Zmiri-disable-isolation -Zmiri-disable-stacked-borrows -Zmiri-disable-data-race-detector' \
  CARGO_TARGET_DIR=target/miri \
  cargo miri test -- --nocapture
   Compiling sxd-string-slab v0.1.0 (/Users/shep/Projects/sxd/string-slab)
    Finished test [unoptimized + debuginfo] target(s) in 0.24s
     Running target/miri/x86_64-apple-darwin/debug/deps/sxd_string_slab-d5887c0cf1c6b66d

running 5 tests
test test::all_interned_keys_equal_each_other ... ok
test test::interning_has_equal_pointer ... ok
test test::interning_preserves_string_values ... ok
test test::interning_twice_creates_equal_values ... ok
test test::interning_two_values_creates_non_equal_values ... ok

test result: ok. 5 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

   Doc-tests sxd-string-slab
Running doctests is not currently supported by Miri.

shepmaster avatar Jan 27 '21 18:01 shepmaster

If there is always exactly 1 thread, then the main data-race detection code should never run.

Then the other possibility is that there is a large difference, maybe due to lack of inlining, between the first two data-race detection skip checks (globally disabled vs only 1 active thread). Might want to extract:

if self.global.multi_threaded.get() {

from read and write/unique_access into an #[inline(always)] function, or mark read, write, and unique_access as #[inline].
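Roughly what that could look like, sketched with made-up types rather than Miri's actual ones: keep the "is more than one thread live?" test in a tiny always-inlined helper so that single-threaded programs pay only that one branch per access.

use std::cell::Cell;

struct GlobalState {
    multi_threaded: Cell<bool>,
}

struct DataRaceDetector {
    global: GlobalState,
}

impl DataRaceDetector {
    #[inline(always)]
    fn race_detecting(&self) -> bool {
        self.global.multi_threaded.get()
    }

    fn read(&self, addr: u64, len: u64) {
        if !self.race_detecting() {
            return; // fast path: only one live thread, nothing to check
        }
        // slow path: vector-clock bookkeeping would go here
        let _ = (addr, len);
    }
}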

JCTyblaidd avatar Jan 28 '21 15:01 JCTyblaidd

I can't repro performance differences between with/without -Zmiri-disable-data-race-detector on x86_64-apple-darwin on a build of miri master (39a7bd09f16c6556747fe1ebe0e167e1c3080e58).

A quick profile with both the race detector and stacked borrows on gives the following (inverted) call tree:

33.29 s   69.4%	33.29 s	 	  _$LT$core..iter..adapters..rev..Rev$LT$I$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$::try_fold::hae1347c1e39f9ef1
33.29 s   69.3%	0 s	 	   miri::stacked_borrows::Stack::find_granting::h1bc045f8cfcb46a2
18.59 s   38.7%	0 s	 	    miri::stacked_borrows::Stack::grant::hd3f875dd7ebdcaa7
18.54 s   38.6%	0 s	 	     miri::stacked_borrows::Stacks::for_each::hdedf8d6a52690627
18.54 s   38.6%	0 s	 	      miri::helpers::EvalContextExt::visit_freeze_sensitive::hcd00899a5604a512
18.54 s   38.6%	0 s	 	       miri::stacked_borrows::EvalContextPrivExt::reborrow::h36398b4336eca798
18.54 s   38.6%	0 s	 	        miri::stacked_borrows::EvalContextPrivExt::retag_reference::hdb62d7d894f5f35d
18.54 s   38.6%	0 s	 	         miri::stacked_borrows::EvalContextExt::retag::h135215746e29b006
... <snip>
14.70 s   30.6%	0 s	 	    miri::stacked_borrows::Stack::access::h1e11b7d9a47fe816
14.69 s   30.6%	0 s	 	     miri::stacked_borrows::Stack::grant::hd3f875dd7ebdcaa7
14.64 s   30.5%	0 s	 	      miri::stacked_borrows::Stacks::for_each::hdedf8d6a52690627
14.64 s   30.5%	0 s	 	       miri::helpers::EvalContextExt::visit_freeze_sensitive::hcd00899a5604a512
14.64 s   30.5%	0 s	 	        miri::stacked_borrows::EvalContextPrivExt::reborrow::h36398b4336eca798
14.64 s   30.5%	0 s	 	         miri::stacked_borrows::EvalContextPrivExt::retag_reference::hdb62d7d894f5f35d
14.64 s   30.5%	0 s	 	          miri::stacked_borrows::EvalContextExt::retag::h135215746e29b006
14.64 s   30.5%	0 s	 	           rustc_mir::interpret::step::_$LT$impl$u20$rustc_mir..interpret..eval_context..InterpCx$LT$M$GT$$GT$::statement::h25ad5f561a1df15d

Which points at https://github.com/rust-lang/miri/blob/master/src/stacked_borrows.rs#L248 (Stack::find_granting) via https://github.com/rust-lang/miri/blob/master/src/stacked_borrows.rs#L522 (EvalContextPrivExt::reborrow). That said, it's probably not surprising that this is a slow part of Stacked Borrows evaluation.
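For context, a rough stand-in for what find_granting does (simplified, not Miri's actual code): a reverse scan of the per-location borrow stack, so the cost scales with stack depth times the number of locations a retag has to visit.

#[derive(PartialEq)]
enum Permission { Unique, SharedReadWrite, SharedReadOnly }

struct Item { tag: u64, perm: Permission }

// Scan the borrow stack from the top for an item that grants the access.
// Deep stacks make every retag/access linear in the stack height.
fn find_granting(stack: &[Item], need_write: bool, tag: u64) -> Option<usize> {
    stack
        .iter()
        .enumerate()
        .rev()
        .find(|(_, item)| {
            item.tag == tag && (!need_write || item.perm != Permission::SharedReadOnly)
        })
        .map(|(idx, _)| idx)
}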

thomcc avatar Feb 01 '21 02:02 thomcc

Perhaps it's been fixed in Miri already? I still see it with Miri 1cf1a2e4:

Run log
/tmp % git clone https://github.com/shepmaster/sxd-string-slab/
Cloning into 'sxd-string-slab'...
remote: Enumerating objects: 67, done.
remote: Counting objects: 100% (67/67), done.
remote: Compressing objects: 100% (27/27), done.
remote: Total 67 (delta 24), reused 65 (delta 23), pack-reused 0
Receiving objects: 100% (67/67), 18.75 KiB | 6.25 MiB/s, done.
Resolving deltas: 100% (24/24), done.

/tmp % cd sxd-string-slab
sxd-string-slab % git checkout 8bf847eff2a8a1609c818cae68d227156f5f3dc1
Note: switching to '8bf847eff2a8a1609c818cae68d227156f5f3dc1'.

# Blah blah git

HEAD is now at 8bf847e Add basic performance and memory benchmarks

sxd-string-slab % time MIRIFLAGS='-Zmiri-disable-isolation -Zmiri-disable-stacked-borrows -Zmiri-disable-data-race-detector' \
  CARGO_TARGET_DIR=target/miri \
  cargo +nightly miri test -- --nocapture

# Initial build, I killed it to get a clean timing

sxd-string-slab % time MIRIFLAGS='-Zmiri-disable-isolation -Zmiri-disable-stacked-borrows -Zmiri-disable-data-race-detector' \
  CARGO_TARGET_DIR=target/miri \
  cargo +nightly miri test -- --nocapture
   Compiling sxd-string-slab v0.1.0 (/private/tmp/sxd-string-slab)
    Finished test [unoptimized + debuginfo] target(s) in 0.05s
     Running target/miri/x86_64-apple-darwin/debug/deps/sxd_string_slab-9c3b8d0214db5420

running 5 tests
test test::all_interned_keys_equal_each_other ... ok
test test::interning_has_equal_pointer ... ok
test test::interning_preserves_string_values ... ok
test test::interning_twice_creates_equal_values ... ok
test test::interning_two_values_creates_non_equal_values ... ok

test result: ok. 5 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

   Doc-tests sxd-string-slab
Running doctests is not currently supported by Miri.

real  1:28.89  (88886665 us)
user  1:24.27  (84273102 us)
sys   2.344s   (2344129 us)
cpu   97%
mem   570664 KiB

sxd-string-slab % time MIRIFLAGS='-Zmiri-disable-isolation -Zmiri-disable-stacked-borrows' \
  CARGO_TARGET_DIR=target/miri \
  cargo +nightly miri test -- --nocapture
    Finished test [unoptimized + debuginfo] target(s) in 0.15s
     Running target/miri/x86_64-apple-darwin/debug/deps/sxd_string_slab-9c3b8d0214db5420

running 5 tests
test test::all_interned_keys_equal_each_other ... ok
test test::interning_has_equal_pointer ... ok
test test::interning_preserves_string_values ... ok
test test::interning_twice_creates_equal_values ... ok
test test::interning_two_values_creates_non_equal_values ... ok

test result: ok. 5 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

   Doc-tests sxd-string-slab
Running doctests is not currently supported by Miri.

real  2:15.65  (135650094 us)
user  2:08.57  (128569996 us)
sys   3.488s   (3488009 us)
cpu   97%
mem   969060 KiB

sxd-string-slab % cargo +nightly miri --version
miri 0.1.0 (1cf1a2e4 2021-01-15)

sxd-string-slab % cargo +nightly --version
cargo 1.51.0-nightly (783bc43c6 2021-01-20)

shepmaster avatar Feb 01 '21 16:02 shepmaster

Nothing major changed in Miri that would explain such a fix...

RalfJung avatar Feb 01 '21 16:02 RalfJung

For clarity, what I meant by my inability to repro performance differences is: it still takes about 5 minutes (on a decently fast but not blistering 2019 macbook) to run with both -Zmiri-disable-stacked-borrows and -Zmiri-disable-data-race-detector.

Which is to say, I don't see the 23s. So, the opposite of "it got fixed on master" might be possible: perhaps whatever was making that fast in your case has regressed on master.

Or it could just be weird compiler shenanigans. (This is actually somewhat plausible to me if the inlining theory in https://github.com/rust-lang/miri/issues/1689#issuecomment-769173377 is true, although that would be a crazy amount of perf to get from extra inlining, tbh.)

Unfortunately, I'm not 100% sure how to go back to rustup-managed miri/cargo-miri after running ./miri install, so I can't easily check this.

thomcc avatar Feb 01 '21 17:02 thomcc

Removing ~/.cargo/bin/miri and ~/.cargo/bin/cargo-miri followed by rustup self update would work I think.

bjorn3 avatar Feb 01 '21 17:02 bjorn3

Removing ~/.cargo/bin/miri and ~/.cargo/bin/cargo-miri followed by rustup self update would work I think.

I just tried this and, erm, don't recommend it (even with rustup update and with manually reinstalling the nightly toolchain) since I think I may have somewhat messed up my setup trying to make it work. sigh

This is off topic for this discussion though.

thomcc avatar Feb 01 '21 17:02 thomcc

I mean rustup self update, not rustup update. rustup self update will re-install all rustup shims in ~/.cargo/bin.

bjorn3 avatar Feb 01 '21 17:02 bjorn3

The runtime of this example is extremely erratic. I just ran the tests a few times with SB and the data race detector off, and I'm seeing runtimes between 8 and 80 seconds. When I profile Miri with the data race detector on, all I can find is ~2% of CPU cycles in total associated with the concurrency module.

Under normal execution the run-to-run variance is 15% (from perf stat -r10 cargo test). With --jobs=1 -- --test-threads=1, the run-to-run variance is 1%, which is what I would expect. If I pass those flags to Miri, the run-to-run variance is reduced but still huge.

So I suspect that somehow the interpreter is amplifying some randomization in proptest. If we're randomly generating test inputs, there are probably some inputs which happen to hit code paths in the interpreter that are very slow.

In any case, this slowdown would be a good thing to work on... If I could reproduce it reliably (I can't), and if I could diagnose it with a profile (I can't).
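One way to probe that theory (a sketch, not code from the repo; the test name and case count are reused/made up for illustration) would be to pin proptest to a small, fixed number of cases so that successive Miri runs do comparable amounts of work:

use proptest::prelude::*;

proptest! {
    // Fixed, small case count so every run generates the same amount of work.
    // (The PROPTEST_CASES env var can override this at runtime.)
    #![proptest_config(ProptestConfig::with_cases(16))]

    #[test]
    fn interning_preserves_string_values(s in ".*") {
        // real property elided; any cheap assertion works for timing purposes
        prop_assert_eq!(s.as_str(), s.as_str());
    }
}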

saethlin avatar Jul 10 '22 05:07 saethlin

The serde2 benchmark was meant to capture some concurrency, but it doesn't help show this slowdown at all. That is possibly because there is only read-only state being shared across threads, no atomic accesses or anything like that. But I also don't know what a better concurrency benchmark would be.

RalfJung avatar Aug 07 '22 02:08 RalfJung

I was trying to minimize a massive slowdown which is observed in our code and got the following snippet:

#![feature(new_uninit)]
use std::sync::atomic::{AtomicU32, Ordering::*};

const PAGE_SIZE: usize = 1 << 12;
const PAGES: usize = 16;
const ITEMS: usize = PAGE_SIZE * PAGES;

type AtomicPage = [AtomicU32; PAGE_SIZE];

#[test]
fn init_atomic() {
    let pages: [*mut AtomicPage; PAGES] = [(); PAGES]
        .map(|()| unsafe { Box::<AtomicPage>::new_zeroed().assume_init() })
        .map(Box::into_raw);

    for i in 0..ITEMS {
        let page_idx = i / PAGE_SIZE;
        let inpage_idx = i % PAGE_SIZE;
        let page: &AtomicPage = unsafe { &*(pages[page_idx]) };
        page[inpage_idx].store(u32::MAX, Release);
    }

    for page_ptr in pages {
        drop(unsafe { Box::from_raw(page_ptr) });
    }
}

On my laptop I get the following results (more than 50x slowdown):

$ time MIRIFLAGS='-Zmiri-disable-stacked-borrows' cargo miri test

real	0m5,049s
user	0m4,786s
sys	0m0,265s
$ time cargo miri test

real	4m19,256s
user	4m18,831s
sys	0m0,334s

Replacing [*mut AtomicPage; PAGES] with [Box<AtomicPage>; PAGES] reduces slowdown to less than 2x.
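For completeness, the Box variant looks roughly like this (untested sketch of the change described above):

#![feature(new_uninit)]
use std::sync::atomic::{AtomicU32, Ordering::*};

const PAGE_SIZE: usize = 1 << 12;
const PAGES: usize = 16;
const ITEMS: usize = PAGE_SIZE * PAGES;

type AtomicPage = [AtomicU32; PAGE_SIZE];

#[test]
fn init_atomic_boxed() {
    // Owning Boxes instead of raw pointers: no unsafe derefs on access, and
    // the pages are freed by normal drop at the end of the function.
    let pages: [Box<AtomicPage>; PAGES] =
        [(); PAGES].map(|()| unsafe { Box::<AtomicPage>::new_zeroed().assume_init() });

    for i in 0..ITEMS {
        let page_idx = i / PAGE_SIZE;
        let inpage_idx = i % PAGE_SIZE;
        pages[page_idx][inpage_idx].store(u32::MAX, Release);
    }
}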

newpavlov avatar Jul 05 '24 15:07 newpavlov

@newpavlov You are reporting a slowdown with Stacked Borrows, not the data race detector, which is the topic of this issue. Your example does not have a significant slowdown due to the data race detector.

saethlin avatar Jul 05 '24 19:07 saethlin

Since nobody seems to have been able to reproduce this... how sure are we that the slowdown is even real?

I tried two stress tests that have many (albeit short-lived) threads and got around a 20% slowdown in both cases, which I would not qualify as "massive". This was comparing -Zmiri-disable-stacked-borrows with -Zmiri-disable-stacked-borrows -Zmiri-disable-data-race-detector, so given the much larger slowdown of Stacked Borrows, this will be hardly noticeable when comparing the default with -Zmiri-disable-data-race-detector. (My stress tests were tests/pass/0weak_memory_consistency.rs from the repo and the scope_join_race test from the standard library.)
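For reference, the kind of stress test I mean looks roughly like this (a sketch in the same spirit, not the actual files named above): many short-lived threads doing synchronized accesses to shared state.

use std::sync::atomic::{AtomicUsize, Ordering::Relaxed};

static COUNTER: AtomicUsize = AtomicUsize::new(0);

fn main() {
    for _ in 0..100 {
        // Each iteration spawns a few short-lived threads that all touch the
        // same atomic, so the data-race detector has real work to do.
        std::thread::scope(|s| {
            for _ in 0..4 {
                s.spawn(|| {
                    for _ in 0..1_000 {
                        COUNTER.fetch_add(1, Relaxed);
                    }
                });
            }
        });
    }
    assert_eq!(COUNTER.load(Relaxed), 100 * 4 * 1_000);
}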

RalfJung avatar Jul 05 '24 21:07 RalfJung

how sure are we that the slowdown is even real?

Considering that nobody has managed to reproduce such a slowdown on an issue that's over 3 years old I'm just going to close this. If anyone finds a program that spends more than half its runtime in the data race detector I would rather see a fresh issue filed about that program.

saethlin avatar Jul 05 '24 21:07 saethlin