
Massive slowdown caused by stacked borrows on code involving raw pointers + UnsafeCell

Open · newpavlov opened this issue 1 year ago • 6 comments

I was trying to minimize a massive slowdown observed in our code and got the following snippet:

#![feature(new_uninit)]
use std::sync::atomic::{AtomicU32, Ordering::*};

const PAGE_SIZE: usize = 1 << 12;
const PAGES: usize = 16;
const ITEMS: usize = PAGE_SIZE * PAGES;

type AtomicPage = [AtomicU32; PAGE_SIZE];

#[test]
fn init_atomic() {
    let pages: [*mut AtomicPage; PAGES] = [(); PAGES]
        .map(|()| unsafe { Box::<AtomicPage>::new_zeroed().assume_init() })
        .map(Box::into_raw);

    for i in 0..ITEMS {
        let page_idx = i / PAGE_SIZE;
        let inpage_idx = i % PAGE_SIZE;
        let page: &AtomicPage = unsafe { &*(pages[page_idx]) };
        page[inpage_idx].store(u32::MAX, Release);
    }

    for page_ptr in pages {
        drop(unsafe { Box::from_raw(page_ptr) });
    }
}

On my laptop I get the following results (more than 50x slowdown):

$ time MIRIFLAGS='-Zmiri-disable-stacked-borrows' cargo miri test

real	0m5,049s
user	0m4,786s
sys	0m0,265s
$ time cargo miri test

real	4m19,256s
user	4m18,831s
sys	0m0,334s

Replacing [*mut AtomicPage; PAGES] with [Box<AtomicPage>; PAGES] reduces slowdown to less than 2x.

newpavlov · Jul 06 '24 04:07

Replacing [*mut AtomicPage; PAGES] with [Box<AtomicPage>; PAGES] reduces slowdown to less than 2x.

I cannot reproduce this result. Can you post the version that uses Box<AtomicPage>?

saethlin · Jul 06 '24 16:07

#[test]
fn init_atomic() {
    let pages = [(); PAGES].map(|()| unsafe { Box::<AtomicPage>::new_zeroed().assume_init() });

    for i in 0..ITEMS {
        let page_idx = i / PAGE_SIZE;
        let inpage_idx = i % PAGE_SIZE;
        pages[page_idx][inpage_idx].store(u32::MAX, Release);
    }
}
$ time MIRIFLAGS='-Zmiri-disable-stacked-borrows' cargo miri test

real	0m5,215s
user	0m4,970s
sys	0m0,245s

$ time cargo miri test

real	0m7,015s
user	0m6,743s
sys	0m0,274s

Interestingly, keeping the page variable results in a ~30x slowdown:

#[test]
fn init_atomic() {
    let pages = [(); PAGES].map(|()| unsafe { Box::<AtomicPage>::new_zeroed().assume_init() });

    for i in 0..ITEMS {
        let page_idx = i / PAGE_SIZE;
        let inpage_idx = i % PAGE_SIZE;
        let page: &AtomicPage = &pages[page_idx];
        page[inpage_idx].store(u32::MAX, Release);
    }
}
$ time MIRIFLAGS='-Zmiri-disable-stacked-borrows' cargo miri test

real	0m5,174s
user	0m4,862s
sys	0m0,314s

$ time cargo miri test

real	2m26,782s
user	2m26,401s
sys	0m0,361s

newpavlov · Jul 07 '24 03:07

Interestingly, keeping the page variable results in a ~30x slowdown:

There is an extremely subtle difference between those two programs. You're specifying the type to be &AtomicPage, but without the annotation that variable would be a &Box<AtomicPage>. Specifying that type changes the borrowing structure of the program by creating additional SharedReadWrite tags in the bytes of the AtomicPages, which stresses this loop:

https://github.com/rust-lang/miri/blob/8f40dd256384883b78700bc348e7f5f272f131ca/src/borrow_tracker/stacked_borrows/mod.rs#L207-L215
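
For illustration, here is a minimal standalone sketch of that difference (tiny sizes and names of my own, not the original test). The annotated binding goes through deref coercion, so it reborrows the interior-mutable array behind the Box rather than the Box itself:

use std::sync::atomic::AtomicU32;

type AtomicPage = [AtomicU32; 8]; // tiny "page", just for illustration

fn main() {
    let pages: [Box<AtomicPage>; 2] =
        [(), ()].map(|()| Box::new([const { AtomicU32::new(0) }; 8]));

    // Without an annotation the reference points at the Box itself, so this
    // does not reborrow the interior-mutable bytes behind the Box.
    let _page = &pages[0]; // inferred type: &Box<AtomicPage>

    // The annotation triggers deref coercion (effectively `&*pages[0]`),
    // which reborrows through the Box and pushes a fresh SharedReadWrite tag
    // onto the borrow stack of every byte of the AtomicPage.
    let _page: &AtomicPage = &pages[0];
}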

saethlin · Jul 07 '24 04:07

Hunh. In fact adding the type annotation at all, even if it's fully redundant, still produces an extra reborrow. That is new to me.

saethlin · Jul 07 '24 04:07

Ah of course, the main difficulty in debugging this is that the GC is doing its absolute best to cover up the problem. Perhaps a better minimization is this:

use std::cell::UnsafeCell;

const ITEMS: usize = 1 << 12;

fn main() {
    let page = [const { UnsafeCell::new(0u8) }; 4096];

    for _ in 0..ITEMS {
        // Each iteration creates a fresh shared reborrow of the whole array.
        let _page = &page;
    }
}

When run with -Zmiri-provenance-gc=0, this has clearly super-linear runtime: at 1 << 10 iterations I get 2.4 seconds, at 1 << 11 iterations 7.8 seconds, and at 1 << 12 iterations 25 seconds.

This is very similar to our slice-get-unchecked benchmark:

https://github.com/rust-lang/miri/blob/8f40dd256384883b78700bc348e7f5f272f131ca/bench-cargo-miri/slice-get-unchecked/src/main.rs#L5-L12

In both cases the program is stressing a code path that, in Stacked Borrows, is technically a linear search of ever-growing length.

But for this search we don't have a stack cache, so the only mitigation against a true runtime explosion is the GC.
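
To see why that search turns into super-linear runtime, here is a toy model (purely illustrative, not Miri's code) of a single byte's borrow stack under this access pattern: every reborrow first scans the stack from the top for its granting parent tag and then pushes a new item, so with the GC disabled the total work grows quadratically in the number of iterations.

// Toy model of one byte's borrow stack; hypothetical types, not Miri's code.
#[derive(Clone, Copy, PartialEq, Eq)]
struct Tag(usize);

fn main() {
    let parent = Tag(0);
    // The bottom of the stack is the parent tag (the Unique of `page`).
    let mut stack = vec![parent];
    let mut items_scanned: u64 = 0;

    for i in 1..=4096usize {
        // Each `&page` reborrow searches the stack top-down for the granting
        // parent tag; since the parent sits at the bottom, the whole stack is
        // scanned...
        let scanned = stack.iter().rev().position(|&t| t == parent).unwrap() + 1;
        items_scanned += scanned as u64;
        // ...and then a fresh item is pushed on top, so the next search is
        // one item longer. Without the GC trimming the stack, the total work
        // is quadratic in the number of iterations.
        stack.push(Tag(i));
    }

    println!("total items scanned: {items_scanned}");
}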

So I have two observations. First, all the borrow stacks here have a Unique base tag and are SharedReadWrite all the way up; if stacks had a bool that indicated they have this property, we could skip the loop that makes the runtime overhead here so dramatic. Second, I suspect this property might be pretty common: shouldn't most atomics only ever be accessed through SharedReadWrite?
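
As a rough sketch of the first observation, a flag along these lines could be cached per stack and used to skip the scan; all the names here are invented and do not match Miri's actual Stack type:

// Hypothetical sketch of the proposed per-stack flag; the real Stacked
// Borrows bookkeeping is elided.
#[derive(PartialEq, Eq)]
enum Permission {
    Unique,
    SharedReadWrite,
    // ...
}

struct Item {
    perm: Permission,
    // tag, protector, ...
}

struct Stack {
    items: Vec<Item>,
    /// Cached property: a Unique base item followed only by SharedReadWrite
    /// items. While this holds, the expensive scan over `items` could be
    /// skipped for accesses through SharedReadWrite tags.
    unique_base_all_shared_rw_above: bool,
}

impl Stack {
    fn recompute_flag(&mut self) {
        self.unique_base_all_shared_rw_above = matches!(
            self.items.split_first(),
            Some((base, rest))
                if base.perm == Permission::Unique
                    && rest.iter().all(|it| it.perm == Permission::SharedReadWrite)
        );
    }
}

fn main() {
    let mut stack = Stack {
        items: vec![
            Item { perm: Permission::Unique },
            Item { perm: Permission::SharedReadWrite },
            Item { perm: Permission::SharedReadWrite },
        ],
        unique_base_all_shared_rw_above: false,
    };
    stack.recompute_flag();
    assert!(stack.unique_base_all_shared_rw_above);
}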

I wanted to print the stacks for this program and I got this:

use std::cell::UnsafeCell;

extern "Rust" {
    fn miri_get_alloc_id(ptr: *const u8) -> u64;
    fn miri_print_borrow_state(alloc_id: u64, show_unnamed: bool);
}

fn get_alloc_id(ptr: *const u8) -> u64 {
    unsafe { miri_get_alloc_id(ptr) }
}

fn print_borrow_stacks(alloc_id: u64) {
    unsafe {
        miri_print_borrow_state(alloc_id, /* ignored: show_unnamed */ false)
    }
}

fn main() {
    let page = [const { UnsafeCell::new(0u8) }; 4096];
    let id = get_alloc_id(page.as_ptr().cast());
    print_borrow_stacks(id);
}
...
4091..4092: [ Unique<2134> SharedReadWrite<2135> SharedReadWrite<2136> SharedReadWrite<2139> SharedReadWrite<2140> ]
4092..4093: [ Unique<2134> SharedReadWrite<2135> SharedReadWrite<2136> SharedReadWrite<2139> SharedReadWrite<2140> ]
4093..4094: [ Unique<2134> SharedReadWrite<2135> SharedReadWrite<2136> SharedReadWrite<2139> SharedReadWrite<2140> ]
4094..4095: [ Unique<2134> SharedReadWrite<2135> SharedReadWrite<2136> SharedReadWrite<2139> SharedReadWrite<2140> ]
4095..4096: [ Unique<2134> SharedReadWrite<2135> SharedReadWrite<2136> SharedReadWrite<2139> SharedReadWrite<2140> ]

There is a different borrow stack for every byte, but they are all equal. I thought we had a mechanism to prevent this.

saethlin · Jul 07 '24 06:07

There is a different borrow stack for every byte, but they are all equal. I thought we had a mechanism to prevent this.

Yes, but that only kicks in when we iterate over the stacks and notice two adjacent stacks being equal. It also doesn't do arbitrarily much comparing and merging, to avoid unnecessary overhead.

But here nothing at all seems to be merged, not even the first bytes:

0..1: [ Unique<2201> SharedReadWrite<2202> SharedReadWrite<2203> SharedReadWrite<2206> SharedReadWrite<2207> ]
1..2: [ Unique<2201> SharedReadWrite<2202> SharedReadWrite<2203> SharedReadWrite<2206> SharedReadWrite<2207> ]
2..3: [ Unique<2201> SharedReadWrite<2202> SharedReadWrite<2203> SharedReadWrite<2206> SharedReadWrite<2207> ]
...

That is indeed strange.

RalfJung · Jul 07 '24 07:07