Weird different benchmark results for code that should be fairly identical
I have some benchmarks that look like this:
use std::mem::MaybeUninit;

fn main() {
    let _ = memcache::CRATE_USED;
    divan::main();
}

fn weird_results_impl(b: divan::Bencher, size: usize) {
    const NUM_ITEMS: usize = 100_000;
    const CAPACITY: usize = NUM_ITEMS;

    let cache = vec![Default::default(); CAPACITY];
    let values = (0..NUM_ITEMS)
        .map(|_| vec![std::mem::MaybeUninit::<u8>::uninit(); size].into_boxed_slice())
        .collect::<Vec<_>>();

    b.counter(divan::counter::ItemsCount::new(NUM_ITEMS))
        .with_inputs(|| {
            (
                cache.clone(),
                values
                    .iter()
                    .enumerate()
                    .map(|(idx, v)| (idx % CAPACITY, v.clone()))
                    .collect::<Vec<_>>(),
            )
        })
        .bench_local_refs(|(cache, refs)| {
            for (entry, mem) in refs {
                cache[*entry] = std::mem::take(mem);
            }
        });
}

#[divan::bench]
fn weird_results_4kib(b: divan::Bencher) {
    weird_results_impl(b, 4 * 1024);
}

#[divan::bench]
fn weird_results_10b(b: divan::Bencher) {
    weird_results_impl(b, 10);
}
There's a fairly large discrepancy between the two:
my-crate                fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ weird_results_4kib   165.4 µs      │ 211.1 µs      │ 173.5 µs      │ 174.8 µs      │ 100     │ 100
│                       604.2 Mitem/s │ 473.5 Mitem/s │ 576.2 Mitem/s │ 571.8 Mitem/s │         │
╰─ weird_results_10b    80.53 µs      │ 110.5 µs      │ 83.22 µs      │ 84.07 µs      │ 100     │ 100
                        1.241 Gitem/s │ 904.2 Mitem/s │ 1.201 Gitem/s │ 1.189 Gitem/s │         │
This was run with mimalloc set as the allocator. AFAICT I'm not dropping any memory within the benchmark loop, and the loop body shouldn't be doing anything more than shuffling some pointers around, so the amount of work should be the same in both runs. Is there something wrong with my benchmark, or is this a bug in divan?
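As a sanity check on the "only shuffling pointers" claim, here is a minimal std-only sketch, independent of divan and mimalloc. The helper name `move_into_cache` and the tiny item count are made up for illustration. It records each buffer's heap address before and after the move and checks that they match for both sizes, which is what you would expect if only the (pointer, length) pair is moved.

```rust
use std::mem::MaybeUninit;

/// Hypothetical helper mirroring the benchmark body: moves every buffer
/// out of `refs` into its slot in `cache`.
fn move_into_cache(
    cache: &mut [Box<[MaybeUninit<u8>]>],
    refs: &mut [(usize, Box<[MaybeUninit<u8>]>)],
) {
    for (entry, mem) in refs.iter_mut() {
        // `take` leaves an empty boxed slice behind and returns the original box.
        cache[*entry] = std::mem::take(mem);
    }
}

fn main() {
    for size in [10usize, 4 * 1024] {
        let mut cache: Vec<Box<[MaybeUninit<u8>]>> = vec![Default::default(); 4];
        let mut refs: Vec<_> = (0..4)
            .map(|idx| (idx, vec![MaybeUninit::<u8>::uninit(); size].into_boxed_slice()))
            .collect();

        // Heap address of each buffer before the move.
        let before: Vec<*const MaybeUninit<u8>> =
            refs.iter().map(|(_, b)| b.as_ptr()).collect();

        move_into_cache(&mut cache, &mut refs);

        // Heap address of each buffer after the move.
        let after: Vec<*const MaybeUninit<u8>> =
            cache.iter().map(|b| b.as_ptr()).collect();

        // Same addresses: the buffer contents were not copied.
        assert_eq!(before, after);
        assert!(refs.iter().all(|(_, b)| b.is_empty()));
        println!("size {size}: {} buffers moved without copying", cache.len());
    }
}
```

This only shows that the loop body does not copy the buffers. It says nothing about allocator or CPU cache effects, which may still differ between the two sizes.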
I'm not able to reproduce your results when I don't add memcache or mimalloc. I can try again later with those added.
Also, you can use NUM_ITEMS directly since usize implements IntoCounter<Counter = ItemsCount>:
- b.counter(divan::counter::ItemsCount::new(NUM_ITEMS))
+ b.counter(NUM_ITEMS)
memcache is the name of my own crate and can be ignored. Strange that you're not seeing it; mimalloc might be needed to make the difference more obvious. You must have a faster machine for this benchmark, since my 13900 doesn't get that fast with the standard allocator.
My suspicion is that:
The benchmarks vary only in the size parameter, which determines the length of each slice in values and therefore makes the clone inside .with_inputs(..) more expensive.
The with_inputs time seems to be included in the measured benchmark time, hence the large difference between the two benchmarks.
See #55
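One way to test this suspicion outside divan is to time the input setup and the loop body separately with `std::time::Instant`. The sketch below is std-only and makes a few assumptions: it uses plain `u8` buffers instead of `MaybeUninit<u8>`, a smaller item count, and a single sample, and the function names `make_input`, `run_body` and `time_once` are made up for illustration. If the setup time grows with size while the body time stays roughly flat, that would be consistent with setup cost leaking into the reported numbers.

```rust
use std::time::{Duration, Instant};

/// Builds the per-sample input the way the `with_inputs` closure does:
/// every buffer is deep-cloned, so the cost grows with the buffer size.
fn make_input(values: &[Box<[u8]>], capacity: usize) -> Vec<(usize, Box<[u8]>)> {
    values
        .iter()
        .enumerate()
        .map(|(idx, v)| (idx % capacity, v.clone()))
        .collect()
}

/// The body under test: moves each buffer into its cache slot.
fn run_body(cache: &mut [Box<[u8]>], refs: &mut [(usize, Box<[u8]>)]) {
    for (entry, mem) in refs.iter_mut() {
        cache[*entry] = std::mem::take(mem);
    }
}

/// Returns (setup time, body time) for a single sample.
fn time_once(size: usize, num_items: usize) -> (Duration, Duration) {
    let values: Vec<Box<[u8]>> = (0..num_items)
        .map(|_| vec![0u8; size].into_boxed_slice())
        .collect();
    let mut cache: Vec<Box<[u8]>> = vec![Default::default(); num_items];

    // Time the input construction on its own.
    let start = Instant::now();
    let mut refs = make_input(&values, num_items);
    let setup = start.elapsed();

    // Time only the loop body.
    let start = Instant::now();
    run_body(&mut cache, &mut refs);
    let body = start.elapsed();

    (setup, body)
}

fn main() {
    for size in [10, 4 * 1024] {
        let (setup, body) = time_once(size, 10_000);
        println!("size {size:>4}: setup {setup:?}, body {body:?}");
    }
}
```

A single unwarmed sample like this is noisy, so treat the numbers as a rough indication rather than a replacement for the divan measurements.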