criterion.rs
iter_batched overhead
While doing some measurements I was thrown off by the way they changed depending on the setup. According to the docs, iter_batched "[t]imes a routine that requires some input by generating a batch of input, then timing the iteration of the benchmark over the input." The example given is a simple sort: b.iter_batched(|| data.clone(), |mut data| sort(&mut data), BatchSize::SmallInput).
Running an example like this on large inputs, however, yields a large overhead that depends on the size of the input data:
fn benchmarks(c: &mut Criterion) {
    for &n in [8000, 16000, 24000].iter() {
        let data = vec![0u8; n];
        c.bench_function(&format!("Operation on vec of size {}", n), |b| {
            b.iter_batched(|| data.clone(), |_data| {}, BatchSize::SmallInput)
        });
    }
}
Running this snippet gives me:
Operation on vec of size 8000
time: [405.16 ns 418.89 ns 430.70 ns]
Found 17 outliers among 100 measurements (17.00%)
17 (17.00%) low severe
Operation on vec of size 16000
time: [684.50 ns 697.79 ns 709.61 ns]
Found 20 outliers among 100 measurements (20.00%)
20 (20.00%) low mild
Operation on vec of size 24000
time: [1.0689 us 1.0979 us 1.1206 us]
Found 19 outliers among 100 measurements (19.00%)
19 (19.00%) low mild
This appears to be a pretty big overhead, which makes it impossible to measure small operations. It seems to be the cost of copying the Vec into the closure |_data| {}.
In short, I think you want to use iter_batched_ref, unless you absolutely must pass the vector by value.
Try these benchmarks...
use criterion::{criterion_group, criterion_main, BatchSize, Criterion};

criterion_main!(tests);
criterion_group!(
    tests,
    benchmarks,
    benchmarks_pass_ref,
    benchmarks_ref
);

fn benchmarks(c: &mut Criterion) {
    for &n in [8000, 16000, 24000].iter() {
        let data = vec![0u8; n];
        c.bench_function(&format!("batched: Operation on vec of size {}", n), |b| {
            b.iter_batched(|| data.clone(), |_data| {}, BatchSize::SmallInput)
        });
    }
}

fn benchmarks_pass_ref(c: &mut Criterion) {
    for &n in [8000, 16000, 24000].iter() {
        let data = vec![0u8; n];
        c.bench_function(
            &format!("batched: Operation on ref to vec of size {}", n),
            |b| b.iter_batched(|| &data, |_data| {}, BatchSize::SmallInput),
        );
    }
}

fn benchmarks_ref(c: &mut Criterion) {
    for &n in [8000, 16000, 24000].iter() {
        let data = vec![0u8; n];
        c.bench_function(
            &format!("batched_ref: Operation on vec of size {}", n),
            |b| b.iter_batched_ref(|| data.clone(), |_data| {}, BatchSize::SmallInput),
        );
    }
}
Doing this I get:
batched: Operation on vec of size 8000
time: [390.65 ns 458.90 ns 533.21 ns]
batched: Operation on vec of size 16000
time: [691.45 ns 768.81 ns 844.51 ns]
batched: Operation on vec of size 24000
time: [818.16 ns 878.17 ns 943.36 ns]
batched: Operation on ref to vec of size 8000
time: [725.79 ps 757.43 ps 801.51 ps]
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
batched: Operation on ref to vec of size 16000
time: [745.11 ps 798.38 ps 855.88 ps]
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high severe
batched: Operation on ref to vec of size 24000
time: [753.82 ps 805.82 ps 862.83 ps]
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
batched_ref: Operation on vec of size 8000
time: [8.7324 ps 9.4074 ps 10.290 ps]
Found 12 outliers among 100 measurements (12.00%)
4 (4.00%) high mild
8 (8.00%) high severe
batched_ref: Operation on vec of size 16000
time: [18.038 ps 19.569 ps 21.523 ps]
Found 13 outliers among 100 measurements (13.00%)
5 (5.00%) high mild
8 (8.00%) high severe
batched_ref: Operation on vec of size 24000
time: [26.565 ps 28.632 ps 31.292 ps]
Found 14 outliers among 100 measurements (14.00%)
4 (4.00%) high mild
10 (10.00%) high severe
Yeah I also think that it's coming from copying the vector. It's still a little unexpected. The ref version is definitely better but not necessarily what I want to use if the setup information really has to be copied/fresh.
So I think what I want to say is that I find the behavior a little unexpected, and the documentation doesn't really mention the problem. In fact, it suggests using it exactly like this. Maybe it would be good to extend the documentation in a way that makes it clear that the measurements include the time needed to copy the setup into the routine. The description currently gives the following timing model:
elapsed = (Instant::now * num_batches) + (iters * (routine + O::drop)) + Vec::extend
I actually encountered a similar unexpected result. Consider the following benchmark:
pub fn criterion_benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("criterion");
    let len = 1500;
    let mut node = vec![0_u8; len];
    let data = vec![42_u8; 10 * len];
    group.bench_function("no batch",
        |b| b.iter(|| change(&data, &mut node, len)));
    group.bench_function("BatchSize::LargeInput",
        |b| b.iter_batched_ref(|| data.clone(),
            |data| change(&data, &mut node, len), BatchSize::LargeInput));
    group.bench_function("BatchSize::SmallInput",
        |b| b.iter_batched_ref(|| data.clone(),
            |data| change(&data, &mut node, len), BatchSize::SmallInput));
    group.finish();
}

pub fn change(data: &[u8], node: &mut [u8], len: usize) {
    for curr in data.chunks_exact(len) {
        for (a, b) in node.iter_mut().zip(curr.iter()) {
            *a ^= b;
        }
    }
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
From my understanding of the documentation, BatchSize::SmallInput should incur less measurement overhead than BatchSize::LargeInput, but I get the opposite behavior:
criterion/no batch time: [554.98 ns 556.80 ns 558.85 ns]
criterion/BatchSize::LargeInput
time: [668.32 ns 673.33 ns 677.91 ns]
criterion/BatchSize::SmallInput
time: [1.1419 us 1.1478 us 1.1550 us]
Do you have an idea why this could be the case? I was wondering if perhaps the data clones are implicitly dropped in each iteration, so I tried to explicitly return them to ensure "drop collection" as with iter_with_large_drop, but that resulted in a lifetime error.
I was seeing similar issues so I tried running the benchmarks 3 ways:
- iter_custom with the clone outside the timing
- iter_custom with the clone inside the timing
- iter_batched

And it looked like the second and third timings matched and the first was much "faster". So it seems like iter_batched is including the setup function in the timing. Sorry if this is what y'all were saying, but "copying the Vec into the closure" sounded more like move semantics to me than the setup function, so I probably just got confused.
But if it's the case that iter_batched is accidentally timing the setup function, then that feels like a bug 🤔
Rough Code
The three benchmarks were roughly:
group.bench_function(BenchmarkId::new("find", i), |b| {
    b.iter_custom(|iters| {
        let mut time = std::time::Duration::new(0, 0);
        for _ in 0..iters {
            let t = tree.clone(); // <------ `.clone()` before we start timer
            let instant = std::time::Instant::now();
            let _value = black_box(t.find(i));
            let elapsed = instant.elapsed();
            time += elapsed;
        }
        time
    })
});

group.bench_function(BenchmarkId::new("find_with_clone", i), |b| {
    b.iter_custom(|iters| {
        let mut time = std::time::Duration::new(0, 0);
        for _ in 0..iters {
            let instant = std::time::Instant::now();
            let t = tree.clone(); // <------------ `.clone()` after we start timer
            let _value = black_box(t.find(i));
            let elapsed = instant.elapsed();
            time += elapsed;
        }
        time
    })
});

group.bench_function(BenchmarkId::new("find_batched", i), |b| {
    b.iter_batched(
        || tree.clone(),
        |t| { let _value = black_box(t.find(i)); },
        criterion::BatchSize::LargeInput,
    )
});
And my results were:
unbalanced/find/0 time: [36.729 ns 39.660 ns 42.969 ns]
unbalanced/find_with_clone/0
time: [1.7585 ms 1.7668 ms 1.7764 ms]
unbalanced/find_batched/0
time: [1.7780 ms 1.7853 ms 1.7941 ms]