
iter_batched overhead

Open franziskuskiefer opened this issue 3 years ago • 4 comments

While doing some measurements I was thrown off by the way the results changed depending on the setup. According to the docs, iter_batched "[t]imes a routine that requires some input by generating a batch of input, then timing the iteration of the benchmark over the input." The example given is a simple sort: b.iter_batched(|| data.clone(), |mut data| sort(&mut data), BatchSize::SmallInput).

Running an example like this on large inputs, however, yields a large overhead that grows with the size of the input data.

use criterion::{BatchSize, Criterion};

fn benchmarks(c: &mut Criterion) {
    for &n in [8000, 16000, 24000].iter() {
        let data = vec![0u8; n];
        c.bench_function(&format!("Operation on vec of size {}", n), |b| {
            b.iter_batched(|| data.clone(), |_data| {}, BatchSize::SmallInput)
        });
    }
}

Running this snippet gives me:

Operation on vec of size 8000                                                                             
                        time:   [405.16 ns 418.89 ns 430.70 ns]
Found 17 outliers among 100 measurements (17.00%)
  17 (17.00%) low severe

Operation on vec of size 16000                                                                             
                        time:   [684.50 ns 697.79 ns 709.61 ns]
Found 20 outliers among 100 measurements (20.00%)
  20 (20.00%) low mild

Operation on vec of size 24000                                                                             
                        time:   [1.0689 us 1.0979 us 1.1206 us]
Found 19 outliers among 100 measurements (19.00%)
  19 (19.00%) low mild

This appears to be a pretty big overhead, which makes it impossible to measure small operations.

franziskuskiefer avatar May 05 '21 11:05 franziskuskiefer

This seems to be the overhead of copying the Vec into the closure |_data| {}.

In short, I think you want to use iter_batched_ref, unless you absolutely must pass the vector by value.
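
My reading of why (a sketch of the semantics, not of Criterion's internals): the routine passed to iter_batched consumes its input, so the cloned Vec is dropped inside the timed closure, whereas iter_batched_ref only lends the input out and, as far as I can tell, cleans it up outside the timed region.

b.iter_batched(
    || data.clone(),
    |_data: Vec<u8>| {}, // the clone is moved in and dropped here, inside the timing
    BatchSize::SmallInput,
);
b.iter_batched_ref(
    || data.clone(),
    |_data: &mut Vec<u8>| {}, // only a borrow; the clone is dropped later, untimed
    BatchSize::SmallInput,
);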

Try these benchmarks...

use criterion::{criterion_group, criterion_main, BatchSize, Criterion};

criterion_main!(tests);
criterion_group!(
    tests,
    benchmarks,
    benchmarks_pass_ref,
    benchmarks_ref
);

fn benchmarks(c: &mut Criterion) {
    for &n in [8000, 16000, 24000].iter() {
        let data = vec![0u8; n];
        c.bench_function(&format!("batched: Operation on vec of size {}", n), |b| {
            b.iter_batched(|| data.clone(), |_data| {}, BatchSize::SmallInput)
        });
    }
}

fn benchmarks_pass_ref(c: &mut Criterion) {
    for &n in [8000, 16000, 24000].iter() {
        let data = vec![0u8; n];
        c.bench_function(
            &format!("batched: Operation on ref to vec of size {}", n),
            |b| b.iter_batched(|| &data, |_data| {}, BatchSize::SmallInput),
        );
    }
}

fn benchmarks_ref(c: &mut Criterion) {
    for &n in [8000, 16000, 24000].iter() {
        let data = vec![0u8; n];
        c.bench_function(
            &format!("batched_ref: Operation on vec of size {}", n),
            |b| b.iter_batched_ref(|| data.clone(), |_data| {}, BatchSize::SmallInput),
        );
    }
}

Doing this I get:

batched: Operation on vec of size 8000
                        time:   [390.65 ns 458.90 ns 533.21 ns]

batched: Operation on vec of size 16000
                        time:   [691.45 ns 768.81 ns 844.51 ns]

batched: Operation on vec of size 24000
                        time:   [818.16 ns 878.17 ns 943.36 ns]

batched: Operation on ref to vec of size 8000
                        time:   [725.79 ps 757.43 ps 801.51 ps]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

batched: Operation on ref to vec of size 16000
                        time:   [745.11 ps 798.38 ps 855.88 ps]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

batched: Operation on ref to vec of size 24000
                        time:   [753.82 ps 805.82 ps 862.83 ps]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

batched_ref: Operation on vec of size 8000
                        time:   [8.7324 ps 9.4074 ps 10.290 ps]
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

batched_ref: Operation on vec of size 16000
                        time:   [18.038 ps 19.569 ps 21.523 ps]
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe

batched_ref: Operation on vec of size 24000
                        time:   [26.565 ps 28.632 ps 31.292 ps]
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe

rnleach avatar May 05 '21 14:05 rnleach

Yeah, I also think it's coming from copying the vector. It's still a little unexpected. The ref version is definitely better, but it's not necessarily what I want to use if the setup data really has to be copied fresh for each run.
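
(From my reading of the docs, though, iter_batched_ref still runs the setup closure once per iteration, so the routine does see freshly cloned data every time; only the move into, and the drop inside, the timed routine goes away. A minimal sketch, assuming a hypothetical mutating sort routine:)

b.iter_batched_ref(
    || data.clone(),   // still runs once per iteration, outside the timed region
    |data| sort(data), // data: &mut Vec<u8>, freshly cloned every time
    BatchSize::SmallInput,
);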

So what I want to say is that I find this behavior a little unexpected, and the documentation doesn't really mention the problem. In fact, it suggests using iter_batched exactly like this. Maybe it would be good to extend the documentation to make it clear that the measurements include the time needed to copy the setup data into the routine. The description currently gives the following timing model:

elapsed = (Instant::now * num_batches) + (iters * (routine + O::drop)) + Vec::extend
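
Taking that formula literally, one batch might look roughly like the following (my interpretation, not Criterion's actual source):

use std::time::{Duration, Instant};

// `setup` and `routine` stand for the two closures given to iter_batched.
fn one_batch<I, O>(
    setup: &mut impl FnMut() -> I,
    routine: &mut impl FnMut(I) -> O,
    batch_size: usize,
    elapsed: &mut Duration,
) {
    // The inputs are generated before the timer starts, so the clone
    // itself is untimed...
    let inputs: Vec<I> = (0..batch_size).map(|_| setup()).collect();
    let mut outputs: Vec<O> = Vec::with_capacity(batch_size);
    let start = Instant::now(); // the "Instant::now * num_batches" term
    // ...but each input is consumed by the routine, so dropping the cloned
    // data happens inside the timed region as part of "routine"; the
    // "O::drop" and "Vec::extend" terms land in here as well.
    outputs.extend(inputs.into_iter().map(|input| routine(input)));
    drop(outputs);
    *elapsed += start.elapsed();
}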

franziskuskiefer avatar May 06 '21 07:05 franziskuskiefer

I actually encountered a similar unexpected result. Consider the following benchmark:

use criterion::{criterion_group, criterion_main, BatchSize, Criterion};

pub fn criterion_benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("criterion");
    let len = 1500;
    let mut node = vec![0_u8; len];
    let data = vec![42_u8; 10*len];
    group.bench_function("no batch",
        |b| b.iter(|| change(&data, &mut node, len)));
    group.bench_function("BatchSize::LargeInput",
        |b| b.iter_batched_ref(|| data.clone(),
            |data| change(&data, &mut node, len), BatchSize::LargeInput));
    group.bench_function("BatchSize::SmallInput",
        |b| b.iter_batched_ref(|| data.clone(),
            |data| change(&data, &mut node, len), BatchSize::SmallInput));
    group.finish();
}

pub fn change(data: &[u8], node: &mut [u8], len: usize) {
    for curr in data.chunks_exact(len) {
        for (a,b) in node.iter_mut().zip(curr.iter()) {
            *a ^= b;
        }
    }
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

From my understanding of the documentation, BatchSize::SmallInput should incur less measurement overhead than BatchSize::LargeInput, but I actually observe the opposite behavior:

criterion/no batch      time:   [554.98 ns 556.80 ns 558.85 ns]                                
criterion/BatchSize::LargeInput                                                                             
                        time:   [668.32 ns 673.33 ns 677.91 ns]
criterion/BatchSize::SmallInput                                                                             
                        time:   [1.1419 us 1.1478 us 1.1550 us]

Do you have any idea why this could be the case? I was wondering whether the data clones are implicitly dropped within each timed iteration, so I tried explicitly returning them from the routine to defer the drops, as with iter_with_large_drop, but that resulted in a lifetime error.
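
For context, my understanding of iter_with_large_drop is that it times the routine but keeps the outputs alive until after the timed region, so their drops are excluded; the shape I was trying to imitate is roughly:

// Times only the clone; the cloned Vecs are accumulated and dropped
// after the timed region instead of inside it.
b.iter_with_large_drop(|| data.clone());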

ialokim avatar May 18 '21 18:05 ialokim

I was seeing similar issues, so I tried running the benchmarks three ways:

  1. iter_custom with the clone outside the timing
  2. iter_custom with the clone inside the timing
  3. iter_batched

It looked like the second and third timings matched, and the first was much "faster". So it seems like iter_batched is including the setup function in the timing. Sorry if this is what y'all were saying, but "copying the Vec into the closure" sounded more like move semantics to me than the setup function, so I probably just got confused.

But if iter_batched is accidentally timing the setup function, then that feels like a bug 🤔

Rough Code

The three benchmarks were roughly:

group.bench_function(BenchmarkId::new("find", i), |b| {
    b.iter_custom(|iters| {
        let mut time = std::time::Duration::new(0, 0);
        for _ in 0..iters {
            let t = tree.clone(); // <------ `.clone()` before we start timer
            let instant = std::time::Instant::now();
            let _value = black_box(t.find(i));
            let elapsed = instant.elapsed();
            time += elapsed;
        }
        time
    })
});
group.bench_function(BenchmarkId::new("find_with_clone", i), |b| {
    b.iter_custom(|iters| {
        let mut time = std::time::Duration::new(0, 0);
        for _ in 0..iters {
            let instant = std::time::Instant::now();
            let t = tree.clone(); // <------------ `.clone()` after we start timer
            let _value = black_box(t.find(i));
            let elapsed = instant.elapsed();
            time += elapsed;
        }
        time
    })
});
group.bench_function(BenchmarkId::new("find_batched", i), |b| {
    b.iter_batched(
        || tree.clone(),
        |t| { let _value = black_box(t.find(i)); },
        criterion::BatchSize::LargeInput,
    )
});

And my results were:

unbalanced/find/0       time:   [36.729 ns 39.660 ns 42.969 ns]
unbalanced/find_with_clone/0
                        time:   [1.7585 ms 1.7668 ms 1.7764 ms]
unbalanced/find_batched/0
                        time:   [1.7780 ms 1.7853 ms 1.7941 ms]
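
One reading that would be consistent with the earlier comments (an assumption on my part, not verified against Criterion's source): since the routine takes t by value, the cloned tree is dropped inside the timed closure, and for a pointer-heavy structure that teardown can cost about as much as the clone, which would make find_batched look as if it were timing the setup. A hedged sketch of a variant that keeps both the clone and its drop out of the timed region:

group.bench_function(BenchmarkId::new("find_batched_ref", i), |b| {
    b.iter_batched_ref(
        || tree.clone(), // setup still runs once per iteration, untimed
        // `t` is only borrowed here, so the cloned tree is dropped
        // outside the timed region rather than at the end of this closure.
        |t| { let _value = black_box(t.find(i)); },
        criterion::BatchSize::LargeInput,
    )
});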

mlodato517 avatar Oct 20 '21 01:10 mlodato517