bencher::black_box does not stop benchmark(s) from being optimized out

Open saethlin opened this issue 4 years ago • 0 comments

Hi, it looks to me like the benchmarks in this crate need a bit of love. I'd like to help but I need some input.

Currently (e209a5076a7e1af16a8c3e6e587c9faa6fc7553b) cargo bench slice tells me:

test extend_from_slice    ... bench:           6 ns/iter (+/- 1) = 85333 MB/s
test extend_with_slice    ... bench:           1 ns/iter (+/- 0) = 512000 MB/s

That doesn't seem right. Extending with an iterator over a slice shouldn't be faster than extending with a slice directly. Here is what perf tells me about extend_with_slice:

  1.86 │190:┌─→movups %xmm0,-0x134(%rbp)
       │    │extend::extend_with_slice::{{closure}}:
       │    │b.iter(|| {
       │    │v.clear();
       │    │let iter = data.iter().map(|&x| x);
       │    │v.extend(iter);
       │    │v[511]
       │    │  movzbl -0x125(%rbp),%eax
       │    │extend::extend_with_slice:
  8.26 │    │  mov    %al,-0x9(%rbp)
       │    │core::ptr::read_volatile:
       │    │  movzbl -0x9(%rbp),%eax
       │    │core::cmp::impls::<impl core::cmp::PartialOrd for u64>::lt:
 89.88 │    │  add    $0xffffffffffffffff,%rcx
       │    │<core::ops::range::Range<T> as core::iter::range::RangeIteratorImpl>::spec_next:
       │    └──jne    190

That looks to me like the extend is gone and this just updates some internal benchmark counter.

I cannot devise a way to prevent this optimization using bencher::black_box. If I apply the most aggressive black-boxing I can think of:

diff --git a/benches/extend.rs b/benches/extend.rs
index ba33a93..de24f57 100644
--- a/benches/extend.rs
+++ b/benches/extend.rs
@@ -37,9 +37,12 @@ fn extend_with_slice(b: &mut Bencher) {
     let mut v = ArrayVec::<u8, 512>::new();
     let data = [1; 512];
     b.iter(|| {
+        black_box(&v);
         v.clear();
+        black_box(&v);
         let iter = data.iter().map(|&x| x);
         v.extend(iter);
+        black_box(&v);
         v[511]
     });
     b.bytes = v.capacity() as u64;

If this is optimized well, we should get performance equal to extend_from_slice, but we don't. It looks to me like we end up with a loop that copies each item independently, for a ~36x regression. Ow. But if that isn't bad enough, setting codegen-units = 1 in profile.bench again lets LLVM optimize away the benchmark. And unfortunately the structure of these benchmarks forbids doing something like v = black_box(v);, but if I restructure them to accommodate that, benchmarks at dominated by bencher::black_box.

If I use the clobber-all-memory black box with this diff, no combination of codegen-units = 1, lto = true, and panic = "abort" will induce LLVM to optimize away the benchmark:

diff --git a/benches/extend.rs b/benches/extend.rs
index ba33a93..0decd87 100644
--- a/benches/extend.rs
+++ b/benches/extend.rs
@@ -1,4 +1,4 @@
-
+#![feature(bench_black_box)]
 extern crate arrayvec;
 #[macro_use] extern crate bencher;

@@ -7,7 +7,7 @@ use std::io::Write;
 use arrayvec::ArrayVec;

 use bencher::Bencher;
-use bencher::black_box;
+use core::hint::black_box;

 fn extend_with_constant(b: &mut Bencher) {
     let mut v = ArrayVec::<u8, 512>::new();
@@ -40,6 +40,7 @@ fn extend_with_slice(b: &mut Bencher) {
         v.clear();
         let iter = data.iter().map(|&x| x);
         v.extend(iter);
+        black_box(&v);
         v[511]
     });
     b.bytes = v.capacity() as u64;

So as far as I can tell, this crate needs nightly for its benchmarks to function. Does this all check out? And/or is this repo up for requiring nightly for benchmarking?

Sep 26 '21 03:09 saethlin