bencher::black_box does not stop benchmark(s) from being optimized out
Hi, it looks to me like the benchmarks in this crate need a bit of love. I'd like to help but I need some input.
Currently (e209a5076a7e1af16a8c3e6e587c9faa6fc7553b) cargo bench slice tells me:
test extend_from_slice ... bench: 6 ns/iter (+/- 1) = 85333 MB/s
test extend_with_slice ... bench: 1 ns/iter (+/- 0) = 512000 MB/s
That doesn't seem right. Extending with an iterator over a slice shouldn't be faster than extending with a slice directly. Here is what perf tells me about extend_with_slice:
1.86 │190:┌─→movups %xmm0,-0x134(%rbp)
│ │extend::extend_with_slice::{{closure}}:
│ │b.iter(|| {
│ │v.clear();
│ │let iter = data.iter().map(|&x| x);
│ │v.extend(iter);
│ │v[511]
│ │ movzbl -0x125(%rbp),%eax
│ │extend::extend_with_slice:
8.26 │ │ mov %al,-0x9(%rbp)
│ │core::ptr::read_volatile:
│ │ movzbl -0x9(%rbp),%eax
│ │core::cmp::impls::<impl core::cmp::PartialOrd for u64>::lt:
89.88 │ │ add $0xffffffffffffffff,%rcx
│ │<core::ops::range::Range<T> as core::iter::range::RangeIteratorImpl>::spec_next:
│ └──jne 190
That looks to me like the extend is gone and this just updates some internal benchmark counter.
I cannot devise a way to prevent this optimization using bencher::black_box. If I apply the most aggressive black-boxing I can think of:
diff --git a/benches/extend.rs b/benches/extend.rs
index ba33a93..de24f57 100644
--- a/benches/extend.rs
+++ b/benches/extend.rs
@@ -37,9 +37,12 @@ fn extend_with_slice(b: &mut Bencher) {
let mut v = ArrayVec::<u8, 512>::new();
let data = [1; 512];
b.iter(|| {
+ black_box(&v);
v.clear();
+ black_box(&v);
let iter = data.iter().map(|&x| x);
v.extend(iter);
+ black_box(&v);
v[511]
});
b.bytes = v.capacity() as u64;
If this were optimized well, we should get performance equal to extend_from_slice, but we don't. It looks to me like we end up with a loop that copies each item independently, for a ~36x regression. Ow. And if that isn't bad enough, setting codegen-units = 1 in profile.bench again lets LLVM optimize away the benchmark. Unfortunately the structure of these benchmarks forbids doing something like v = black_box(v);, and if I restructure them to accommodate that, the benchmarks are dominated by bencher::black_box.
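To make the by-reference versus by-value distinction concrete: the perf listing shows core::ptr::read_volatile inside the loop, which suggests bencher::black_box boils down to a volatile read of its argument. Assuming that, here is a minimal std-only sketch. The names volatile_black_box, fill_by_reference and fill_by_value are hypothetical, and a plain [u8; 512] stands in for ArrayVec<u8, 512>.

```rust
use std::{mem, ptr};

pub const N: usize = 512;

/// Hypothetical stand-in for a volatile-read style black box (an assumption
/// about how a stable black box can be written, not a copy of any crate's code).
pub fn volatile_black_box<T>(dummy: T) -> T {
    // SAFETY: `dummy` is a valid, initialized T. It is read exactly once and the
    // original is forgotten, so the value is never dropped twice.
    unsafe {
        let ret = ptr::read_volatile(&dummy);
        mem::forget(dummy);
        ret
    }
}

/// By reference: the volatile read covers only the reference (one pointer),
/// not the N bytes behind it.
pub fn fill_by_reference(buf: &mut [u8; N], data: &[u8; N]) -> u8 {
    for (dst, src) in buf.iter_mut().zip(data.iter()) {
        *dst = *src;
    }
    volatile_black_box(&*buf);
    buf[N - 1]
}

/// By value: all N bytes pass through the volatile read, which is extra work
/// on top of the fill being measured.
pub fn fill_by_value(mut buf: [u8; N], data: &[u8; N]) -> ([u8; N], u8) {
    for (dst, src) in buf.iter_mut().zip(data.iter()) {
        *dst = *src;
    }
    let buf = volatile_black_box(buf);
    let last = buf[N - 1];
    (buf, last)
}
```

If the assumption holds, it would be consistent with both observations: the by-reference form leaves the buffer contents unprotected, and the by-value form adds a second pass over the whole buffer to every iteration.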
If I instead use the clobber-all-memory black box, core::hint::black_box, as in the diff below, no combination of codegen-units = 1, lto = true, and panic = "abort" will induce LLVM to optimize away the benchmark:
diff --git a/benches/extend.rs b/benches/extend.rs
index ba33a93..0decd87 100644
--- a/benches/extend.rs
+++ b/benches/extend.rs
@@ -1,4 +1,4 @@
-
+#![feature(bench_black_box)]
extern crate arrayvec;
#[macro_use] extern crate bencher;
@@ -7,7 +7,7 @@ use std::io::Write;
use arrayvec::ArrayVec;
use bencher::Bencher;
-use bencher::black_box;
+use core::hint::black_box;
fn extend_with_constant(b: &mut Bencher) {
let mut v = ArrayVec::<u8, 512>::new();
@@ -40,6 +40,7 @@ fn extend_with_slice(b: &mut Bencher) {
v.clear();
let iter = data.iter().map(|&x| x);
v.extend(iter);
+ black_box(&v);
v[511]
});
b.bytes = v.capacity() as u64;
So as far as I can tell, this crate needs nightly for its benchmarks to function. Does this all check out? And/or is this repo up for requiring nightly for benchmarking?
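In case it helps the discussion, here is a std-only sketch of the shape I have in mind. It is not the crate's benchmark: Vec<u8> stands in for ArrayVec<u8, 512>, a hand-rolled timing loop stands in for Bencher, and extend_once and time_per_iter are hypothetical names. If your toolchain gates black_box behind bench_black_box as in the diff above, add that feature attribute at the crate root.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for one benchmark iteration: clear, extend from an
/// iterator over a slice, then show the filled container to the optimizer.
pub fn extend_once(v: &mut Vec<u8>, data: &[u8]) -> u8 {
    v.clear();
    v.extend(data.iter().map(|&x| x));
    // Pass a reference to the filled container through the barrier so the
    // stores made by `extend` are treated as observable.
    let _ = black_box(&*v);
    v[data.len() - 1]
}

/// Hypothetical hand-rolled timing loop; returns the mean time per iteration.
pub fn time_per_iter(iters: u32) -> Duration {
    assert!(iters > 0, "need at least one iteration");
    let data = [1u8; 512];
    let mut v: Vec<u8> = Vec::with_capacity(data.len());
    let start = Instant::now();
    for _ in 0..iters {
        // Also black-box the returned byte so the call itself is kept.
        black_box(extend_once(&mut v, &data));
    }
    start.elapsed() / iters
}
```

Calling time_per_iter(1_000_000) from a small main and printing the result with {:?} gives a rough per-iteration figure to compare against the bencher output above.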