Performance regression for `HistoryBuf(fer)::write()` with heapless 0.9.1
Description
I've been using heapless's HistoryBuffer (now renamed to HistoryBuf) as a sliding search window with much success. More precisely, I have an iterator over a substantial number of bytes and need to find a particular sequence, so I'm doing something akin to the following:
```rust
let mut search_window: HistoryBuf<u8, 11> = HistoryBuf::new();
for byte in lots_of_bytes {
    search_window.write(byte);
    if search_window.oldest_ordered().eq(b"some-needle") {
        // Needle found; do something useful
        return;
    }
}
```
While I'm not using heapless in its intended embedded use case, I found its HistoryBuffer to be much faster than using a VecDeque with a manual pop_front()/push_back() or similar alternatives.
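For reference, the VecDeque-based alternative looks roughly like this (a sketch from memory; the `find_needle` helper name and exact window handling are illustrative, not production code):

```rust
use std::collections::VecDeque;

// Sliding-window search with a manually maintained VecDeque: keep the
// window at `needle.len()` bytes via pop_front()/push_back() and compare
// the window against the needle on every step.
fn find_needle(bytes: impl IntoIterator<Item = u8>, needle: &[u8]) -> bool {
    let mut window: VecDeque<u8> = VecDeque::with_capacity(needle.len());
    for byte in bytes {
        if window.len() == needle.len() {
            window.pop_front();
        }
        window.push_back(byte);
        if window.iter().eq(needle.iter()) {
            return true;
        }
    }
    false
}
```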
However, with heapless 0.9.1, there seems to be a significant regression in the performance of the above snippet, which I'd largely attribute to the write() call. Of course, the oldest_ordered().eq() in every iteration has a proportionately larger impact on runtime than the write() on its own, but just upgrading heapless to 0.9.1 with this snippet still results in about 400 MB/s less throughput in a real-world scenario (~1.4 → ~1.0 GB/s).
Different optimization settings like LTO or codegen units do not seem to affect the result much. You can find a Divan-based benchmark as an MRE below. This may very well be a case of me holding it wrong, but I figured it would be worth reporting nonetheless.
Environment
- CPU: Ryzen 7 PRO 5850U
- Kernel version: 6.16.1
- `rustc -vV`:
  ```
  rustc 1.89.0 (29483883e 2025-08-04)
  binary: rustc
  commit-hash: 29483883eed69d5fb4db01964cdf2af4d86e9cb2
  commit-date: 2025-08-04
  host: x86_64-unknown-linux-gnu
  release: 1.89.0
  LLVM version: 20.1.7
  ```
MRE Benchmark
```rust
// benches/historybuf.rs
use std::hint::black_box;

use divan::counter::BytesCount;
use heapless::HistoryBuf; // Replace with `HistoryBuffer` for heapless 0.8.0

const NEEDLE: &[u8] = b"needle451";
const WINDOW_SIZE: usize = NEEDLE.len();

fn historybuf_search_window_write(data: Vec<u8>) {
    let mut search_window: HistoryBuf<u8, WINDOW_SIZE> = HistoryBuf::new();
    for byte in data {
        search_window.write(byte);
    }
}

#[divan::bench]
fn bench_historybuf_search_window_write(bencher: divan::Bencher) {
    let total_bytes: usize = 8 * 1024 * 1024;
    let bytes_before_needle = total_bytes - NEEDLE.len();
    bencher
        .counter(BytesCount::new(total_bytes))
        .with_inputs(|| {
            let mut data = Vec::with_capacity(total_bytes);
            data.extend(vec![b'\0'; bytes_before_needle]);
            data.extend(NEEDLE);
            data
        })
        .bench_values(|data| {
            historybuf_search_window_write(black_box(data));
        });
}

fn main() {
    divan::main();
}
```
```toml
# Cargo.toml
[package]
name = "heapless-perf-mre"
edition = "2024"

[dependencies]
heapless = "0.9.1"

[dev-dependencies]
divan = "0.1.21"

[profile.release]
lto = "fat"
codegen-units = 1

[[bench]]
name = "historybuf"
harness = false
```
Results
cargo bench --benches
heapless 0.8.0 (HistoryBuffer)
```
     Running benches/historybuf.rs (target/release/deps/historybuf-2a85391490864b67)
Timer precision: 20 ns
historybuf                                 fastest     │ slowest    │ median     │ mean       │ samples │ iters
╰─ bench_historybuf_search_window_write    24.61 µs    │ 275.3 µs   │ 119.5 µs   │ 124.4 µs   │ 100     │ 100
                                           340.7 GB/s  │ 30.46 GB/s │ 70.15 GB/s │ 67.37 GB/s │         │
```
heapless 0.9.1 (HistoryBuf)
```
     Running benches/historybuf.rs (target/release/deps/historybuf-974f73ca8d39caa1)
Timer precision: 20 ns
historybuf                                 fastest     │ slowest    │ median     │ mean       │ samples │ iters
╰─ bench_historybuf_search_window_write    6.823 ms    │ 7.078 ms   │ 6.87 ms    │ 6.892 ms   │ 100     │ 100
                                           1.229 GB/s  │ 1.185 GB/s │ 1.22 GB/s  │ 1.217 GB/s │         │
```
> However, with heapless 0.9.1, there seems to be a significant regression
Wow, the regression is indeed quite severe. 😢 If we had benchmarks in the repo and ran them as part of CI (like I do in the zbus project), it would help identify such regressions early on. Do you think you could contribute that? (I'm asking because it seems you're already halfway there, since you have some benchmarks.)
As for the actual issue, I've never used this part of the API, but I guess someone will need to do a git bisect with the benchmarking code you provided to find out which change introduced this, and hopefully the person responsible will be interested in fixing the regression too. :)
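For whoever picks this up: one way to drive such a bisect (a suggestion on my part, not something already in this thread; the relative path is just an example) is to point the MRE's Cargo.toml at a local heapless checkout, so each bisect step benchmarks exactly the revision that is checked out:

```toml
# Replace the crates.io dependency in the MRE's Cargo.toml with a path
# dependency on the local checkout being bisected:
[dependencies]
heapless = { path = "../heapless" }
```

Then start the bisect in the checkout, marking the 0.9.1 tag bad and the 0.8.0 tag good, and re-run `cargo bench` at each step to decide good/bad.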
> do a git bisect
Looks like this bisects to 314fd85. @sosthene-nitrokey Mind having a look? 🙂
I'd love to contribute (initial) benchmarks, but I'll have to see if I can make time for that. Quite busy at the moment. 😬
It looks like it's about optimising out the panicking branches in the slice accesses. Adding the following at the beginning of the write function brings performance back to what it was before: with 0.9.1 the bounds checks apparently are not elided, while in 0.8.0 they can be.
```rust
// Requires `use core::hint::unreachable_unchecked;`
if self.write_at >= self.capacity() {
    unsafe { unreachable_unchecked() };
}
```
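As a standalone illustration of the pattern (not heapless's actual implementation; the `Ring` type and its fields are made up to mirror `write_at`), the hint tells the optimizer that the index is always in bounds, so the panicking branch of the slice indexing can be removed:

```rust
use core::hint::unreachable_unchecked;

// Minimal ring buffer sketch. The invariant `write_at < N` is upheld by
// the wrap-around at the end of `write`, which is what makes the
// `unreachable_unchecked` hint sound.
struct Ring<const N: usize> {
    data: [u8; N],
    write_at: usize,
}

impl<const N: usize> Ring<N> {
    fn write(&mut self, byte: u8) {
        // SAFETY: `write_at` starts at 0 and is always reduced modulo N
        // below, so it can never reach N.
        if self.write_at >= N {
            unsafe { unreachable_unchecked() };
        }
        self.data[self.write_at] = byte; // bounds check can now be elided
        self.write_at = (self.write_at + 1) % N;
    }
}
```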