
Add benchmarking infrastructure

Open KnorpelSenf opened this issue 3 months ago • 6 comments

This crate uses Vec<bool> as its primary data structure on which diffs are performed: https://github.com/pascalkuthe/imara-diff/blob/055f8e39efa52cf20ba1632b1f3437b6e0eb49cf/src/lib.rs#L231-L232 This needs 8 bits of space for every bit of semantic information. We could technically reduce the memory consumption of this data by ~8× by employing a bit vector.

I would assume that a drastic reduction in memory consumption will make the algorithm perform a lot better. The overhead of the added bit operations is likely negligible in contrast to the reduction of cache misses.

Is this something you have looked into? If yes, can you outline why you decided against using bit vectors?

As the set of operations on this data structure is fairly limited in the context of this crate, it should be straightforward to implement a custom version directly in this crate. Alternatively, it is possible to use an existing solution like https://crates.io/crates/bitvec.

Please let me know if I am missing something obvious.
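
To make the idea concrete, here is a minimal sketch of such a custom bit vector, covering only the operations the crate would plausibly need. The type and method names are illustrative, not taken from imara-diff or bitvec:

```rust
/// Hypothetical minimal bit vector: one bit per flag, packed into u64 words.
struct BitVec {
    words: Vec<u64>,
}

impl BitVec {
    /// Allocate storage for `len` bits, all initially false.
    fn new(len: usize) -> Self {
        Self { words: vec![0; (len + 63) / 64] }
    }

    /// Set bit `i` to true.
    fn set(&mut self, i: usize) {
        self.words[i / 64] |= 1 << (i % 64);
    }

    /// Read bit `i`.
    fn get(&self, i: usize) -> bool {
        self.words[i / 64] & (1 << (i % 64)) != 0
    }
}

fn main() {
    let mut v = BitVec::new(1000);
    v.set(42);
    assert!(v.get(42));
    assert!(!v.get(43));
    // 1000 flags fit in 16 u64 words (128 bytes) instead of the
    // 1000 bytes a Vec<bool> of the same length would use.
    assert_eq!(v.words.len(), 16);
    println!("ok");
}
```

The bit-masking in `set`/`get` is the extra per-access cost discussed below; the question is whether the ~8× smaller footprint pays for it via fewer cache misses.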

KnorpelSenf avatar Sep 27 '25 15:09 KnorpelSenf

I'd definitely love this! One could write a test with a custom allocator; some of these expose debugging information for measuring memory usage. With that in place, one could port the code over to bitvec and measure both time and memory. Assuming the bitvec version is at least as fast but uses less memory, there would be a basis to discuss whether picking up a new dependency is worth it.

Byron avatar Sep 27 '25 18:09 Byron

I agree that we should measure before optimising this. I also think that the benchmarking setup contains a lot more complexity than a new implementation based on bit vectors.

KnorpelSenf avatar Sep 28 '25 18:09 KnorpelSenf

Here is an example of the allocator-based setup, copied from gitoxide.

use std::{alloc, time::Instant};

use bytesize::ByteSize;
use cap::Cap;

#[global_allocator]
static ALLOCATOR: Cap<alloc::System> = Cap::new(alloc::System, usize::MAX);

#[test]
fn usage() {
    let before = ALLOCATOR.allocated();
    let start = Instant::now();
    let config = gix_config::File::from_bytes_no_includes(
        include_bytes!("fixtures/fuzzed/mem-amplification.config"),
        gix_config::file::Metadata::from(gix_config::Source::User),
        Default::default(),
    )
    .unwrap();
    let after = ALLOCATOR.allocated();
    let used = after - before;
    let elapsed = start.elapsed().as_secs_f32();
    eprintln!(
        "used mem: {}B for {} sections, took {elapsed:.02}s [total-mem: {total}, peak-mem: {peak}]",
        ByteSize(used as u64),
        config.sections().count(),
        total = ByteSize(ALLOCATOR.total_allocated() as u64),
        peak = ByteSize(ALLOCATOR.max_allocated() as u64),
    );
    assert!(
        used < 200 * 1024 * 1024,
        "we should now start using more memory than anticipated, to keep mem-amplification low"
    );
}

Maybe there are real-world diffs that can be profiled instead, as long as there is something reproducible to compare performance with, both in terms of time and memory.

Byron avatar Sep 28 '25 19:09 Byron

Thank you so much!

Maybe there are real-world diffs that can be profiled instead, as long as there is something reproducible to compare performance with, both in terms of time and memory.

That's a great point. I think I have enough example files ready to test things. I will see if I can spend some time on a benchmarking setup.

By the way, where is the code that was used for the benchmarks in the README file? Can't that just be reused here?

KnorpelSenf avatar Sep 28 '25 19:09 KnorpelSenf

That's a great point. I think I have enough example files ready to test things. I will see if I can spend some time on a benchmarking setup.

And these benchmarks, I think, could be a contribution on their own as they would enable any future performance optimisation as well 🎉.

By the way, where is the code that was used for the benchmarks in the README file? Can't that just be reused here?

I assume it's based on this file here. Maybe it can be brought back to life in an improved fashion.

Last but not least, I'd also think that real-world benches are an option. Create a binary that runs a diff on two files, maybe even many times to compensate for IO overhead, and run it with /usr/bin/time -lp to get memory statistics. Running benchmarks on CI or making them executable is great, but finding fringe cases in the real world with a binary and manual exploration is also valuable.

When arguing for bitvec here I'd also think a reproducible manual benchmark is good enough, even though an easy-to-run auto-benchmark is definitely preferred (but also not required). Such a binary could live in ./examples/ and be compiled with cargo build --example 'foo' --release.
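
A sketch of what such an example binary's hot loop could look like, using only std; the diff call is a stand-in (the real example would invoke imara-diff on the two inputs), and `black_box` keeps the compiler from optimizing the repeated runs away:

```rust
// examples/diff-bench.rs (hypothetical name): run a diff many times on
// two fixed inputs so that IO overhead is amortized, then let
// /usr/bin/time report wall time and peak memory for the process.
use std::hint::black_box;

// Stand-in for the real diff call: count lines that differ pairwise.
fn diff_len(before: &str, after: &str) -> usize {
    before.lines().zip(after.lines()).filter(|(a, b)| a != b).count()
}

fn main() {
    let before = "a\nb\nc\n";
    let after = "a\nx\nc\n";
    let mut changed = 0;
    for _ in 0..1000 {
        changed = black_box(diff_len(black_box(before), black_box(after)));
    }
    println!("changed lines: {changed}");
}
```

Note that `-l` is the macOS flag of /usr/bin/time for memory statistics; GNU time reports the same via `-v`.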

Byron avatar Sep 29 '25 02:09 Byron

Before investing too much into the bitvec: I did consider this but never invested in it because I don't think the memory savings will be particularly significant. It reduces one byte per token to 1/8 byte per token (so it saves 7/8 of a byte per token). The interner alone stores 4 bytes (a u32) per token. You would have to have a seriously huge file before either of that matters. As a token is usually a line (a char/byte diff on the entire file is a terrible idea anyway), that means, for example, that with an average line length of 20 bytes you would only save about 4 MiB for a 100 MiB file (while the interner alone would take 20 MiB).

There is something to be said for improved cache locality, but it would add a lot of complexity because the Vec<bool> is deeply ingrained in the Myers diff implementation. It would be a pretty complicated and large change that would need to be well motivated. I can also imagine that it wouldn't actually be faster, as it requires extra ops for bit masking.
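
A back-of-envelope check of the estimate above (all numbers taken from the comment; the assumed average line length of 20 bytes is the comment's example, not a measured value):

```rust
fn main() {
    let file_size = 100 * 1024 * 1024; // 100 MiB input
    let avg_line_len = 20;             // assumed average line length
    let tokens = file_size / avg_line_len; // one token per line: ~5.2M

    let vec_bool_bytes = tokens;       // Vec<bool>: 1 byte per token
    let bitvec_bytes = tokens / 8;     // bit vector: 1 bit per token
    let interner_bytes = tokens * 4;   // interner: one u32 per token

    let saved = vec_bool_bytes - bitvec_bytes;
    // Savings: 4,587,520 bytes (~4.4 MiB), dwarfed by the 20 MiB interner.
    assert_eq!(saved, 4_587_520);
    assert_eq!(interner_bytes, 20 * 1024 * 1024);
    println!("saved: {saved} bytes, interner: {interner_bytes} bytes");
}
```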

Benchmarks are probably a good way to start either way.

pascalkuthe avatar Oct 01 '25 08:10 pascalkuthe

Another thing that might be interesting to look into, once we have benchmarking infrastructure available, is using the fnv crate instead of the default SipHash hasher. This could make interning the strings a lot faster, but given that we likely do not spend a substantial amount of time on the interning step, it could be another case of optimising the wrong thing.
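
For illustration, swapping the hasher of an interner-style map only means supplying a different `BuildHasher`. The sketch below implements FNV-1a by hand with std only (in practice one would just depend on the fnv crate); the map shape is illustrative, not imara-diff's actual interner:

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// Minimal FNV-1a hasher (what the `fnv` crate provides).
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf2_9ce4_8422_2325) // FNV-1a 64-bit offset basis
    }
}

impl Hasher for Fnv1a {
    fn write(&mut self, bytes: &[u8]) {
        const PRIME: u64 = 0x100_0000_01b3;
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(PRIME);
        }
    }
    fn finish(&self) -> u64 {
        self.0
    }
}

fn main() {
    // Interner-style map keyed by lines, using FNV instead of SipHash.
    let mut interner: HashMap<&str, u32, BuildHasherDefault<Fnv1a>> =
        HashMap::default();
    let next = interner.len() as u32;
    interner.entry("fn main() {").or_insert(next);
    assert_eq!(interner["fn main() {"], 0);
    println!("ok");
}
```

FNV trades SipHash's resistance to hash-flooding attacks for speed on short keys, which is why std does not use it by default.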

KnorpelSenf avatar Nov 17 '25 13:11 KnorpelSenf