Pack size regression in latest gitoxide (gix-pack 0.58 -> 0.59)

Open xmo-odoo opened this issue 6 months ago • 12 comments

While updating various gix dependencies to their latest versions, I noticed that the pack size seems to have increased by ~50% in the latest release of gix-pack, at least in some cases.

I noticed it in a scenario where I create a relatively wide tree (1500 files with randomly generated names). Doing just that is sufficient to make the issue noticeable.

(the full use case actually wants to trigger a massive conflict, but that part is unnecessary and irrelevant here)

Steps to reproduce 🕹

# Cargo.toml
[package]
name = "packtest"
version = "0.1.0"
edition = "2024"

[dependencies]
# gix-actor = "*"
# gix-date = "*"
# gix-hash = "*"
# gix-hashtable = "*"
# gix-object = "*"
# gix-pack = "*"

gix-actor = "0.34.0"
gix-date = "0.9.4"
gix-hash = "0.17.0"
gix-hashtable = "=0.8.0"
gix-object = "0.48.0"
gix-pack = "0.58.0"

hex = "0.4.3"
rand = "0.9.1"
// main.rs
use std::collections::BTreeMap;
use std::io::Write as _;

use rand::Rng;
use gix_hash::ObjectId;
use gix_object::WriteTo as _;

type Store = BTreeMap<gix_hash::ObjectId, gix_object::Object>;

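// Serialize `object` in loose-object format (header + body) through a hashing
// writer, then record it in the store under the resulting object id.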
fn store(store: &mut Store, object: gix_object::Object) -> ObjectId {
    let mut ser = gix_hash::io::Write::new(Vec::new(), gix_hash::Kind::Sha1);
    ser.write_all(&object.loose_header()).unwrap();
    object.write_to(&mut ser).unwrap();
    let oid = ser.hash.try_finalize().unwrap();
    store.insert(oid, object);
    oid
}

fn main() {
    let mut objects = Store::new();
    let oid = store(
        &mut objects,
        gix_object::Object::Blob(gix_object::BlobRef{ data: b"xoxo"}.into())
    );

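    // Generate 1500 random 20-hex-character filenames, sorted as tree entries must be.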
    let mut files = rand::rng()
        .random_iter::<[u8;10]>()
        .map(hex::encode)
        .take(1500)
        .collect::<Vec<String>>();
    files.sort();

    let mode = gix_object::tree::EntryKind::Blob.into();
    let t0 = store(&mut objects, gix_object::Tree {
        entries: files.iter().map(|fname| gix_object::tree::Entry {
            mode,
            oid,
            filename: fname.as_str().into(),
        }).collect(),
    }.into());

    store(&mut objects, gix_object::Commit {
        tree: t0,
        parents: Default::default(),
        author: gix_actor::Signature {
            name: "foo".into(),
            email: "[email protected]".into(),
            time: gix_date::Time {
                seconds: 1747920830,
                offset: 0,
                sign: gix_date::time::Sign::Plus,
            },
        },
        committer: gix_actor::Signature {
            name: "foo".into(),
            email: "[email protected]".into(),
            time: gix_date::Time {
                seconds: 1747920830,
                offset: 0,
                sign: gix_date::time::Sign::Plus,
            },
        },
        encoding: None,
        message: "xxx".into(),
        extra_headers: Vec::new(),
    }.into());

    // let mut fs1: Vec<_> = files.iter().map(|fname| gix_object::tree::Entry {
    //     mode,
    //     oid,
    //     filename: fname.as_str().into(),
    // }).collect();
    // fs1.push(gix_object::tree::Entry { mode, oid, filename: "a".into()});
    // fs1.sort();
    // let _t2 = store(&mut objects, gix_object::Tree { entries: fs1 }.into());

    let count = objects.len();
    let mut pack = Vec::new();
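    // Stream every object into a v2 pack; Entry::from_data compresses each one
    // as a base (non-delta) entry using the library's default compression setting.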
    let mut entries_writer = gix_pack::data::output::bytes::FromEntriesIter::new(
        objects.into_iter().map(
            |(oid, obj)| -> Result<Vec<_>, gix_pack::data::output::entry::Error> {
                use gix_pack::data::output::{Entry, Count};
                let kind = obj.kind();
                let mut buf = Vec::new();
                obj.write_to(&mut buf).unwrap();
                Ok(vec![Entry::from_data(
                    &Count::from_data(oid, None),
                    &gix_object::Data::new(kind, &buf),
                )?])
            }
        ),
        &mut pack,
        count.try_into().unwrap(),
        gix_pack::data::Version::V2,
        gix_hash::Kind::Sha1,
    );
    for e in entries_writer.by_ref() {
        e.unwrap();
    }
    entries_writer.digest().unwrap();
    println!("{}", pack.len());
}

Running locally I get a pack size of 19375. Uncommenting the "latest" packages, commenting the "old" packages and the sign lines (which the updated gix-date no longer accepts), updating, and re-running, I get a pack size of 29737.

Uncommenting the bit which adds a second tree yields 38548 and 59395 bytes respectively, which is why I originally thought this was a delta-ification issue. But those sizes are about twice the size of the corresponding single-tree versions, so delta-ification apparently never did much in this case.

Git behavior

Creating the same repository in git (with commits and branches to keep the objects alive), including the second tree, yields a pack that's about 20k. Here's a bundle you can check:

out.bundle.gz

xmo-odoo avatar May 23 '25 06:05 xmo-odoo

Thanks a lot for sharing!

As the gix-pack implementation didn't change, I presume this might have something to do with the switch to zlib-rs as the backend. Maybe it compresses differently given the same compression settings.

https://github.com/GitoxideLabs/gitoxide/blob/0fa04bcbdf3102c5435e64cfef894a1bfc8d6e7b/gix-features/src/zlib/stream/deflate/mod.rs#L34-L36

Could it be that fast has much worse compression than zlib-ng for instance?
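
One way to gauge this is to compress a tree-like payload at both levels directly through flate2. A rough sketch follows; the payload only approximates the reporter's tree (1500 entries of 48 bytes each, all pointing at a single blob OID), so the absolute numbers will differ somewhat:

// level_compare.rs (sketch)
use std::io::Write as _;

use flate2::write::ZlibEncoder;
use flate2::Compression;

fn main() {
    // Approximate the reported tree: 1500 entries of
    // "100644 <20-hex-char name>\0<20-byte oid>", all sharing one blob OID,
    // i.e. 1500 * 48 = 72000 raw bytes.
    let oid = [0xABu8; 20];
    let mut tree = Vec::new();
    for i in 0u64..1500 {
        tree.extend_from_slice(b"100644 ");
        tree.extend_from_slice(format!("{:020x}", i.wrapping_mul(0x9E37_79B9_7F4A_7C15)).as_bytes());
        tree.push(0);
        tree.extend_from_slice(&oid);
    }
    assert_eq!(tree.len(), 72000);

    // fast() is level 1, default() is level 6.
    for level in [Compression::fast(), Compression::default()] {
        let mut enc = ZlibEncoder::new(Vec::new(), level);
        enc.write_all(&tree).unwrap();
        let compressed = enc.finish().unwrap();
        println!("level {}: {} -> {} bytes", level.level(), tree.len(), compressed.len());
    }
}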

PS: I assume the unit of the numbers, like 20k or 29737, is bytes.

CC @folkertdev

Byron avatar May 23 '25 06:05 Byron

Could it be that fast has much worse compression than zlib-ng for instance?

That might make sense. Although FWIW, testing combinations of wbits and compression level using Python's stdlib, there is no setting which yields anywhere near 30kB; from a raw tree of 72000 bytes I get (all numbers in bytes):

level  wbits 9  wbits 10  wbits 11  wbits 12  wbits 13  wbits 14  wbits 15
1        17946     18544     19213     19960     20746     21232     21703
2        17946     18544     19213     19960     20741     21230     21693
3        17900     18469     19127     19797     20476     20958     21427
4        20725     20356     20163     20763     20796     20884     21156
5        17794     18409     19064     19723     19909     20096     20328
6        17770     18389     19039     19709     19898     20082     20314
7        17768     18348     19011     19684     19876     20057     20276
8        17768     18327     18925     19586     19802     20003     20236
9        17768     18327     18925     19548     19739     19960     20201

The pack itself adds some overhead, though, and obviously contains more than just the "big tree".

PS: I assume the unit of the numbers, like 20k or 29737, is bytes.

That is correct, sorry for not specifying it.

xmo-odoo avatar May 23 '25 07:05 xmo-odoo

Could it be that fast has much worse compression than zlib-ng for instance?

FWIW, assuming the various zlib features of gix-features impact this, I'm not seeing anything significant when switching between the zlib backends: all of them yield around 29700 bytes (with some jitter).

I just tested things by adding gix-features = { version = "*", features = [...] } to the manifest though, which might not be correct?

xmo-odoo avatar May 23 '25 08:05 xmo-odoo

It would be extremely surprising if zlib-rs caused this: we produce byte-for-byte equivalent output to zlib-ng for each compression level.

folkertdev avatar May 23 '25 08:05 folkertdev

Thinking about it a bit, I figured the pack itself might cause issues or interfere, so I stripped the script down to just generating an Entry from the tree and logging the size:

// main.rs
use std::collections::BTreeMap;
use std::io::Write as _;

use rand::Rng;
use gix_hash::ObjectId;
use gix_object::WriteTo as _;

type Store = BTreeMap<gix_hash::ObjectId, gix_object::Object>;

fn store(store: &mut Store, object: gix_object::Object) -> ObjectId {
    let mut ser = gix_hash::io::Write::new(Vec::new(), gix_hash::Kind::Sha1);
    ser.write_all(&object.loose_header()).unwrap();
    object.write_to(&mut ser).unwrap();
    let oid = ser.hash.try_finalize().unwrap();
    store.insert(oid, object);
    oid
}

fn main() {
    let mut objects = Store::new();
    let oid = store(
        &mut objects,
        gix_object::Object::Blob(gix_object::BlobRef{ data: b"xoxo"}.into())
    );

    let mut files = rand::rng()
        .random_iter::<[u8;10]>()
        .map(hex::encode)
        .take(1500)
        .collect::<Vec<String>>();
    files.sort();

    let mode = gix_object::tree::EntryKind::Blob.into();
    let t0 = store(&mut objects, gix_object::Tree {
        entries: files.iter().map(|fname| gix_object::tree::Entry {
            mode,
            oid,
            filename: fname.as_str().into(),
        }).collect(),
    }.into());

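    // Compress just the big tree into a single pack entry, bypassing pack writing entirely.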
    let obj = &objects[&t0];
    let mut buf = Vec::new();
    obj.write_to(&mut buf).unwrap();
    let e = gix_pack::data::output::Entry::from_data(
        &gix_pack::data::output::Count::from_data(t0, None),
        &gix_object::Data::new(obj.kind(), &buf),
    ).unwrap();

    println!("compressed size {}", e.compressed_data.len());
    println!("decompressed size {}", e.decompressed_size);
}

This yields ~19200 bytes in the old version and ~29500 in the new version, which confirms that it's not an issue with the pack itself (and simplifies the test case).

xmo-odoo avatar May 23 '25 08:05 xmo-odoo

It's great to see that zlib-rs can't be involved in terms of compression; thanks for chiming in @folkertdev.

FWIW, assuming the various zlib features of gix-features impact this, I'm not seeing anything significant when switching between the zlib backends: all of them yield around 29700 bytes (with some jitter).

I just tested things by adding gix-features = { version = "*", features = [...] } to the manifest though, which might not be correct?

Just for completeness I want to add that these flags have no effect anymore; they are scheduled for removal this month.

Thinking about it a bit, I figured the pack itself might cause issues or interfere, so I stripped the script down to just generating an Entry from the tree and logging the size:

Maybe with this script you can git bisect over the codebase, which should yield the commit that breaks it with certainty.

My new best guess is that the miniz_oxide backend was perhaps producing stronger compression by default?

Byron avatar May 26 '25 03:05 Byron

Maybe with this script you can git bisect over the codebase, which should yield the commit that breaks it with certainty.

After updating the script to not depend on third-party packages and to exit with success/failure around a threshold of 25000 bytes (< 25000 = old, > 25000 = new), bisection points to 96164c5936032b4edb973828178cc55793dd57cc, which is the zlib-rs-as-default commit.

bisection script
// gix-pack/examples/entrysize.rs
use std::collections::BTreeMap;
use std::io::{Read as _, Write as _};

use gix_hash::ObjectId;
use gix_object::WriteTo as _;

type Store = BTreeMap<gix_hash::ObjectId, gix_object::Object>;

fn store(store: &mut Store, object: gix_object::Object) -> ObjectId {
    let mut ser = gix_hash::io::Write::new(Vec::new(), gix_hash::Kind::Sha1);
    ser.write_all(&object.loose_header()).unwrap();
    object.write_to(&mut ser).unwrap();
    let oid = ser.hash.try_finalize().unwrap();
    store.insert(oid, object);
    oid
}

fn hexencode(b: &[u8]) -> String {
    // Zero-padded so every byte yields two hex digits, matching hex::encode.
    b.iter().map(|b| format!("{b:02x}")).collect()
}

fn main() {
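    // /dev/urandom stands in for the rand crate so the script builds at any bisected commit.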
    let mut rng = std::fs::File::open("/dev/urandom").unwrap();

    let mut objects = Store::new();
    let oid = store(
        &mut objects,
        gix_object::Object::Blob(gix_object::BlobRef{ data: b"xoxo"}.into())
    );

    let mut files = (0..)
        .map(|_| {
            let mut buf = [0;10];
            rng.read_exact(&mut buf).unwrap();
            hexencode(&buf)
        })
        .take(1500)
        .collect::<Vec<String>>();
    files.sort();

    let mode = gix_object::tree::EntryKind::Blob.into();
    let t0 = store(&mut objects, gix_object::Tree {
        entries: files.iter().map(|fname| gix_object::tree::Entry {
            mode,
            oid,
            filename: fname.as_str().into(),
        }).collect(),
    }.into());

    let obj = &objects[&t0];
    let mut buf = Vec::new();
    obj.write_to(&mut buf).unwrap();
    let e = gix_pack::data::output::Entry::from_data(
        &gix_pack::data::output::Count::from_data(t0, None),
        &gix_object::Data::new(obj.kind(), &buf),
    ).unwrap();

    println!("compressed size {}", e.compressed_data.len());
    println!("decompressed size {}", e.decompressed_size);
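    // Exit 0 = good (old, small entry) / 1 = bad (new, large entry), suitable for git bisect run.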
    std::process::exit((e.compressed_data.len() > 25000).into());
}

xmo-odoo avatar May 26 '25 06:05 xmo-odoo

If the previous default was miniz_oxide, that does make sense: zlib-rs and zlib-ng have a much faster level 1, but that level also compresses much worse. Generally, zlib-rs compression level 2 should be both faster and compress better than stock zlib at level 1 (and I'd guess miniz_oxide follows the stock zlib compression levels). See also

https://github.com/trifectatechfoundation/zlib-rs/tree/main/libz-rs-sys#compression-levels

folkertdev avatar May 26 '25 07:05 folkertdev

I guess in that case the solution would be to compress the data on the application side and create the Entry by hand instead of using the from_data helper? (it's not like that's hugely difficult).
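
Something along these lines, as a minimal sketch: it assumes Entry's public fields (id, kind, decompressed_size, compressed_data) and entry::Kind::Base as found in current gix-pack, and uses flate2 directly for the application-side compression:

// sketch: build a pack entry with an explicit compression level
use std::io::Write as _;

use flate2::{write::ZlibEncoder, Compression};
use gix_pack::data::output;

// Builds a non-delta ("base") pack entry with a caller-chosen compression
// level, bypassing Entry::from_data and its built-in default.
fn entry_with_level(
    id: gix_hash::ObjectId,
    kind: gix_object::Kind,
    data: &[u8],
    level: Compression,
) -> std::io::Result<output::Entry> {
    let mut enc = ZlibEncoder::new(Vec::new(), level);
    enc.write_all(data)?;
    Ok(output::Entry {
        id,
        kind: output::entry::Kind::Base(kind),
        decompressed_size: data.len(),
        compressed_data: enc.finish()?,
    })
}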

xmo-odoo avatar May 26 '25 08:05 xmo-odoo

That's great to hear! Thanks also for the reproduction script!

I had a feeling that this might be down to the default compression settings, which behave differently in miniz_oxide.

A look at the Git source revealed that the issue ultimately stems from Gitoxide relying on a single hard-coded value, despite Git being more complex here.

For instance, there is a core.compression configuration value that we'd have to start passing through.

For starters, one might be able to change the compression level to Compression::default(), which is what Git does by default. Of course, through flate2 the level then is equal to 6 instead of -1 (which would leave the choice to the zlib implementation), but it's worth a shot. Edit: fast() might have been an artefact from miniz_oxide times, when one might not have been able to afford better (or default) compression.
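
For reference, flate2 only exposes fixed levels; there is no zlib-style -1 (Z_DEFAULT_COMPRESSION). The mapping is easy to check:

use flate2::Compression;

fn main() {
    assert_eq!(Compression::fast().level(), 1);
    assert_eq!(Compression::default().level(), 6);
    assert_eq!(Compression::best().level(), 9);
}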

Could @xmo-odoo help?

Byron avatar May 26 '25 08:05 Byron

For instance, there is a core.compression configuration value that we'd have to start passing through.

For starters, one might be able to change the compression level to Compression::default(), which is what Git does by default. Of course, through flate2 the level then is equal to 6 instead of -1 (which would leave the choice to the zlib implementation), but it's worth a shot. Edit: fast() might have been an artefact from miniz_oxide times, when one might not have been able to afford better (or default) compression.

core.loosecompression and pack.compression are also likely values of interest.

According to the git-config docs (the fallback logic is sketched in code after this list):

  • core.loosecompression defaults to core.compression, or 1 (best speed) if that's not set either
  • pack.compression defaults to core.compression or -1 (default ~ 6 in standard zlib) if that's not set either
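
That fallback chain, written out as plain logic (hypothetical helpers for illustration, not an existing gix API):

// -1 is zlib's "implementation default" (level 6 in practice), 1 is best speed.
fn effective_pack_compression(pack: Option<i8>, core: Option<i8>) -> i8 {
    // pack.compression -> core.compression -> -1
    pack.or(core).unwrap_or(-1)
}

fn effective_loose_compression(loose: Option<i8>, core: Option<i8>) -> i8 {
    // core.loosecompression -> core.compression -> 1
    loose.or(core).unwrap_or(1)
}

fn main() {
    assert_eq!(effective_pack_compression(None, None), -1);
    assert_eq!(effective_loose_compression(None, Some(4)), 4);
}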

So it might be better to remove the default compression level entirely, and require callers to pass in a compression level explicitly? Tests aside, there are only 3 call sites, which makes sense:

  • gix-pack's output::Entry
  • gix-odb's loose object store
  • gix-odb's Sink (I'll admit I'm not entirely sure why; benchmarking?)

Could @xmo-odoo help?

Maybe? I mostly use the lower-level packages, so I don't really know the organisation of the CLI. But I could probably update the APIs from the bottom up to add support for customisable compression, then follow the compilation errors until it makes sense to stop and retrieve the configuration data.

xmo-odoo avatar May 26 '25 10:05 xmo-odoo

core.loosecompression and pack.compression are also likely values of interest.

Definitely.

So it might be better to remove the default compression level entirely, and require callers to pass in a compression level explicitly? Tests aside, there are only 3 call sites, which makes sense:

Yes, that's the plan for the correct implementation.

However, I am open to setting the default value to something in between that adjusts for zlib-rs's higher performance, as a better trade-off between speed and compression ratio. Such a thing would be trivial to implement, but I'd certainly also welcome the full implementation that forces callers to pass the compression level.

Byron avatar May 26 '25 10:05 Byron