Pack size regression in latest gitoxide (gix-pack 0.58 -> 0.59)

Open xmo-odoo opened this issue 6 months ago • 12 comments

While updating various gix dependencies to their latest versions, I noticed that the pack size seems to have increased by ~50% in the latest release of gix-pack, at least in some cases.

I noticed it in a scenario where I create a relatively wide tree (1500 files with randomly generated names). Doing just that is sufficient to make the issue noticeable.

(the full use case actually wants to trigger a massive conflict, but that part is unnecessary and irrelevant here)

Steps to reproduce 🕹

# Cargo.toml
[package]
name = "packtest"
version = "0.1.0"
edition = "2024"

[dependencies]
# gix-actor = "*"
# gix-date = "*"
# gix-hash = "*"
# gix-hashtable = "*"
# gix-object = "*"
# gix-pack = "*"

gix-actor = "0.34.0"
gix-date = "0.9.4"
gix-hash = "0.17.0"
gix-hashtable = "=0.8.0"
gix-object = "0.48.0"
gix-pack = "0.58.0"

hex = "0.4.3"
rand = "0.9.1"
// main.rs
use std::collections::BTreeMap;
use std::io::Write as _;

use rand::Rng;
use gix_hash::ObjectId;
use gix_object::WriteTo as _;

type Store = BTreeMap<gix_hash::ObjectId, gix_object::Object>;

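// Serialize `object` in loose-object format (header + body) through a hashing
// writer, then record it in the store under the resulting object id.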
fn store(store: &mut Store, object: gix_object::Object) -> ObjectId {
    let mut ser = gix_hash::io::Write::new(Vec::new(), gix_hash::Kind::Sha1);
    ser.write_all(&object.loose_header()).unwrap();
    object.write_to(&mut ser).unwrap();
    let oid = ser.hash.try_finalize().unwrap();
    store.insert(oid, object);
    oid
}

fn main() {
    let mut objects = Store::new();
    let oid = store(
        &mut objects,
        gix_object::Object::Blob(gix_object::BlobRef{ data: b"xoxo"}.into())
    );

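    // Generate 1500 random 20-hex-character filenames, sorted as tree entries must be.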
    let mut files = rand::rng()
        .random_iter::<[u8;10]>()
        .map(hex::encode)
        .take(1500)
        .collect::<Vec<String>>();
    files.sort();

    let mode = gix_object::tree::EntryKind::Blob.into();
    let t0 = store(&mut objects, gix_object::Tree {
        entries: files.iter().map(|fname| gix_object::tree::Entry {
            mode,
            oid,
            filename: fname.as_str().into(),
        }).collect(),
    }.into());

    store(&mut objects, gix_object::Commit {
        tree: t0,
        parents: Default::default(),
        author: gix_actor::Signature {
            name: "foo".into(),
            email: "[email protected]".into(),
            time: gix_date::Time {
                seconds: 1747920830,
                offset: 0,
                sign: gix_date::time::Sign::Plus,
            },
        },
        committer: gix_actor::Signature {
            name: "foo".into(),
            email: "[email protected]".into(),
            time: gix_date::Time {
                seconds: 1747920830,
                offset: 0,
                sign: gix_date::time::Sign::Plus,
            },
        },
        encoding: None,
        message: "xxx".into(),
        extra_headers: Vec::new(),
    }.into());

    // let mut fs1: Vec<_> = files.iter().map(|fname| gix_object::tree::Entry {
    //     mode,
    //     oid,
    //     filename: fname.as_str().into(),
    // }).collect();
    // fs1.push(gix_object::tree::Entry { mode, oid, filename: "a".into()});
    // fs1.sort();
    // let _t2 = store(&mut objects, gix_object::Tree { entries: fs1 }.into());

    let count = objects.len();
    let mut pack = Vec::new();
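    // Stream every object into a v2 pack; Entry::from_data compresses each one
    // as a base (non-delta) entry using the library's default compression setting.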
    let mut entries_writer = gix_pack::data::output::bytes::FromEntriesIter::new(
        objects.into_iter().map(
            |(oid, obj)| -> Result<Vec<_>, gix_pack::data::output::entry::Error> {
                use gix_pack::data::output::{Entry, Count};
                let kind = obj.kind();
                let mut buf = Vec::new();
                obj.write_to(&mut buf).unwrap();
                Ok(vec![Entry::from_data(
                    &Count::from_data(oid, None),
                    &gix_object::Data::new(kind, &buf),
                )?])
            }
        ),
        &mut pack,
        count.try_into().unwrap(),
        gix_pack::data::Version::V2,
        gix_hash::Kind::Sha1,
    );
    for e in entries_writer.by_ref() {
        e.unwrap();
    }
    entries_writer.digest().unwrap();
    println!("{}", pack.len());
}

Running locally I get a pack size of 19375. Uncommenting the "latest" packages, commenting the "old" packages and the sign lines (which the updated gix-date no longer accepts), updating, and re-running, I get a pack size of 29737.

Uncommenting the bit which adds a second tree yields 38548 and 59395 bytes respectively, which is why I originally thought this was a delta-ification issue. But those sizes are about twice the size of the corresponding single-tree versions, so delta-ification apparently never did much in this case.

Git behavior

Creating the same repository in git (with commits and branches to keep the objects alive), including the second tree, yields a pack that's about 20k. Here's a bundle you can check:

out.bundle.gz

xmo-odoo avatar May 23 '25 06:05 xmo-odoo

Thanks a lot for sharing!

As the gix-pack implementation didn't change, I presume this might have something to do with the switch to zlib-rs as the backend. Maybe it compresses differently given the same compression settings.

https://github.com/GitoxideLabs/gitoxide/blob/0fa04bcbdf3102c5435e64cfef894a1bfc8d6e7b/gix-features/src/zlib/stream/deflate/mod.rs#L34-L36

Could it be that fast has much worse compression than zlib-ng for instance?
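
One way to gauge this is to compress a tree-like payload at both levels directly through flate2. A rough sketch follows; the payload only approximates the reporter's tree (1500 entries of 48 bytes each, all pointing at a single blob OID), so the absolute numbers will differ somewhat:

// level_compare.rs (sketch)
use std::io::Write as _;

use flate2::write::ZlibEncoder;
use flate2::Compression;

fn main() {
    // Approximate the reported tree: 1500 entries of
    // "100644 <20-hex-char name>\0<20-byte oid>", all sharing one blob OID,
    // i.e. 1500 * 48 = 72000 raw bytes.
    let oid = [0xABu8; 20];
    let mut tree = Vec::new();
    for i in 0u64..1500 {
        tree.extend_from_slice(b"100644 ");
        tree.extend_from_slice(format!("{:020x}", i.wrapping_mul(0x9E37_79B9_7F4A_7C15)).as_bytes());
        tree.push(0);
        tree.extend_from_slice(&oid);
    }
    assert_eq!(tree.len(), 72000);

    // fast() is level 1, default() is level 6.
    for level in [Compression::fast(), Compression::default()] {
        let mut enc = ZlibEncoder::new(Vec::new(), level);
        enc.write_all(&tree).unwrap();
        let compressed = enc.finish().unwrap();
        println!("level {}: {} -> {} bytes", level.level(), tree.len(), compressed.len());
    }
}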

PS: I assume the unit of the numbers, like 20k or 29737, is bytes.

CC @folkertdev

Byron avatar May 23 '25 06:05 Byron

Could it be that fast has much worse compression than zlib-ng for instance?

That might make sense. Although FWIW, testing combinations of wbits and compression level using Python's stdlib, there is no setting which yields anywhere near 30kB; from a raw tree of 72000 bytes I get (all numbers in bytes):

level  wbits 9  wbits 10  wbits 11  wbits 12  wbits 13  wbits 14  wbits 15
1        17946     18544     19213     19960     20746     21232     21703
2        17946     18544     19213     19960     20741     21230     21693
3        17900     18469     19127     19797     20476     20958     21427
4        20725     20356     20163     20763     20796     20884     21156
5        17794     18409     19064     19723     19909     20096     20328
6        17770     18389     19039     19709     19898     20082     20314
7        17768     18348     19011     19684     19876     20057     20276
8        17768     18327     18925     19586     19802     20003     20236
9        17768     18327     18925     19548     19739     19960     20201

The pack itself adds some overhead, though, and obviously contains more than just the "big tree".

PS: I assume the unit of the numbers, like 20k or 29737, is bytes.

That is correct, sorry for not specifying it.

xmo-odoo avatar May 23 '25 07:05 xmo-odoo

Could it be that fast has much worse compression than zlib-ng for instance?

FWIW, assuming the various zlib features of gix-features impact this, I'm not seeing anything significant when switching between the zlib backends: all of them yield around 29700 bytes (with some jitter).

I just tested things by adding gix-features = { version = "*", features = [...] } to the manifest though, which might not be correct?

xmo-odoo avatar May 23 '25 08:05 xmo-odoo

It would be extremely surprising if zlib-rs caused this: we produce byte-for-byte equivalent output to zlib-ng for each compression level.

folkertdev avatar May 23 '25 08:05 folkertdev

Thinking about it a bit, I figured the pack itself might cause issues or interfere, so I stripped the script down to just generating an Entry from the tree and logging the size:

// main.rs
use std::collections::BTreeMap;
use std::io::Write as _;

use rand::Rng;
use gix_hash::ObjectId;
use gix_object::WriteTo as _;

type Store = BTreeMap<gix_hash::ObjectId, gix_object::Object>;

fn store(store: &mut Store, object: gix_object::Object) -> ObjectId {
    let mut ser = gix_hash::io::Write::new(Vec::new(), gix_hash::Kind::Sha1);
    ser.write_all(&object.loose_header()).unwrap();
    object.write_to(&mut ser).unwrap();
    let oid = ser.hash.try_finalize().unwrap();
    store.insert(oid, object);
    oid
}

fn main() {
    let mut objects = Store::new();
    let oid = store(
        &mut objects,
        gix_object::Object::Blob(gix_object::BlobRef{ data: b"xoxo"}.into())
    );

    let mut files = rand::rng()
        .random_iter::<[u8;10]>()
        .map(hex::encode)
        .take(1500)
        .collect::<Vec<String>>();
    files.sort();

    let mode = gix_object::tree::EntryKind::Blob.into();
    let t0 = store(&mut objects, gix_object::Tree {
        entries: files.iter().map(|fname| gix_object::tree::Entry {
            mode,
            oid,
            filename: fname.as_str().into(),
        }).collect(),
    }.into());

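    // Compress just the big tree into a single pack entry, bypassing pack writing entirely.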
    let obj = &objects[&t0];
    let mut buf = Vec::new();
    obj.write_to(&mut buf).unwrap();
    let e = gix_pack::data::output::Entry::from_data(
        &gix_pack::data::output::Count::from_data(t0, None),
        &gix_object::Data::new(obj.kind(), &buf),
    ).unwrap();

    println!("compressed size {}", e.compressed_data.len());
    println!("decompressed size {}", e.decompressed_size);
}

This yields ~19200 bytes in the old version and ~29500 in the new version, which confirms that it's not an issue with the pack itself (and simplifies the test case).

xmo-odoo avatar May 23 '25 08:05 xmo-odoo

It's great to see that zlib-rs can't be involved in terms of compression; thanks for chiming in @folkertdev.

FWIW, assuming the various zlib features of gix-features impact this, I'm not seeing anything significant when switching between the zlib backends: all of them yield around 29700 bytes (with some jitter).

I just tested things by adding gix-features = { version = "*", features = [...] } to the manifest though, which might not be correct?

Just for completeness I want to add that these flags have no effect anymore; they are scheduled for removal this month.

Thinking about it a bit, I figured the pack itself might cause issues or interfere, so I stripped the script down to just generating an Entry from the tree and logging the size:

Maybe with this script you can git bisect over the codebase, which should yield the commit that breaks it with certainty.

My new best guess is that the miniz_oxide backend was perhaps producing stronger compression by default?

Byron avatar May 26 '25 03:05 Byron

Maybe with this script you can git bisect over the codebase, which should yield the commit that breaks it with certainty.

After updating the script to not depend on third-party packages and to exit with success/failure around a threshold of 25000 bytes (< 25000 = old, > 25000 = new), bisection points to 96164c5936032b4edb973828178cc55793dd57cc, which is the zlib-rs-as-default commit.

bisection script
// gix-pack/examples/entrysize.rs
use std::collections::BTreeMap;
use std::io::{Read as _, Write as _};

use gix_hash::ObjectId;
use gix_object::WriteTo as _;

type Store = BTreeMap<gix_hash::ObjectId, gix_object::Object>;

fn store(store: &mut Store, object: gix_object::Object) -> ObjectId {
    let mut ser = gix_hash::io::Write::new(Vec::new(), gix_hash::Kind::Sha1);
    ser.write_all(&object.loose_header()).unwrap();
    object.write_to(&mut ser).unwrap();
    let oid = ser.hash.try_finalize().unwrap();
    store.insert(oid, object);
    oid
}

fn hexencode(b: &[u8]) -> String {
    // Zero-padded so every byte yields two hex digits, matching hex::encode.
    b.iter().map(|b| format!("{b:02x}")).collect()
}

fn main() {
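    // /dev/urandom stands in for the rand crate so the script builds at any bisected commit.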
    let mut rng = std::fs::File::open("/dev/urandom").unwrap();

    let mut objects = Store::new();
    let oid = store(
        &mut objects,
        gix_object::Object::Blob(gix_object::BlobRef{ data: b"xoxo"}.into())
    );

    let mut files = (0..)
        .map(|_| {
            let mut buf = [0;10];
            rng.read_exact(&mut buf).unwrap();
            hexencode(&buf)
        })
        .take(1500)
        .collect::<Vec<String>>();
    files.sort();

    let mode = gix_object::tree::EntryKind::Blob.into();
    let t0 = store(&mut objects, gix_object::Tree {
        entries: files.iter().map(|fname| gix_object::tree::Entry {
            mode,
            oid,
            filename: fname.as_str().into(),
        }).collect(),
    }.into());

    let obj = &objects[&t0];
    let mut buf = Vec::new();
    obj.write_to(&mut buf).unwrap();
    let e = gix_pack::data::output::Entry::from_data(
        &gix_pack::data::output::Count::from_data(t0, None),
        &gix_object::Data::new(obj.kind(), &buf),
    ).unwrap();

    println!("compressed size {}", e.compressed_data.len());
    println!("decompressed size {}", e.decompressed_size);
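    // Exit 0 = good (old, small entry) / 1 = bad (new, large entry), suitable for git bisect run.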
    std::process::exit((e.compressed_data.len() > 25000).into());
}

xmo-odoo avatar May 26 '25 06:05 xmo-odoo

If the previous default was miniz_oxide, that does make sense: zlib-rs and zlib-ng have a much faster level 1, but that level also compresses much worse. Generally, zlib-rs compression level 2 should be both faster and compress better than stock zlib at level 1 (and I'd guess miniz_oxide follows the stock zlib compression levels). See also

https://github.com/trifectatechfoundation/zlib-rs/tree/main/libz-rs-sys#compression-levels

folkertdev avatar May 26 '25 07:05 folkertdev

I guess in that case the solution would be to compress the data on the application side and create the Entry by hand instead of using the from_data helper? (it's not like that's hugely difficult).
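
Something along these lines, as a minimal sketch: it assumes Entry's public fields (id, kind, decompressed_size, compressed_data) and entry::Kind::Base as found in current gix-pack, and uses flate2 directly for the application-side compression:

// sketch: build a pack entry with an explicit compression level
use std::io::Write as _;

use flate2::{write::ZlibEncoder, Compression};
use gix_pack::data::output;

// Builds a non-delta ("base") pack entry with a caller-chosen compression
// level, bypassing Entry::from_data and its built-in default.
fn entry_with_level(
    id: gix_hash::ObjectId,
    kind: gix_object::Kind,
    data: &[u8],
    level: Compression,
) -> std::io::Result<output::Entry> {
    let mut enc = ZlibEncoder::new(Vec::new(), level);
    enc.write_all(data)?;
    Ok(output::Entry {
        id,
        kind: output::entry::Kind::Base(kind),
        decompressed_size: data.len(),
        compressed_data: enc.finish()?,
    })
}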

xmo-odoo avatar May 26 '25 08:05 xmo-odoo

That's great to hear! Thanks also for the reproduction script!

I had a feeling that this might be down to the default compression settings, which behave differently in miniz_oxide.

A look at the Git source revealed that the issue ultimately stems from Gitoxide relying on a single hard-coded value, despite Git being more complex here.

For instance, there is a core.compression configuration value that we'd have to start passing through.

For starters, one might be able to change the compression level to Compression::default(), which is what Git does by default. Of course, through flate2 the level then is equal to 6 instead of -1 (which would leave the choice to the zlib implementation), but it's worth a shot. Edit: fast() might have been an artefact from miniz_oxide times, when one might not have been able to afford better (or default) compression.
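
For reference, flate2 only exposes fixed levels; there is no zlib-style -1 (Z_DEFAULT_COMPRESSION). The mapping is easy to check:

use flate2::Compression;

fn main() {
    assert_eq!(Compression::fast().level(), 1);
    assert_eq!(Compression::default().level(), 6);
    assert_eq!(Compression::best().level(), 9);
}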

Could @xmo-odoo help?

Byron avatar May 26 '25 08:05 Byron

For instance, there is a core.compression configuration value that we'd have to start passing through.

For starters, one might be able to change the compression level to Compression::default(), which is what Git does by default. Of course, through flate2 the level then is equal to 6 instead of -1 (which would leave the choice to the zlib implementation), but it's worth a shot. Edit: fast() might have been an artefact from miniz_oxide times, when one might not have been able to afford better (or default) compression.

core.loosecompression and pack.compression are also likely values of interest.

According to the git-config docs (the fallback logic is sketched in code after this list):

  • core.loosecompression defaults to core.compression, or 1 (best speed) if that's not set either
  • pack.compression defaults to core.compression or -1 (default ~ 6 in standard zlib) if that's not set either
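
That fallback chain, written out as plain logic (hypothetical helpers for illustration, not an existing gix API):

// -1 is zlib's "implementation default" (level 6 in practice), 1 is best speed.
fn effective_pack_compression(pack: Option<i8>, core: Option<i8>) -> i8 {
    // pack.compression -> core.compression -> -1
    pack.or(core).unwrap_or(-1)
}

fn effective_loose_compression(loose: Option<i8>, core: Option<i8>) -> i8 {
    // core.loosecompression -> core.compression -> 1
    loose.or(core).unwrap_or(1)
}

fn main() {
    assert_eq!(effective_pack_compression(None, None), -1);
    assert_eq!(effective_loose_compression(None, Some(4)), 4);
}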

So it might be better to remove the default compression level entirely, and require callers to pass in a compression level explicitly? Tests aside, there are only 3 call sites, which makes sense:

  • gix-pack's output::Entry
  • gix-odb's loose object store
  • gix-odb's Sink (I'll admit I'm not entirely sure why; benchmarking?)

Could @xmo-odoo help?

Maybe? I mostly use the lower-level packages, so I don't really know the organisation of the CLI. But I could probably update the APIs from the bottom up to add support for customisable compression, then follow the compilation errors until it makes sense to stop and retrieve the configuration data.

xmo-odoo avatar May 26 '25 10:05 xmo-odoo

core.loosecompression and pack.compression are also likely values of interest.

Definitely.

So it might be better to remove the default compression level entirely, and require callers to pass in a compression level explicitly? Tests aside, there are only 3 call sites, which makes sense:

Yes, that's the plan for the correct implementation.

However, I am open to setting the default value to something in between that adjusts for zlib-rs's higher performance, as a better trade-off between speed and compression ratio. Such a thing would be trivial to implement, but I'd certainly also welcome the full implementation that forces callers to pass the compression level.

Byron avatar May 26 '25 10:05 Byron