gitoxide icon indicating copy to clipboard operation
gitoxide copied to clipboard

The data representation of tree is lossy and prevents round-tripping

Open pierrechevalier83 opened this issue 9 months ago • 5 comments

Current behavior 😯

The data representation of Tree is lossy compared to the git representation, which means that some git trees cannot round-trip through git-oxide while remaining byte-for-byte identical.

Specifically, the entry mode in git is represented as a 6 digit byte string which represents the octal representation of the mode. For the EntryMode that corresponds to a Tree, the octal value is 0o040000u16, which can be represented as b"40000" or b"040000". Both representations occur in the wild, as shown in the test fixture: special-1.tree where both co-exist in the same tree representation.

In gitoxide, we parse this string as a u16 before getting to the EntryKind enum representation, where we loose information on whether or not the tree EntryMode was represented in git with a leading zero.

Expected behavior 🤔

EntryMode and EntryKind should hold on to at least one more bit of information (did the git byte string representation contain a leading zero?) so we can round-trip from git -> gitoxide -> git while preserving the byte representation.

Git behavior

The entry mode in git is represented as a 6 digit byte string which represents the octal representation of the mode. For the EntryMode that corresponds to a Tree, the octal value is 0o040000u16, which can be represented as b"40000" or b"040000".

Steps to reproduce 🕹

See this change to the test that verifies we can round-trip. It currently fails: From gix-object,

cargo test

fails with:

---- tree::from_bytes::special_trees stdout ----

thread 'tree::from_bytes::special_trees' panicked at gix-object/tests/object/tree/from_bytes.rs:107:9:
assertion `left == right` failed
  left: [52, 48, 48, 48, 48, 32, 99, 104, 114, 111, 109, 101, 0, 126, 87, 230, 62, 203, 152, 244, 156, 233, 186, 220, 52, 58, 28, 34, 169, 230, 175, 227, 116, 52, 48, 48, 48, 48, 32, 101, 100, 105, 116, 111, 114, 115, 0, 87, 15, 194, 205, 208, 108, 98, 188, 56, 180, 7, 251, 45, 138, 241, 73, 89, 164, 134, 70, 49, 48, 48, 54, 52, 52, 32, 106, 115, 98, 105, 110, 46, 106, 115, 0, 126, 117, 18, 40, 234, 142, 9, 4, 82, 165, 64, 249, 109, 30, 33, 21, 76, 250, 170, 246, 52, 48, 48, 48, 48, 32, 114, 101, 110, 100, 101, 114, 0, 190, 206, 230, 168, 101, 95, 18, 40, 56, 73, 21, 0, 149, 163, 200, 188, 95, 208, 160, 82, 52, 48, 48, 48, 48, 32, 118, 101, 110, 100, 111, 114, 0, 97, 13, 187, 16, 54, 17, 224, 60, 151, 2, 230, 118, 36, 31, 214, 237, 68, 155, 190, 96]
 right: [48, 52, 48, 48, 48, 48, 32, 99, 104, 114, 111, 109, 101, 0, 126, 87, 230, 62, 203, 152, 244, 156, 233, 186, 220, 52, 58, 28, 34, 169, 230, 175, 227, 116, 52, 48, 48, 48, 48, 32, 101, 100, 105, 116, 111, 114, 115, 0, 87, 15, 194, 205, 208, 108, 98, 188, 56, 180, 7, 251, 45, 138, 241, 73, 89, 164, 134, 70, 49, 48, 48, 54, 52, 52, 32, 106, 115, 98, 105, 110, 46, 106, 115, 0, 126, 117, 18, 40, 234, 142, 9, 4, 82, 165, 64, 249, 109, 30, 33, 21, 76, 250, 170, 246, 48, 52, 48, 48, 48, 48, 32, 114, 101, 110, 100, 101, 114, 0, 190, 206, 230, 168, 101, 95, 18, 40, 56, 73, 21, 0, 149, 163, 200, 188, 95, 208, 160, 82, 48, 52, 48, 48, 48, 48, 32, 118, 101, 110, 100, 111, 114, 0, 97, 13, 187, 16, 54, 17, 224, 60, 151, 2, 230, 118, 36, 31, 214, 237, 68, 155, 190, 96]
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    tree::from_bytes::special_trees

pierrechevalier83 avatar Mar 14 '25 11:03 pierrechevalier83

Thanks a lot for reporting, and for providing everything that's needed for a reproduction! Also, apologies for the late response.

It will be a while before I get to fixing it because I have plenty of unfinished in the queue right now, but I will get to it eventually.

My thinking here is twofold. Borrowed trees can just hold on to the mode-string itself, backed by a data-buffer, and convert the entry mode on the fly. The owned version would then probably get an additional field where it keeps track of these 6 bytes verbatim and inline, and the mode would be decoded from that on the fly as well, while being settable through a setter then. Or something like that 😁.

I do fear that there is more kinds of mode-byte sequences out there that can't be captured with a single bit, hence my tendency towards keeping the bytes themselves.

Byron avatar Mar 21 '25 05:03 Byron

keep track of these 6 bytes verbatim and inline

I agree, this sounds like the soundest way to resolve this issue and similar issues if they arise in the future.

pierrechevalier83 avatar Mar 21 '25 15:03 pierrechevalier83

Thanks! And please be my guest in tackling this before me - you may be faster, I am way behind everything now.

Byron avatar Mar 22 '25 03:03 Byron

And please be my guest in tackling this before me

I gave it a go in PR 1917. Got CI to a green state.

The main issues I may have introduced would be to depend on EntryMode where EntryModeRef may be more advisable, which may hurt performance.

I tried to feel the vibe of the various call-sites and made a judgement call in each case. Happy to hear feedback if my vibe reading was poor in some instances.

pierrechevalier83 avatar Apr 02 '25 16:04 pierrechevalier83

Thanks a lot, that's awesome!

The main issues I may have introduced would be to depend on EntryMode where EntryModeRef may be more advisable, which may hurt performance.

cargo bench shows that this unfortunately is the case, but I am hopeful that these significant slowdown can be reduced to just a small percentage, particularly once EntryMode is Copy again.

Byron avatar Apr 03 '25 01:04 Byron

Hi! I was spending some time looking for an issue to start working on - however this one looks like it has already been fixed, maybe worth closing?

arumugamramaswamy avatar Sep 19 '25 16:09 arumugamramaswamy