The data representation of tree is lossy and prevents round-tripping
Current behavior 😯
The data representation of Tree is lossy compared to the git representation, which means that some git trees cannot round-trip through git-oxide while remaining byte-for-byte identical.
Specifically, the entry mode in git is represented as a 6 digit byte string which represents the octal representation of the mode.
For the EntryMode that corresponds to a Tree, the octal value is 0o040000u16, which can be represented as b"40000" or b"040000". Both representations occur in the wild, as shown in the test fixture: special-1.tree where both co-exist in the same tree representation.
In gitoxide, we parse this string as a u16 before getting to the EntryKind enum representation, where we loose information on whether or not the tree EntryMode was represented in git with a leading zero.
Expected behavior 🤔
EntryMode and EntryKind should hold on to at least one more bit of information (did the git byte string representation contain a leading zero?) so we can round-trip from git -> gitoxide -> git while preserving the byte representation.
Git behavior
The entry mode in git is represented as a 6 digit byte string which represents the octal representation of the mode.
For the EntryMode that corresponds to a Tree, the octal value is 0o040000u16, which can be represented as b"40000" or b"040000".
Steps to reproduce 🕹
See this change to the test that verifies we can round-trip.
It currently fails:
From gix-object,
cargo test
fails with:
---- tree::from_bytes::special_trees stdout ----
thread 'tree::from_bytes::special_trees' panicked at gix-object/tests/object/tree/from_bytes.rs:107:9:
assertion `left == right` failed
left: [52, 48, 48, 48, 48, 32, 99, 104, 114, 111, 109, 101, 0, 126, 87, 230, 62, 203, 152, 244, 156, 233, 186, 220, 52, 58, 28, 34, 169, 230, 175, 227, 116, 52, 48, 48, 48, 48, 32, 101, 100, 105, 116, 111, 114, 115, 0, 87, 15, 194, 205, 208, 108, 98, 188, 56, 180, 7, 251, 45, 138, 241, 73, 89, 164, 134, 70, 49, 48, 48, 54, 52, 52, 32, 106, 115, 98, 105, 110, 46, 106, 115, 0, 126, 117, 18, 40, 234, 142, 9, 4, 82, 165, 64, 249, 109, 30, 33, 21, 76, 250, 170, 246, 52, 48, 48, 48, 48, 32, 114, 101, 110, 100, 101, 114, 0, 190, 206, 230, 168, 101, 95, 18, 40, 56, 73, 21, 0, 149, 163, 200, 188, 95, 208, 160, 82, 52, 48, 48, 48, 48, 32, 118, 101, 110, 100, 111, 114, 0, 97, 13, 187, 16, 54, 17, 224, 60, 151, 2, 230, 118, 36, 31, 214, 237, 68, 155, 190, 96]
right: [48, 52, 48, 48, 48, 48, 32, 99, 104, 114, 111, 109, 101, 0, 126, 87, 230, 62, 203, 152, 244, 156, 233, 186, 220, 52, 58, 28, 34, 169, 230, 175, 227, 116, 52, 48, 48, 48, 48, 32, 101, 100, 105, 116, 111, 114, 115, 0, 87, 15, 194, 205, 208, 108, 98, 188, 56, 180, 7, 251, 45, 138, 241, 73, 89, 164, 134, 70, 49, 48, 48, 54, 52, 52, 32, 106, 115, 98, 105, 110, 46, 106, 115, 0, 126, 117, 18, 40, 234, 142, 9, 4, 82, 165, 64, 249, 109, 30, 33, 21, 76, 250, 170, 246, 48, 52, 48, 48, 48, 48, 32, 114, 101, 110, 100, 101, 114, 0, 190, 206, 230, 168, 101, 95, 18, 40, 56, 73, 21, 0, 149, 163, 200, 188, 95, 208, 160, 82, 48, 52, 48, 48, 48, 48, 32, 118, 101, 110, 100, 111, 114, 0, 97, 13, 187, 16, 54, 17, 224, 60, 151, 2, 230, 118, 36, 31, 214, 237, 68, 155, 190, 96]
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
failures:
tree::from_bytes::special_trees
Thanks a lot for reporting, and for providing everything that's needed for a reproduction! Also, apologies for the late response.
It will be a while before I get to fixing it because I have plenty of unfinished in the queue right now, but I will get to it eventually.
My thinking here is twofold. Borrowed trees can just hold on to the mode-string itself, backed by a data-buffer, and convert the entry mode on the fly. The owned version would then probably get an additional field where it keeps track of these 6 bytes verbatim and inline, and the mode would be decoded from that on the fly as well, while being settable through a setter then. Or something like that 😁.
I do fear that there is more kinds of mode-byte sequences out there that can't be captured with a single bit, hence my tendency towards keeping the bytes themselves.
keep track of these 6 bytes verbatim and inline
I agree, this sounds like the soundest way to resolve this issue and similar issues if they arise in the future.
Thanks! And please be my guest in tackling this before me - you may be faster, I am way behind everything now.
And please be my guest in tackling this before me
I gave it a go in PR 1917. Got CI to a green state.
The main issues I may have introduced would be to depend on EntryMode where EntryModeRef may be more advisable, which may hurt performance.
I tried to feel the vibe of the various call-sites and made a judgement call in each case. Happy to hear feedback if my vibe reading was poor in some instances.
Thanks a lot, that's awesome!
The main issues I may have introduced would be to depend on
EntryModewhereEntryModeRefmay be more advisable, which may hurt performance.
cargo bench shows that this unfortunately is the case, but I am hopeful that these significant slowdown can be reduced to just a small percentage, particularly once EntryMode is Copy again.
Hi! I was spending some time looking for an issue to start working on - however this one looks like it has already been fixed, maybe worth closing?