libgit2
libgit2 copied to clipboard
Shallow support
This PR enhance our shallow object support in the following ways :
- we behave better when revwalking (we don't error with
GIT_ENOTFOUND
). - we don't return objects with known-missing parents (since our repository is shallow and thus doesn't have those objects).
Supersedes #3853, depends on #4445 and #4446.
~~This PR adds graft support at the ODB layer, with git_repository
being responsible for telling git_odb
about known grafts. Shallow object support is then grafted (heh) on top of that.~~
Notes :
~~- I'm not that convinced that having grafts at the ODB layer makes sense. FWIW, the crux of the issue is that object parsing is not done when objects are returned by the ODB, but by the caller (usually the object
parse machinery, but revwalk
also perform crude parsing for obvious reasons).~~
- @carlosmn pointed out the preferred approach (ie. that's what
git
does) is to have grafts be "transparent". IMHO, reporting an error in that case might help in the following cases :- every object parsed is checked against the list of known grafts (so perf). The error code would just be a subset of
GIT_ENOTFOUND
. - as a library user, I would have to recheck every object to know if I'm looking at a graft (with the caveat that this PR doesn't actually exposes grafts, only shallow commits). Having the error code makes me automatically aware (and I can just treat it as
GIT_EITEROVER
if I don't actually care).
- every object parsed is checked against the list of known grafts (so perf). The error code would just be a subset of
I'll keep working on handling protocol part (eg. git clone --depth
and git fetch --unshallow
), which doesn't really matter for correct handing of grafts & shallow commits (at least locally).
Just realized I should ping #3058 so we don't get duplicated work. Also, I have a local branch with the beginning of the protocol support which I'll happily publish if someone plans to work on that.
Is there any progress on this merge? Looking to see a fix for shallow clones!
I'd appreciate a quick technical "looks good" review, since actually shallow support at the protocol level will depend on some of the infrastructure I laid out here — except I'll need an atomic way of updating the shallow file which wasn't needed by the local-only fix.
So, here are the questions :
- ⚠️ GitHub is utterly confused about the commit ordering in this PR.
- commits 9150c16, 2197072, 5f00ac7 are about merging the commit parsing done when revwalking with the "standard" object-lookup one, and could be submitted separately.
- commit d057213 adds grafts registration support at the
git_odb
level, which is used by 8a76d68 aea2e72 c54ea4c and 82c3a5f. I'm thinking this could be moved up to thegit_repository
level, opinions ? - 974d720 and 7df1483 handle shallow objects on top of the aforementioned graft API.
I'll repeat what I said on Slack for the record : since I'm coming at this from the shallow-support angle, which cgit
uses grafts for, I've tried to keep it logical and do the same thing. This means I'm clueless about the git-replace
stuff, apart from what I've read from the manual : it "grafts" complete objects, and uses refs to keep track of them, hence those "replacement objects" can be pushed/pulled around, while grafts (and thus shallow objects) can't.
This PR only exists because of shallow-support, hence I've checked how cgit
does it, and it does not use git-replace
for that, so I considered git-replace
support "out of scope".
So one thing I'm a bit confused about is how grafting and replace work with each other or against each other. As far as I understand it, a graft is a new commit with the exact same content as another commit, but with parents replaced.
"replaces" are real objects and AFAIU work like you expected. "grafts" are just some OIDs loaded from .git/info/grafts
, which are used to override a commit's parents…
The thing to note here is that this graft is part of the ODB, that is it is a commit object part of the ODB with its own OID.
…thus, no. "grafts" have no real existence (IOW object form). I located my graft function at the ODB level because 1) I got confused thinking that .git/info/
was part of the odb, 2) it felt the most obvious place I could tack them on and impact both revwalking and object lookups.
If I'm not mistaken, you are replacing the old logic used by grafts, which weren't using the new replacement logic. Was this a knowing decision, and if so why?
"replacing" I don't understand. If you meant "implementing", then yes, because that's how cgit
does it.
We seem to agree on how git-replace
works. I'll just add that I'm not sure cgit
"grafts" anything when it's used, since I didn't see anything related to replaces in the graft-handling code (% overlook probability), hence why I consider git-replace
out of scope.
I'd be happy to discuss what final API we want : for shallow support, I only need "internal" things, so I don't particularly care (I mean, I could not provide any public API apart from the list of shallow roots and be done with it 😉). Right now, that's handled by git_odb__graft_register
, which could be made public.
I'll reinvestigate the layering I chose : because of my 2 points above, I might not need to go near the odb at all, but I have to consider a .git/shallow
/.git/info/grafts
file appearing "somehow" (IOW caching, fun things™)
If it's warranted I can split out that PR in 3 pieces (already have them locally, and if it fixes GitHub's confusion that's be even better) : the commit parser merging (because of #4374 and being generally cleanup), the graft support itself (this), and shallow support (which requires this).
Rebased. I managed to remove most of the "weirdness" by moving the grafts at the repo level.
Okay, so I think I've been confused a bit by grafts and replaces. What makes me curious is the following part from git-replace(1):
--graft
[ ...] Create a graft commit. A new commit is created with the same content as except that its parents will be [ ...] instead of 's parents. A replacement ref is then created to replace with the newly created commit. See contrib/convert-grafts-to-replace-refs.sh for an example script based on this option that can convert grafts to replace refs.
This paragraph implies that the grafted commit is stored inside of the ODB, as otherwise it would be impossible to put the reference into place. From what you say, it looks to me like there's two different things here:
- an in-memory grafted commit which is not written into the ODB, used by the old grafts
- an in-ODB grafted commit which is written into the ODB, used by the new replace mechanism
So I guess both grafting and replaces use the same underlying mechanism of grafting, but differ only in how they are being referenced and persisted. While old-style grafts make use of out-of-band information, the new replace mechanism is simpler by making use of the ODB and refdb. It would be nice though if we could also check how git.git does it, instead of relying on another third-party implementation.
Anyway. Seeing that both replaces and grafts make use of the same mechanisms, it would be nice if they shared some common logic with each other later on.
Your interpretation of git-replace
looks correct to me. That --graft
option creates a "new style" replace object that works the same as the "old style" graft mechanism (same name, different implementation).
What I'm sure of is that cgit
shallow objects use old-style grafts, which don't appear in the ODB at all (or you'd have strange things happen when pushing a shallow clone back to its origin). This means that, even though the old graft mechanism could be considered "deprecated" in the face of "git-replacing", which is superior, it's still used by the shallow code because you don't want the user to mess up its shallow clones.
I think that because of that specific point, both "object replacement methods" cannot share the same implementation, but I'll have to investigate how git-replace(1) actually works and what kind of functionality it uses to be sure.
Okay, that's most welcome, thanks a lot.
By the way, I'm all in favor of splitting this up, as you proposed. Especially if you've already got the parts ready it's always easier to get smaller batches reviewed and merged than one big batch.
@pks-t Okay, I've split the 3 logical parts off. Sadly I can't really base them on top of one another because they're in my fork (and opening PRs against that would be silly), so it's not as clean as we might have wanted.
One thing I'd like to discuss still is how we report those shallow objects. Right now, this is completely "transparent" (as per spec, may I say), but you, as a user, would have no way of knowing whether this is a "pristine commit" or a shallow one, without suspiciously checking the OID of every "root" commit against the list of shallow roots. For example, in GitX, I'd be able to open a prompt asking whether the user wants a few more commits loaded when revwalking (or something). Arguably, I might be overthinking it though 😉.
All 3 parts have just been updated, fixing a few leaks and adding some more tests (mostly to the graft API). I've also moved the graft functions in their own file instead of having them in commit.c/h.
This makes the local part of shallow support complete, so libgit2 can work with grafted/shallowed repositories. I'm still undecided on whether git_revwalk_next
on a grafted commit should return GIT_EGRAFT
or GIT_EITEROVER
though (the latter is currently done).
This full PR/patch set has been rebased, and should be ready for review.
I'm drawing a line in the sand as to what this is about: this only makes detection of shallow repositories possible. I'll do another PR about the "protocol" parts of shallow when I feel it's ready, because it's (arguably) harder. Hence, some safety measures are missing: I think a fetch done by us from a shallow repo would unshallow it, since we can't tell the remote about what we're purposefully not interested in (though it could also confuse the server 😬 ?). Same for pushing, though I'm clueless about what's involved there. It might be worthwhile to prevent misbehaving by erroring when those operations are tried on a shallow repository.
My point about checking for a grafted commits when a revwalk ends is also still undecided. I like "erroring differently" better, but I understand that might be considered an API-break. Maybe git_commit_is_grafted
or something like it…
What's the status of this? Really looking forward to shallow clone support ...
Is there any progress on this problem? There are lots of issues that are demand on this feature.
Ping @pks-t.
As far as I understand this stops git revwalk from exiting with an error if a local commit is missing. @carlosmn @ethomson @pks-t What needs to be done to get this PR merged?
If OP has gone inactive since submitting this let me know and I'll be happy to take over with a new PR.
@tiennou is one of the most active people nowadays in the project, so I'd be surprised if he was inactive now. But open source being what it is it might still take some time ;)
It's likely I'll only able to take a look at it in a couple of days. No sweat for the "late" review @pks-t, better late than never ! Frankly, if you feel up for it @Qix- I'll gladly defer the PR/polishing part, as I'd personally prefer to keep working on the ssh backend issue — as the review shows it's been a little while.
Just a quick discussion point : as the overarching goal is to be git
-like, I don't really feel compelled to provide anything more featureful than what's needed to support shallow repositories, with the rationale that git has gained git-replace
(a.k.a. the refdb-based grafts, which do look nicer). I am thus confused — to me grafts are a low-level detail of a since-replaced-by-a-better-version system, and don't really warrant a more precise cache, or more attention than that. Maybe I'm missing something though, so I'm open to usecases 😉.
I am thus confused — to me grafts are a low-level detail of a since-replaced-by-a-better-version system, and don't really warrant a more precise cache, or more attention than that. Maybe I'm missing something though, so I'm open to usecases .
I fully agree that we shouldn't blow up the grafts mechanism more than needed. But my intention is to have it as self-contained as possible to not spill grafts-related logic into other places. Thus the standalone git_grafts
struct, git_grafts__refresh
, etc. This will make it more maintainable and help us if there is ever a request to expose some of its functions it to users.
I'll create a separate PR that builds on yours to show what I imagine it to look like :)
@pks-t This can be closed, I believe.