
Fetch and clone support (bare)

Open Byron opened this issue 3 years ago • 4 comments

We want shallow clones and this issue tracks what needs to be done to get there.

Prerequisite tasks for bare clones

  • [x] #474
  • [x] #473
  • [x] #475
  • [x] match refspecs for fetching
  • [x] fetch pack for update (#548)
  • [x] #551
  • [x] http transport configuration (i.e. proxy settings, timeouts, useragent) (#586)
  • [x] get progress message by stable id
  • [x] ~~unbuffered progress messages~~ - output is buffered per line, but no more than that, so we already receive everything in real time.
  • [x] #627
  • [x] support for classifying spurious errors in error return types
  • [x] auto-tags
  • [x] ditch naive negotiation in favor of proper consecutive one (or else clones from some servers may fail) via #861

Follow-ups to ditching the naive negotiation implementation

Most of these are optional, but represent opportunities to make gix better, so shouldn't be lost.

  • [x] #883
  • [x] #892
  • [x] #887 ~~see if commit_graph() can return our own type connected to Repo, or if the graph can be made to be more convenient to use with gix::Id~~ - ~~not really, but getting traversal with commitgraph support would be great. Probably it can simply be retro-fitted to the existing traversal. But then again, it would speed up generating ids, but most people using that kind of traversal would just want to access commits plainly, which forces loading them anyway. So it's probably OK to keep it as is.~~ - retro-fitted commit-graph support, because it will be useful to some
  • [x] #893
  • [x] #897 (initial version with tracing)
  • [x] gix corpus with a little more to do

Additional tasks

These are for correctness, but don't block cargo integration as no cargo tests depend on them.

  • [x] allow downgrading connections like git does; that should be no problem. Maybe find a way to let the user enforce protocol versions; let's see how git does it.
  • [x] make it possible to not send streaming posts - streaming is only needed for posting packs, and some git servers can't handle the 'chunked' encoding that results from it. Notably, git itself uses content-length because it pre-creates the buffer in memory (see the sketch after this list).
  • [x] additional HTTP configuration as per cargo configuration
  • [ ] correctly re-obtain credential helper configuration for each URL (but don't rewrite, it's Remote's only)
  • [ ] make pack tempfiles appear like they do in git to help with cleanup in case of SIGKILL.
  • [ ] ability to turn off 'is currently checked out' sanity check to emulate git fetch --update-head-ok. Cargo passes it to the CLI and maybe it's something we will need too just to make its updates work.
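
For the non-streaming-post item above, a minimal sketch of the idea (not gitoxide's actual transport code): buffer the pack body in memory first so the request can carry a Content-Length header, instead of streaming it with the 'chunked' transfer encoding that some servers reject.

```rust
use std::io::{self, Read, Write};

/// Send a POST whose body is fully buffered up front so a Content-Length
/// header can be used instead of `Transfer-Encoding: chunked`, which some
/// git servers reject. The content type shown is the one used when pushing
/// packs over smart HTTP.
fn post_pack_buffered<W: Write, R: Read>(
    mut out: W,
    host: &str,
    path: &str,
    mut pack: R,
) -> io::Result<()> {
    // Pre-create the whole body in memory, like git does, so its length is known.
    let mut body = Vec::new();
    pack.read_to_end(&mut body)?;

    write!(
        out,
        "POST {path} HTTP/1.1\r\n\
         Host: {host}\r\n\
         Content-Type: application/x-git-receive-pack-request\r\n\
         Content-Length: {}\r\n\r\n",
        body.len()
    )?;
    out.write_all(&body)?;
    out.flush()
}
```

Buffering trades memory for compatibility; streaming remains preferable whenever the server is known to handle chunked requests.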

Tasks for proper transport configuration

  • [ ] try to implement complex http.<url>.* based option overrides
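
To make 'complex' a bit more concrete: for http.<url>.* keys, git applies the entry whose URL matches the request URL most specifically. The toy below only models plain prefix matching; real git also normalizes URLs and considers scheme, host wildcards, port, user name and path-component boundaries.

```rust
use std::collections::HashMap;

/// Pick the value of an `http.<url>.<key>` override whose URL matches the
/// request URL best. Deliberately simplified: only raw prefix matching,
/// with the longest (most specific) prefix winning.
fn http_override<'a>(
    overrides: &'a HashMap<String, String>, // "<url>" -> value for one key, e.g. http.<url>.proxy
    request_url: &str,
) -> Option<&'a str> {
    overrides
        .iter()
        .filter(|(url, _)| request_url.starts_with(url.as_str()))
        .max_by_key(|(url, _)| url.len())
        .map(|(_, value)| value.as_str())
}

fn main() {
    // Hypothetical proxy overrides, for illustration only.
    let mut proxy = HashMap::new();
    proxy.insert("https://example.com/".to_string(), "http://proxy-a:8080".to_string());
    proxy.insert("https://example.com/internal/".to_string(), "http://proxy-b:8080".to_string());

    // The more specific URL wins for requests below /internal/.
    assert_eq!(
        http_override(&proxy, "https://example.com/internal/repo.git"),
        Some("http://proxy-b:8080")
    );
}
```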

Tasks for shallow cloning

Research needed, but the libgit2 issue might be helpful for more hints.

  • #765
  • #770

Research

  • a nice overview document
  • packs are forced non-thin when .git/shallow is present (it contains the commits that form the shallow boundary: present, but without parents; see the parsing sketch after this list)
  • shallow repositories can be cloned from and remotes send that information along, making the clone shallow, too.
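
For reference, .git/shallow is a plain text file with one commit hash per line, listing exactly those boundary commits. A minimal reader, assuming full SHA-1 hashes:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Read the shallow-boundary commits recorded in `.git/shallow`, one hex
/// hash per line. An absent file simply means the repository is not shallow.
fn read_shallow_boundary(git_dir: &Path) -> io::Result<Vec<String>> {
    let content = match fs::read_to_string(git_dir.join("shallow")) {
        Ok(c) => c,
        Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(Vec::new()),
        Err(e) => return Err(e),
    };
    Ok(content
        .lines()
        .map(str::trim)
        // Keep only lines that look like full SHA-1 hashes (40 hex characters).
        .filter(|l| l.len() == 40 && l.chars().all(|c| c.is_ascii_hexdigit()))
        .map(str::to_owned)
        .collect())
}

fn main() -> io::Result<()> {
    for id in read_shallow_boundary(Path::new(".git"))? {
        println!("shallow boundary: {id}");
    }
    Ok(())
}
```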

Watch out

  • Much of this work is happening in git-repository, which is tracked in #470.
  • subsequent fetches must not accidentally change the depth of the repository, but only fetch what changed in between. See point 2 in this comment. Note that I believe the pathological CPU usage shallow clones caused on the server has been fixed by now.
  • Ed Page states that according to GitHub employees, shallow clones are only expensive if depth > 1 or converting it back to having full history.

Byron avatar Jul 01 '22 03:07 Byron

I recently encountered problems cloning a large repository over an extremely slow data link. After a certain timeout, the server (or an intermediate proxy) terminated the connection.

Each time, the server generated a huge batch of objects for the head commit (in fact, to get a commit, you need to get all of its objects, even those introduced in earlier commits). Git gets an error and doesn't unpack the truncated response; it has to be unpacked manually. The list of 'have' directives in the protocol request didn't help me either. I had to learn the low-level protocol and recursively fetch each tree object (one at a time, until disconnect) and then the missing blob objects (in batches).

Please, when implementing the feature, optimize the algorithm so that already-transmitted data is not thrown away when the connection is broken.

(p.s. this shallow cloning took me 24 GB over 1 week)

chazer avatar Jul 24 '22 04:07 chazer

Each time, the server generated a huge batch of objects for the head commit (in fact, to get a commit, you need to get all of its objects, even those introduced in earlier commits).

Did you try the --depth 1 option? With that, git would prepare only the objects that are relevant to the commit at the requested reference. In conjunction with the --filter option this allows splitting clones into receiving only trees at first and then filling in the blobs in a separate step. That way it's even possible to obtain the entire history as commits only, with trees and blobs just for the most recent commit.
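
For the record, the suggested combination looks like this when driving the git CLI (a sketch via std::process::Command rather than gitoxide's own API; the URL is hypothetical):

```rust
use std::io;
use std::process::Command;

/// Clone only the most recent commit, with trees but without blobs.
/// Blobs are then fetched lazily when git needs them (e.g. at checkout).
/// Using `--filter=tree:0` instead would also omit trees, keeping commits only.
fn shallow_blobless_clone(url: &str, dir: &str) -> io::Result<bool> {
    let status = Command::new("git")
        .args(["clone", "--depth", "1", "--filter=blob:none", url, dir])
        .status()?;
    Ok(status.success())
}

fn main() -> io::Result<()> {
    // Hypothetical repository URL, for illustration only.
    shallow_blobless_clone("https://example.com/big/repo.git", "repo")?;
    Ok(())
}
```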

Git gets an error and doesn't unpack the truncated response.

That's true - the reason might be that it is unable to validate the received objects, as the trailing hash of the received pack would be missing. However, I have also been burned by this, which is why there is a special restore mode when receiving a pack. It salvages the received objects at least.

However, the way the git protocol works, the server may still send all the objects the next time the reference is requested, as the algorithm's granularity is only per commit. With partial packs, it's entirely unclear which objects are present and which aren't unless they are all traversed and verified. So, in order to actually benefit from keeping a partial pack, one would have to determine which commits are completely available (while handling --filter correctly, I presume), to then be able to avoid having these complete commits resent. Of course there is no guarantee that any commit is actually complete, due to the way objects are sorted into a pack, which optimizes for compression rather than for 'distance to the owning commit'.
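
To make the 'which commits are completely available' idea concrete, here is a toy sketch around a hypothetical ObjectStore trait (not a gitoxide API): a commit only counts as complete if every object reachable from it is present locally, which is exactly the traversal cost described above.

```rust
use std::collections::{HashSet, VecDeque};

/// A deliberately tiny object-store abstraction for this sketch; not a gitoxide API.
trait ObjectStore {
    /// Is this object present in the (possibly partial) object database?
    fn contains(&self, id: &str) -> bool;
    /// Ids directly referenced by this object (a commit's root tree, tree entries, ...).
    fn referenced(&self, id: &str) -> Vec<String>;
}

/// A commit is only safe to treat as locally available if *all* objects
/// reachable from it are present. With a truncated pack there is no shortcut:
/// everything has to be traversed and checked.
fn is_complete(store: &impl ObjectStore, commit_id: &str) -> bool {
    let mut seen = HashSet::new();
    let mut queue = VecDeque::new();
    queue.push_back(commit_id.to_owned());
    while let Some(id) = queue.pop_front() {
        if !seen.insert(id.clone()) {
            continue;
        }
        if !store.contains(&id) {
            return false;
        }
        queue.extend(store.referenced(&id));
    }
    true
}
```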

That said, there are a bunch of things one could implement to help this case if client and server supported some custom extensions.

I had to learn the low-level protocol and recursively fetch each tree object (one at a time, until disconnect) and then the missing blob objects (in batches).

Awesome, I love it! I would have given up for sure!

Please, when implementing the feature, optimize the algorithm so that already-transmitted data is not thrown away when the connection is broken.

It would certainly be interesting to learn more about the algorithm you used to split up big clones into many smaller ones, as it wouldn't require a server- and client-side extension to the protocol. Such a client-side-only algorithm could then possibly be implemented in gitoxide, and I am open to that, too.
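
One client-side-only strategy in this spirit, sketched against the git CLI (the flags are real git options, while the step size and retry policy are just examples): start from a depth-1 clone and repeatedly deepen, so a dropped connection only wastes the small pack currently in flight instead of the whole transfer.

```rust
use std::io;
use std::process::Command;

/// Grow a shallow clone step by step. Each `--deepen` round transfers a
/// comparatively small pack, so a dropped connection only wastes that round.
fn incremental_clone(url: &str, dir: &str, step: u32, rounds: u32) -> io::Result<()> {
    let ok = Command::new("git")
        .args(["clone", "--depth", "1", url, dir])
        .status()?
        .success();
    if !ok {
        return Err(io::Error::new(io::ErrorKind::Other, "initial shallow clone failed"));
    }
    for round in 0..rounds {
        let deepened = Command::new("git")
            .args(["-C", dir, "fetch"])
            .arg(format!("--deepen={step}"))
            .arg("origin")
            .status()?
            .success();
        if !deepened {
            // A failed round can simply be retried; history deepened so far is kept.
            eprintln!("deepening round {round} failed, stopping for now");
            break;
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // Hypothetical URL and parameters, for illustration only.
    incremental_clone("https://example.com/big/repo.git", "repo", 50, 100)
}
```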

Byron avatar Jul 25 '22 01:07 Byron

Is this actually complete despite the unfinished tasks in the OP?

ofek avatar Jan 20 '24 15:01 ofek

It works for all intents and purposes but isn't perfect in some details. These are still tracked here; maybe they can be moved into a follow-up issue.

Byron avatar Jan 20 '24 16:01 Byron