
Smart-GC - build a big pack from smaller ones (incrementally) as file handles get scarce

Open Byron opened this issue 5 years ago • 4 comments

It's too easy to get bitten by git pull not working anymore because git can no longer perform its operations. Each fetch MAY create a new pack in libgit2, and even if it doesn't and the fetched pack is exploded into loose objects instead, usually 10_000 of these warrant the creation of a new pack. It's just a matter of time until there are too many packs for the git repository to memory-map all of them for fast access.

Gitoxide should be better and either GC automatically once files cannot be opened anymore due to insufficient handles, or track its own usage to know when handles are about to get tight and run a fast, incremental GC that combines a few packs into one.

Another avenue would be to try to map only packs that actually contain objects we are interested in, avoiding mapping all of them by default, if that's even possible or viable.
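
For illustration only, here is a minimal sketch of what the "track its own usage" variant could look like, using nothing but the standard library. The thresholds and function names are made up and would have to be derived from the process' real handle budget and from measurement:

```rust
use std::fs;
use std::io;
use std::path::Path;

// Hypothetical thresholds - real values would come from the actual
// file-handle budget (getrlimit) and from benchmarks.
const MAX_PACKS_BEFORE_REPACK: usize = 50;
const MAX_LOOSE_BEFORE_REPACK: usize = 10_000;

/// Count `*.pack` files in `.git/objects/pack`.
fn count_packs(git_dir: &Path) -> io::Result<usize> {
    let pack_dir = git_dir.join("objects").join("pack");
    if !pack_dir.is_dir() {
        return Ok(0);
    }
    Ok(fs::read_dir(pack_dir)?
        .filter_map(Result::ok)
        .filter(|e| e.path().extension().map_or(false, |ext| ext == "pack"))
        .count())
}

/// Count loose objects in the two-hex-digit fan-out directories.
fn count_loose(git_dir: &Path) -> io::Result<usize> {
    let mut n = 0;
    for entry in fs::read_dir(git_dir.join("objects"))?.filter_map(Result::ok) {
        let name = entry.file_name();
        let name = name.to_string_lossy();
        if name.len() == 2 && name.chars().all(|c| c.is_ascii_hexdigit()) {
            n += fs::read_dir(entry.path())?.filter_map(Result::ok).count();
        }
    }
    Ok(n)
}

/// The 'smart' part: decide whether an incremental repack is due *before*
/// the next operation tries to map every pack and runs out of handles.
fn needs_incremental_gc(git_dir: &Path) -> io::Result<bool> {
    Ok(count_packs(git_dir)? > MAX_PACKS_BEFORE_REPACK
        || count_loose(git_dir)? > MAX_LOOSE_BEFORE_REPACK)
}

fn main() -> io::Result<()> {
    if needs_incremental_gc(Path::new(".git"))? {
        // A real implementation would combine only the few smallest packs
        // into one, instead of rewriting everything like `git gc` does.
        println!("incremental repack recommended");
    }
    Ok(())
}
```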

Related to https://github.com/rust-lang/docs.rs/pull/975, and probably many more.

Byron avatar Aug 16 '20 13:08 Byron

try to map only packs that actually have objects we are interested in, avoiding to map all by default

I think that would be the ideal solution for docs.rs. Right now we have giant spikes in the number of files opened every time we update the index:

[graph: number of open files on docs.rs over time, spiking on each index update]

It's gone down a lot since we started running GC (it used to be 3k files!) but it's still fairly high. I didn't realize crates-index-diff was loading every file, that would explain it :P

There might still need to be some way to GC since otherwise the number of packfiles would keep going up over time and operations on the index would get slower.

GC automatically once files cannot be opened anymore due to insufficient handles

That wouldn't be great, since (for docs.rs) it means that we can't serve any web requests until gitoxide notices the difference and finishes performing the GC - we only call it every minute.

jyn514 avatar Aug 16 '20 14:08 jyn514

Thanks for chiming in! I particularly like getting to see these spikes with my own eyes for once! Curious how that will look with gitoxide one fine day.

I didn't realize crates-index-diff was loading every file, that would explain it :P

No no :), it doesn't. All it does is use the high-level git2 API to fetch and to diff.

There might still need to be some way to GC since otherwise the number of packfiles would keep going up over time and operations on the index would get slower.

True, 'smart' really needs to mean exactly that, and do 'the right thing™️' so performance doesn't noticeably degrade over time. I have yet to see how performance turns out for common operations as well.

That wouldn't be great, since (for docs.rs) it means that we can't serve any web requests until gitoxide notices the difference and finishes performing the GC - we only call it every minute.

In my perfect naive world, it can do an incremental GC to avoid slowdowns and serve the request in the same or less time than libgit2, so it appears entirely free. Too many loose objects can also be an issue, so it will probably have to deal with both: too many loose objects and too many packs.
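
For the sake of argument, a minimal sketch of that fallback flow, where fetch_and_diff() and incremental_gc() are purely hypothetical placeholders and the hard-coded errno is only for illustration:

```rust
use std::io;

/// Hypothetical stand-in for whatever needs many memory-mapped packs,
/// e.g. fetching the index and diffing it.
fn fetch_and_diff() -> io::Result<()> {
    Ok(())
}

/// Hypothetical incremental GC: combine a few small packs into one and
/// pack up loose objects, without rewriting the whole repository.
fn incremental_gc() -> io::Result<()> {
    Ok(())
}

/// If the operation fails because the process ran out of file handles,
/// free some by repacking and retry once. EMFILE is errno 24 on Linux and
/// macOS; a portable implementation wouldn't hard-code it like this.
fn with_handle_pressure_fallback() -> io::Result<()> {
    match fetch_and_diff() {
        Err(e) if e.raw_os_error() == Some(24) => {
            incremental_gc()?;
            fetch_and_diff()
        }
        other => other,
    }
}

fn main() -> io::Result<()> {
    with_handle_pressure_fallback()
}
```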

Byron avatar Aug 16 '20 14:08 Byron

No no :), it doesn't. All it does it use the high-level git2 API to fetch and to diff.

❤️

In my perfect naive world, it can do an incremental GC to avoid slowdowns and serve the request in the same or less time than libgit2, so it appears entirely free.

Oh I was confusing myself - we'd only hit the file limit when we call crates-index-diff. So if it can notice hitting the limit automatically then it can run gc immediately, and there'd be very little time when we were at the limit.

I'd still prefer for it to be no time, but that sounds like it could work.

jyn514 avatar Aug 16 '20 14:08 jyn514

I think I will (in time) be writing a test for this - clone the index locally at a long-past commit, simulate the parent repository gaining a few commits, then pull, and repeat while watching how files pile up with the current implementation.

Repeat with gitoxide to verify it really is working. This could even be a benchmark of some sort. Anyway, I am convinced that with our own implementation of the entire machinery it's possible to find new and better ways of dealing with this. criner has the very same problem, and right now I am running gc manually whenever fetches start to fail consistently. Argh, so annoying :D.
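
A rough sketch of the measurement half of such a test - it only watches how pack files pile up after each fetch and leaves out simulating the parent repository gaining commits; the repository path and round count are made up:

```rust
use std::path::Path;
use std::process::Command;

/// Count directory entries, treating a missing directory as empty.
fn count_files(dir: &Path) -> usize {
    std::fs::read_dir(dir)
        .map(|it| it.filter_map(Result::ok).count())
        .unwrap_or(0)
}

fn main() {
    // Assumes an existing clone of the crates.io index, reset to a commit
    // far in the past (e.g. `git reset --hard <old-commit>`).
    let repo = Path::new("crates.io-index");
    let pack_dir = repo.join(".git/objects/pack");

    for round in 1..=100 {
        // Each fetch pulls in whatever the remote gained in the meantime.
        // Depending on `transfer.unpackLimit`, small fetches are exploded
        // into loose objects while larger ones land as new packs.
        let status = Command::new("git")
            .current_dir(repo)
            .args(["fetch", "origin"])
            .status()
            .expect("failed to run git");
        assert!(status.success());

        println!(
            "round {:>3}: {} files in .git/objects/pack",
            round,
            count_files(&pack_dir)
        );
    }
}
```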

Byron avatar Aug 16 '20 14:08 Byron