
Indexer is unable to cope with very large repositories

Open AHSauge opened this issue 5 months ago • 5 comments

This will have to be split into multiple issues, but at its core, opening a very large repository in Gittyup is problematic. Most notably, the indexer will run in the background for an extensive amount of time. My use case and workflow involve working with Yocto / Poky and Linux repositories for multiple boards. Yocto / Poky is workable, since the indexing does eventually finish, but for the Linux kernel I quite frankly use a different git client. In any case, it's not a laptop-friendly activity.

Having looked at the code for a bit, I think the general architecture here is unlikely to scale well. In particular, running a full-blown indexer over everything probably isn't a good solution for very large repositories. Indexing the diffs and doing a run-time look-up for the commits is probably far faster overall. Similarly, the idea that the entire search result is found in one go is maybe not so scalable. Searching from newest to oldest and presenting the results as they come might provide acceptable responsiveness without requiring a massive index.
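To make the "present results as they come" idea concrete, here is a minimal self-contained sketch. The `Commit` struct and `streamingSearch` function are hypothetical stand-ins, not Gittyup's actual types; in the real code the commits would come from a RevWalk iterated newest-to-oldest, and `onMatch` would feed the UI.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical commit record; in Gittyup this would come from a RevWalk.
struct Commit {
  std::string id;
  std::string message;
};

// Walk commits newest-to-oldest and report each hit as soon as it is found,
// instead of building a complete index up front.
void streamingSearch(const std::vector<Commit> &newestFirst,
                     const std::string &needle,
                     const std::function<void(const Commit &)> &onMatch) {
  for (const Commit &c : newestFirst) {
    if (c.message.find(needle) != std::string::npos)
      onMatch(c); // the UI can render this result immediately
  }
}
```

The trade-off is that a full result set is never materialised, so "how many matches in total" is only known once the walk completes, but the first results appear almost instantly even on huge histories.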

That said though, maybe supporting the Linux kernel isn't worth considering. From that point of view, and keeping things mostly as they are, I see the following issues:

  1. As mentioned in #868, read and write performance is lacking due to the short reads and writes. The read side is fixed by #873, which leaves the writing to be solved.

  2. The code, and to a large extent libgit2 itself, doesn't seem to perform particularly well on a multi-core system. Throwing more than 4 cores at the indexer doesn't improve performance at all, since the threads just end up hammering a read-write lock in the cache system (see git_cache_get_any and the underlying code).

  3. At least when testing with large repositories, it seems libgit2 is starved of cache. Doubling the cache size and object limit leads to pretty substantial gains. If the system has plenty of free RAM, bumping up the limits at run-time would be an option to consider.

  4. A lot of what's happening is actually single-threaded, which you can see while it's running, since it toggles between using multiple cores and one core. One processing loop consists of the following sequential stages:

    • Single-threaded iteration over RevWalk to buffer up new commits
    • Feeding the buffer into QtConcurrent::mappedReduced, which spreads the mapping and reduction work over multiple threads and in turn hammers the read-write lock mentioned in point 2
    • Writing the output from QtConcurrent::mappedReduced to disk

    All three activities could very well happen in parallel, so adopting some form of producer-consumer pattern would be helpful here. Some prototyping I did yesterday indicates this alone could double the speed, and I suspect that with a bit of refinement the memory usage could be lowered as well.

  5. After analysing the code a bit, I realise that the writing to disk has quadratic time complexity, given that the files are written from scratch each time. This will eventually become the predominant slow-down. Writing deltas to the file would be ideal, but seems quite non-trivial to implement.
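The producer-consumer shape suggested in point 4 could be built around a small bounded queue so the RevWalk producer, the mapping workers, and the disk writer overlap instead of running strictly one after another. This is a minimal self-contained sketch, not Gittyup's actual code; the real pipeline would carry commit/diff batches rather than plain values.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Minimal bounded MPMC queue: producers block when full, consumers block
// when empty, and close() lets consumers drain and then stop cleanly.
template <typename T> class BoundedQueue {
public:
  explicit BoundedQueue(std::size_t cap) : cap_(cap) {}

  void push(T v) {
    std::unique_lock<std::mutex> lk(m_);
    notFull_.wait(lk, [&] { return q_.size() < cap_ || closed_; });
    q_.push(std::move(v));
    notEmpty_.notify_one();
  }

  // Returns false once the queue is closed and fully drained.
  bool pop(T &out) {
    std::unique_lock<std::mutex> lk(m_);
    notEmpty_.wait(lk, [&] { return !q_.empty() || closed_; });
    if (q_.empty())
      return false;
    out = std::move(q_.front());
    q_.pop();
    notFull_.notify_one();
    return true;
  }

  void close() {
    std::lock_guard<std::mutex> lk(m_);
    closed_ = true;
    notEmpty_.notify_all();
    notFull_.notify_all();
  }

private:
  std::queue<T> q_;
  std::size_t cap_;
  bool closed_ = false;
  std::mutex m_;
  std::condition_variable notEmpty_, notFull_;
};
```

Chaining two such queues (walker → mappers, mappers → writer) gives the three-stage overlap described above; the bounded capacity also caps memory usage, since the walker cannot run arbitrarily far ahead of the writer.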

AHSauge avatar Aug 07 '25 17:08 AHSauge

I'd like to come back to using Gittyup not just for small repos but also some fairly large ones; game projects with Unity can get pretty massive, with a large number of files from plugins and external dependencies and a large number of commits, not just larger files.

So I'd take any performance gain there is; some of those points might be easier to fix.

XeonG avatar Aug 08 '25 16:08 XeonG

> I'd like to come back to using Gittyup not just for small repos but also some fairly large ones; game projects with Unity can get pretty massive, with a large number of files from plugins and external dependencies and a large number of commits, not just larger files.
>
> So I'd take any performance gain there is; some of those points might be easier to fix.

Yeah, and there are also some OOM issues with the indexer that could probably be improved, especially if the writing gets closer to linear time complexity. Right now the indexer rewrites the entire file whenever the index files are updated. I've not even tried to see how this looks when you fetch and pull, but I can only assume it also shows a quadratic increase in execution time.
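The quadratic-vs-linear write cost can be made concrete with a little arithmetic. This is an illustrative model only, assuming each index update contributes a fixed `delta` bytes of new data; the functions and names are hypothetical, not Gittyup code.

```cpp
#include <cstddef>

// Total bytes written to disk after n index updates when the whole file is
// rewritten from scratch on every update: delta + 2*delta + ... + n*delta,
// i.e. O(n^2) in the number of updates.
std::size_t bytesFullRewrite(std::size_t n, std::size_t delta) {
  std::size_t total = 0;
  for (std::size_t i = 1; i <= n; ++i)
    total += i * delta;
  return total;
}

// Total bytes written when each update only appends its own delta: O(n).
std::size_t bytesDeltaAppend(std::size_t n, std::size_t delta) {
  return n * delta;
}
```

For 100 updates of 1 unit each, the full-rewrite strategy writes 5050 units while the append strategy writes 100, and the gap only widens with history length, which is why the rewrite eventually becomes the predominant slow-down.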

Aside from fixing libgit2 itself, I've got some ideas about how to solve these issues. Testing with the Yocto / Poky repo and rewriting the code to be more multi-thread friendly, I see about a 60% decrease in execution time. That's without solving the writing issues.

For the write time, I'm thinking I'll eventually try dumping the deltas into a separate file and then doing one pass over it to rewrite the actual index files. That might be quite a non-trivial job, but should have linear complexity in the end. That said, it might be better to cave in and rely on an existing solution for this, but I've yet to find a maintained C/C++ library that does inverted indexing (which is what the current indexer implements). If someone happens to know of one, that would be an interesting thing to explore.
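The two-phase write proposed above (append deltas, then one compaction pass) can be sketched in a few lines. The types here are hypothetical stand-ins: a vector plays the role of the on-disk delta file, and the compaction folds each posting into the inverted index exactly once, so the pass is linear in the number of postings.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// A posting maps a search term to the commit it occurs in. During indexing,
// postings are only appended to the delta log (cheap, O(1) each); nothing
// is rewritten until compaction.
using Posting = std::pair<std::string, int>; // term -> commit number
using InvertedIndex = std::map<std::string, std::vector<int>>;

// Single pass over the delta log builds the inverted index: each term maps
// to the list of commits containing it.
InvertedIndex compact(const std::vector<Posting> &deltaLog) {
  InvertedIndex index;
  for (const Posting &p : deltaLog)
    index[p.first].push_back(p.second);
  return index;
}
```

On disk, the same shape would mean appending postings to a delta file during indexing and periodically merging them into the real index files in one sequential pass, instead of rewriting those files on every update.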

AHSauge avatar Aug 08 '25 18:08 AHSauge

@XeonG If you're able to compile Gittyup yourself, you can try out https://github.com/AHSauge/Gittyup/tree/flavour/ahsauge. I merged in some improved multi-threading (#877), plus some file write improvements I've yet to make a pull request for. It should more than double the performance, but eventually it will also be bound by the file writing.

Do you know of any publicly available repositories that are particularly affected by this? I was thinking of cloning the Unreal Engine again, but if you know of other large (yet not Linux-sized) repos, I can try them out as well.

AHSauge avatar Aug 09 '25 10:08 AHSauge

Tempted to give it a go and try building it, but me and build chains infuriate each other with all the things that go wrong and need figuring out. 'Tis why I just stick to Unity and use their cross-platform support; it just works.

The UE source is probably a good-sized repo, though.

Hopefully @Murmele will get a windows release out soon though.

XeonG avatar Aug 09 '25 16:08 XeonG

I tried with the UE source, but after 45 minutes I gave up running it. To me it seems there are some pretty severe performance issues in libgit2, which could be linked to https://github.com/libgit2/libgit2/issues/3027. Testing with the UE source, no threads max out, yet there's no immediate indication of locking issues either. Even when I artificially cap things to provoke the diff-grabbing thread into maxing out, it still doesn't, so this is really a mystery to me right now. My best guess is that git_diff_tree_to_tree doesn't perform well in conjunction with large files, which is supported by both https://github.com/libgit2/libgit2/issues/3027 and https://github.com/libgit2/libgit2/issues/7074.

Aside from fixing libgit2, the only way to circumvent this problem is to extract the diff by running the git CLI. That on its own suggests there is a way to run diff faster, but given the age of those issues, I assume some really non-trivial optimisations are needed.
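The CLI escape hatch could be built around a small capture helper using POSIX `popen`. This is an illustrative sketch (the helper name is made up, and argument quoting, error reporting, and patch parsing are all omitted); the point is just that shelling out to git is straightforward where libgit2's diff is too slow.

```cpp
#include <cstdio>
#include <stdexcept>
#include <string>

// Run a shell command and capture its stdout as a string.
// POSIX-only sketch; real code should also check the exit status
// returned by pclose() and quote/validate the arguments.
std::string runCommand(const std::string &cmd) {
  FILE *pipe = popen(cmd.c_str(), "r");
  if (!pipe)
    throw std::runtime_error("popen failed");
  std::string out;
  char buf[4096];
  std::size_t n;
  while ((n = fread(buf, 1, sizeof(buf), pipe)) > 0)
    out.append(buf, n);
  pclose(pipe);
  return out;
}
```

A caller would then do something like `runCommand("git diff-tree -p <old> <new>")` and parse the resulting patch text, trading the in-process libgit2 diff for a subprocess per commit pair.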

AHSauge avatar Aug 09 '25 22:08 AHSauge