Automate git repository maintenance operations
**Is your feature request related to a problem? Please describe.**
OpenNeuro datasets often have a fairly low rate of new commits after the initial upload, but some upload patterns can create many new git objects, and we never automatically run git repack or git gc. For datasets with very large file counts, this can lead to severe performance problems, bottlenecked by I/O scans of the loose git objects.
**Describe the solution you'd like**
Other git-based hosting services tend to run git gc and git repack at automatic commit intervals (GitLab documents its housekeeping behavior). We tend to have larger commits and fewer of them, so I suspect it would be good to run at least an incremental repack whenever a dataset has new commits that haven't been repacked. gc and a full repack shouldn't be required after most changes, but we may want to schedule them when the file count for a commit exceeds a threshold.
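The incremental trigger could look like the sketch below. Everything here is illustrative, not OpenNeuro code: the throwaway repository, the single commit standing in for an upload, and the loose-object check afterwards.

```shell
#!/bin/sh
set -e

# Illustrative stand-in for a dataset repository with fresh commits.
REPO=$(mktemp -d)
git init --quiet "$REPO"
cd "$REPO"
echo data > file.txt
git add file.txt
git -c user.email=ci@example.com -c user.name=ci commit --quiet -m "upload"

# Incremental repack: pack loose objects into a new pack and delete
# the now-redundant loose copies; existing packs are left alone.
git repack -d --quiet

# After the repack, no loose objects should remain.
echo "loose objects: $(git count-objects | awk '{print $1}')"
```

An actual implementation would run this from the worker that processes new commits, skipping datasets whose tips are already packed.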
**Additional context**
ds004186 had 100k object files and 200MB of git metadata after upload, leading to git tree reads near or exceeding OpenNeuro's 60-second timeout. A git repack and git gc reduced this to a single pack file and 19MB of metadata, and reading a tree dropped to ~2 seconds.
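For reference, the full maintenance pass described above amounts to the two commands below. The throwaway repository and commit loop are assumptions for the sake of a runnable sketch; on a real dataset these would run against the dataset's repository path.

```shell
#!/bin/sh
set -e

# Illustrative repository with a few commits' worth of objects.
REPO=$(mktemp -d)
git init --quiet "$REPO"
cd "$REPO"
for i in 1 2 3; do
  echo "$i" > "file$i.txt"
  git add "file$i.txt"
  git -c user.email=ci@example.com -c user.name=ci commit --quiet -m "commit $i"
done

# Full repack: rewrite all reachable objects into a single pack and
# delete redundant packs and loose objects.
git repack -a -d --quiet

# gc prunes unreachable objects, packs refs, and expires reflogs.
git gc --quiet

echo "pack files: $(ls .git/objects/pack/*.pack | wc -l | tr -d ' ')"
```

This is the expensive path, so it fits better on a schedule or a file-count threshold than on every commit.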
We should add git annex fsck, and possibly git fsck if it is not run implicitly.
If fsck is unacceptably long-running, `--incremental`/`--more` and `--time-limit` can break the work into conveniently schedulable chunks. `--time-limit` just raises an exception, so I believe we can also interrupt it by sending a SIGINT when convenient, and it will save its place.
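A chunked check could be scheduled roughly as follows. Since git-annex may not be installed everywhere, the runnable part of this sketch uses plain git fsck on a throwaway repository; the git-annex invocations (time limits, scheduling) are shown as comments and are assumptions about how we would wire it up, not tested commands.

```shell
#!/bin/sh
set -e

# Illustrative repository to check.
REPO=$(mktemp -d)
git init --quiet "$REPO"
cd "$REPO"
echo data > f.txt
git add f.txt
git -c user.email=ci@example.com -c user.name=ci commit --quiet -m "upload"

# One-shot integrity and connectivity check of the git object store.
git fsck --no-progress && echo "fsck: clean"

# git-annex equivalent, resumable in chunks from a scheduler
# (hypothetical cron/worker cadence and time limit):
#   git annex fsck --incremental --time-limit=15m   # start a pass
#   git annex fsck --more --time-limit=15m          # resume where it left off
```

The `--more` invocation picks up from the saved position, so a periodic job eventually covers the whole annex without any single long-running process.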
https://git-annex.branchable.com/git-annex-fsck/