Reduce the size of the DeepForest git repo history.

Open bw4sz opened this issue 2 years ago • 2 comments

I had to clone the full git history from @ethanwhite's fork the other day and it came in at nearly 1 GB. A shallow clone is much smaller than that. Clearly we have some bad history in there and we need to do some management.

bw4sz avatar Mar 21 '23 17:03 bw4sz

The .git folder is definitely huge at 1.1 GB. Some of this is old files that are no longer in HEAD, and some is large files that we are currently using (most notably several 20+ MB images in docs).

I can get us down to ~100 MB, which feels more reasonable, but it does require eliminating docs images from the history. To do that we would first replace those files with much smaller versions, but give them different names, and then do the filter-repo work to get us down to ~100 MB.

The bad news is that doing any of this requires completely rewriting the history for the whole repo. This will break all existing forks and branches (including all open PRs and all in-progress work). I would need to read up on its impact on tags, but from a quick experiment it looks like it will keep the tags by associating them with the new hashes, which means the filtered files will no longer be present (which makes sense).
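One way to check the effect on tags is to record which commit each tag points to before and after the rewrite and compare the two listings. The following is a minimal sketch under that assumption; the helper name `tag_commits` and the output file names are illustrative only.

```shell
# Print "<tag> <commit-hash>" for every tag in the repository at $1
# (default: current directory). `tag_commits` is a hypothetical helper name.
tag_commits() {
  repo="${1:-.}"
  git -C "$repo" tag --list | while read -r tag; do
    # ^{commit} peels annotated tags down to the commit they point at
    printf '%s %s\n' "$tag" "$(git -C "$repo" rev-parse "refs/tags/${tag}^{commit}")"
  done
}

# Example usage (file names are placeholders):
#   tag_commits . > tags-before.txt    # run before filtering
#   tag_commits . > tags-after.txt     # run after filtering
#   diff tags-before.txt tags-after.txt
```

If the tag names match in both listings but the hashes differ, the tags were carried over to the rewritten commits.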

So, I certainly think there's a reasonable argument for doing this, but it will need a long ramp-up where we:

  1. Warn folks with forks about what is happening
  2. Wrap up all ongoing work and merge or close all PRs
  3. Stop all work
  4. Do the filtering and push the smaller version
  5. Let folks know how to update their forks so that they can return to work.
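For step 5, the instructions for fork owners could look roughly like the sketch below. It assumes remotes named `upstream` (the main repo) and `origin` (the fork) and a branch called `main`; those names and the helper `resync_fork` are assumptions, not something settled in this thread. Note that `git reset --hard` discards uncommitted changes, so this should only be run on a clean working tree.

```shell
# Sketch: bring a fork's branch in line with the rewritten upstream history.
# Assumes remotes "upstream" (main repo) and "origin" (your fork);
# `resync_fork` is a hypothetical helper name. Run inside your local clone.
resync_fork() {
  branch="${1:-main}"
  # Fetch the rewritten branches; --force lets rewritten tags replace local ones
  git fetch --force --tags upstream || return 1
  git checkout "$branch" || return 1
  # Keep a pointer to the old history in case unmerged work needs rescuing
  git branch "backup-$branch" || return 1
  # Replace the local branch with the rewritten history (drops local changes)
  git reset --hard "upstream/$branch" || return 1
  # Overwrite the branch on the fork
  git push --force origin "$branch"
}

# Example usage:
#   resync_fork main
```

The backup branch keeps the old, large objects alive in the local clone, so it would need to be deleted once nothing on it is still needed.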

ethanwhite avatar Apr 22 '23 19:04 ethanwhite

Here are some useful code chunks for doing this work:

Get the biggest files in the history

git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  sed -n 's/^blob //p' |
  sort --numeric-sort --key=2 |
  cut -c 1-12,41- |
  $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

Get the biggest files in the history not in HEAD

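This is a one-line addition to the above. In my experience it only sort of worked: some files that are in HEAD still appeared, probably because the filter matches blob hashes, so older versions of a file that is still in HEAD (which have different hashes) are not excluded.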

git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  sed -n 's/^blob //p' |
  grep -vF --file=<(git ls-tree -r HEAD | awk '{print $3}') |
  sort --numeric-sort --key=2 |
  cut -c 1-12,41- |
  $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest
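A possible way around files in HEAD still showing up is to filter by path instead of by blob hash, so that every historical version of a path that still exists in HEAD is dropped from the listing. This is only a sketch of that idea, not the approach used above; the helper name `blobs_not_in_head` is made up for illustration, and sizes are printed in raw bytes to avoid depending on `numfmt`.

```shell
# Sketch: list blobs whose *path* no longer exists in HEAD, smallest first.
# `blobs_not_in_head` is a hypothetical helper name.
blobs_not_in_head() {
  repo="${1:-.}"
  head_paths=$(mktemp)
  # Paths currently in HEAD; quotePath=false keeps non-ASCII names unescaped
  git -C "$repo" -c core.quotePath=false ls-tree -r --name-only HEAD > "$head_paths"
  git -C "$repo" rev-list --objects --all |
    git -C "$repo" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '
      # First input (the temp file): remember every path present in HEAD
      FILENAME == ARGV[1] { in_head[$0] = 1; next }
      # Second input (stdin): keep blobs whose path is not in HEAD
      $1 == "blob" {
        size = $2
        path = $0
        sub(/^[^ ]+ [^ ]+ /, "", path)   # strip the "blob <size> " prefix
        if (!(path in in_head)) print size "\t" path
      }
    ' "$head_paths" - |
    sort -n
  rm -f "$head_paths"
}

# Example usage:
#   blobs_not_in_head . | tail -n 20   # the 20 largest such blobs
```

The trade-off is that large old versions of files that are still in HEAD (for example, oversized images that were later shrunk under the same name) are hidden by this filter, so it complements the full listing rather than replacing it.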

Use git-filter-repo to remove a directory

git-filter-repo --path directory_to_be_removed --invert-paths

Use git-filter-repo to remove multiple files (paths are relative to the repository root, with no leading slash)

git-filter-repo --path path/to/file1.ext --path path/to/file2.ext --invert-paths

Run git maintenance just to make sure things are fully optimized

git maintenance run
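To confirm that the filtering actually shrank the repository, the packed size can be compared before and after. A minimal sketch, assuming `git count-objects -v` output is an acceptable measure; the helper name `pack_size_kib` is hypothetical.

```shell
# Print the total size of the repository's packfiles in KiB.
# `pack_size_kib` is a hypothetical helper name.
pack_size_kib() {
  # `git count-objects -v` reports "size-pack: <KiB>" among other fields;
  # loose (unpacked) objects are reported separately under "size:"
  git -C "${1:-.}" count-objects -v | awk '$1 == "size-pack:" { print $2 }'
}

# Example usage:
#   before=$(pack_size_kib .)
#   ... run the filtering and maintenance steps ...
#   after=$(pack_size_kib .)
#   echo "packs went from ${before} KiB to ${after} KiB"
```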

ethanwhite avatar Apr 22 '23 20:04 ethanwhite