Reduce the size of the DeepForest git repo history.
I had to clone the full git history from @ethanwhite's fork the other day and it came in at nearly 1 GB. A shallow clone is much smaller than that. Clearly we have some bad history in there and we need to do some management.
The .git folder is definitely huge at 1.1 GB. Some of this is old files that are no longer in HEAD, and some of it is large files that we are currently using (most notably several 20+ MB images in docs).
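For anyone who wants to check the number on their own clone, here is a minimal sketch using only stock git. The function name `git_history_size` is made up for illustration.

```shell
# Print the size of loose and packed objects for the repo in the current directory.
# "size-pack" is usually the number that dominates the size of a full clone.
git_history_size() {
  git count-objects -vH | grep -E '^(size|size-pack):'
}
```

Running it before and after any cleanup gives a quick before/after comparison.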
I can get us down to ~100 MB, which feels more reasonable, but it does require eliminating docs images from the history. To do that we would first replace those files with much smaller versions, but give them different names, and then do the filter-repo work to get us down to ~100 MB.
The bad news is that doing any of this requires completely rewriting the history for the whole repo. This will break all existing forks and branches (including all open PRs and all in-progress work). I would need to read up on its impact on tags, but from a quick experiment it looks like it will keep the tags by associating them with the new hashes, which means the filtered files will no longer be present at those tags (which makes sense).
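One way to check the tag behavior on a test clone is to record which commit each tag points at, and whether a given file is present at a tag, before and after the rewrite. This is a rough sketch with plain git; the function names are hypothetical.

```shell
# Print "tag commit" for every tag, so the list can be saved before the
# rewrite and compared with the list afterwards.
list_tag_commits() {
  git tag -l | while read -r tag; do
    printf '%s %s\n' "$tag" "$(git rev-parse --short "${tag}^{commit}")"
  done
}

# Succeed if the given path exists in the tree of the given tag.
tag_has_path() {
  [ -n "$(git ls-tree -r --name-only "$1" -- "$2")" ]
}
```

If the experiment above holds, the hashes from `list_tag_commits` should all change after filtering, and `tag_has_path` should fail for any filtered file.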
So, I certainly think there's a reasonable argument for doing this, but it will need a long ramp up where we:
- Warn folks with forks about what is happening
- Wrap up all ongoing work and merge or close all PRs
- Stop all work
- Do the filtering and push the smaller version
- Let folks know how to update their forks so that they can return to work.
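For the last step, the instructions for fork owners could look roughly like the sketch below. It assumes the main repo is configured as a remote named `upstream` and the default branch is `main`; both names are assumptions, and the function name is made up. It throws away any local commits on that branch that are not in the rewritten history.

```shell
# Reset a local branch to the rewritten history on the given remote.
# WARNING: discards local commits on that branch that are not on the remote.
sync_to_rewritten_upstream() {
  remote="${1:-upstream}"   # assumed remote name for the main repo
  branch="${2:-main}"       # assumed default branch name
  git fetch --prune --tags --force "$remote" &&
  git checkout -q "$branch" &&
  git reset -q --hard "$remote/$branch"
}

# Afterwards the fork on GitHub can be updated with a force push, e.g.:
#   git push --force origin main
```

Deleting the fork and re-forking is the simpler alternative for anyone without unmerged work.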
Here are some useful code chunks for doing this work:
Get the biggest files in the history
git rev-list --objects --all |
git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
sed -n 's/^blob //p' |
sort --numeric-sort --key=2 |
cut -c 1-12,41- |
$(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest
Get the biggest files in the history not in HEAD
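This is a one-line addition to the above. In my testing it only sort of worked: some files that were in HEAD still appeared. That is probably because the filter matches blob hashes rather than paths, so older versions of a file that still exists in HEAD are still listed.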
git rev-list --objects --all |
git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
sed -n 's/^blob //p' |
grep -vF --file=<(git ls-tree -r HEAD | awk '{print $3}') |
sort --numeric-sort --key=2 |
cut -c 1-12,41- |
$(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest
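The filter above compares blob hashes, so an old version of a file that still exists in HEAD passes through. A possible variant filters on the path instead, sketched below. The function name is hypothetical, sizes are left in raw bytes, and paths with unusual characters (which git may quote) are not handled.

```shell
# List blobs in history whose *path* no longer exists in HEAD, smallest first.
# Output format: <hash> <size in bytes> <path>
big_files_not_in_head() {
  head_paths=$(mktemp)
  git ls-tree -r --name-only HEAD > "$head_paths"
  git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  sed -n 's/^blob //p' |
  awk -v paths="$head_paths" '
    BEGIN { while ((getline line < paths) > 0) keep[line] = 1 }
    {
      path = $0
      sub(/^[^ ]+ [^ ]+ /, "", path)   # strip hash and size, leaving the path
      if (!(path in keep)) print
    }' |
  sort -k2,2n
  rm -f "$head_paths"
}
```

Note that a file that was renamed and is still in use under its new name will show up under its old path.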
Use git-filter-repo to remove a directory
git-filter-repo --path directory_to_be_removed --invert-paths
Use git-filter-repo to remove multiple files
git-filter-repo --path path/to/file1.ext --path path/to/file2.ext --invert-paths
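When there are many files to remove, listing them in a file and passing it with git-filter-repo's `--paths-from-file` option may be easier than repeating `--path`. The sketch below only builds the list; the function name, threshold, and `big-paths.txt` are placeholders. Any path on the list is removed from all of history, including the current version in HEAD, so the list should be reviewed by hand first.

```shell
# Print the paths of all blobs in history larger than a byte threshold,
# one per line, deduplicated.
paths_larger_than() {
  min_bytes="$1"
  git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  awk -v min="$min_bytes" '$1 == "blob" && $3 + 0 > min + 0 {
    path = $0
    sub(/^[^ ]+ [^ ]+ [^ ]+ /, "", path)   # strip type, hash and size
    print path
  }' |
  sort -u
}

# Usage (rewrites history, so try it on a fresh clone first):
#   paths_larger_than 10000000 > big-paths.txt
#   git filter-repo --paths-from-file big-paths.txt --invert-paths
```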
Run git maintenance just to make sure things are fully optimized
git maintenance run