
Repository file size

Open kalapi opened this issue 2 years ago • 9 comments

Problem description

Hi @arrowtype, this is not a bug report, more like a meta issue with the repository itself. I was doing some winter housekeeping on my Mac and found that the Recursive repository is 3.98 GB! And this isn't because of a file (or group of files) in the current state of the repo but probably because of past commits in the hidden .git folder. Honestly I don't know what these files represent. I wonder if you have any insight?

Expected behavior

No effect on font binaries

Screenshots

[Screenshot 2022-12-17 at 5 00 04 PM]

To Reproduce

I use a Mac app called Grand Perspective which helps visualise large data blocks on the hard drive. One of the larger chunks was the Recursive repository folder and I was really surprised.

[Screenshot 2022-12-17 at 5 04 35 PM]

Environment (please complete the following information):

  • OS: macOS Catalina 10.15.7
  • Browser: N/A
  • Fonts: N/A
  • (Pulled all latest commits on main)

Additional context

N/A

kalapi avatar Dec 18 '22 01:12 kalapi

Thanks for the detailed report!

Hmm, one possibility is that it's just many rounds of many UFO font sources, each with many small files.

I'm not really a git wizard. Do you (or anyone else) have any suggestions of what to do to trim down the size of git repos?

arrowtype avatar Dec 18 '22 01:12 arrowtype

I have no idea how to fix this. I'm going to do some reading and try to figure it out.

kalapi avatar Dec 18 '22 06:12 kalapi

Okay I found a possible solution. I'm documenting the process here but will make a fork of the repo and try it out there. If everything has worked as desired, I'll open a pull request.

My working hypothesis is that the offending files seem to be

  1. A .sketch file which was prototyping the Noordzij cube (src/proofs/final-specimen/create-noordzij-cube-6_sides.sketch)
  2. A .zip file in the fonts directory containing all the binaries at some point

While the files themselves don't add up to a couple of gigabytes, every changed version of them stays in the history, so several commits touching them could multiply their footprint.

Process:

  • Run sh <path-to>/FindBlobs.sh. This will identify files in history above a certain byte size.

  • Once the files have been identified, run the following for each file or folder name:

    • git filter-branch --index-filter 'git rm -r --cached --ignore-unmatch <folder/file name>' --prune-empty
  • Once filtering has been completed run all the following commands in sequence:

    • git for-each-ref --format='delete %(refname)' refs/original | git update-ref --stdin
    • git reflog expire --expire=now --all
    • git gc --aggressive --prune=now
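The steps above can be rehearsed end to end in a throwaway repository before touching a real one. This is a minimal sketch, not a tested recipe for the Recursive repo: the file name big.zip and the temporary directory are made up, and two things are additions that do not appear in the steps above. The trailing `-- --all` asks filter-branch to rewrite every branch (without it, only the current branch is rewritten), and FILTER_BRANCH_SQUELCH_WARNING=1 skips the warning that recent Git versions print, which points to git-filter-repo as the recommended replacement.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # addition: skip filter-branch's warning and pause

# Throwaway repository with one large-ish file that is later deleted.
work=$(mktemp -d)
git init --quiet "$work/demo"
cd "$work/demo"
git config user.name "Demo"
git config user.email "demo@example.com"

echo "hello" > README.md
head -c 200000 /dev/urandom > big.zip        # hypothetical offending file
git add .
git commit --quiet -m "Add README and big.zip"
blob=$(git rev-parse HEAD:big.zip)           # remember the blob id for the final check

git rm --quiet big.zip
git commit --quiet -m "Remove big.zip"

# The file is gone from the working tree but still reachable in history.
before=$(git rev-list --objects --all | grep -c 'big.zip')

# Rewrite history without the file; "-- --all" (addition) covers all branches.
git filter-branch --index-filter \
  'git rm -r --cached --ignore-unmatch big.zip' --prune-empty -- --all

# Drop the backup refs and reflog entries, then garbage-collect.
git for-each-ref --format='delete %(refname)' refs/original | git update-ref --stdin
git reflog expire --expire=now --all
git gc --quiet --aggressive --prune=now

after=$(git rev-list --objects --all | grep -c 'big.zip' || true)
if git cat-file -e "$blob" 2>/dev/null; then gone=no; else gone=yes; fi
echo "reachable before: $before, after: $after, blob deleted: $gone"
```

Rewriting history changes commit IDs, so on a shared repository every collaborator would need to re-clone or reset after a forced push.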

Sources:

  • https://stackoverflow.com/questions/10622179/how-to-find-identify-large-commits-in-git-history
  • https://stackoverflow.com/questions/11050265/remove-large-pack-file-created-by-git
  • Also try https://stackoverflow.com/questions/2100907/how-to-remove-delete-a-large-file-from-commit-history-in-the-git-repository

Contents of FindBlobs.sh

The awk command keeps only blobs whose on-disk (compressed) size is at least 2^25 bytes (33,554,432 bytes, i.e. 32 MiB)

git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize:disk) %(rest)' |
  sed -n 's/^blob //p' |
  sort --numeric-sort --key=2 |
  awk '$2 >= 2^25'
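To see what the last two stages of the script do without needing a repository, the sketch below feeds them a few hand-written lines in the same "object id, size, path" layout that the sed stage emits. The object IDs, sizes, and paths are invented for illustration.

```shell
# Made-up sample of "<object id> <size on disk> <path>" lines.
sample='aaa111 1024 README.md
bbb222 41943040 fonts/all-fonts.zip
ccc333 33554432 src/proofs/cube.sketch
ddd444 33554431 docs/specimen.pdf'

# Same sort and awk stages as FindBlobs.sh: order by size, keep >= 2^25 bytes.
large_blobs=$(printf '%s\n' "$sample" |
  sort --numeric-sort --key=2 |
  awk '$2 >= 2^25')

printf '%s\n' "$large_blobs"
```

Only the two entries at or above the threshold survive, smallest first. Lowering the exponent (for example 2^20 for 1 MiB) is one way to widen the search when nothing shows up.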

kalapi avatar Dec 18 '22 08:12 kalapi

Hi @arrowtype I tried a couple of things but they didn't work. I think you might need to talk to someone who has a deeper understanding of version control to help out with this.

Looking at this problem, as a matter of policy, we've made a change (internally within Universal Thirst) to not upload font binaries, PDFs, images and other binaries in new repos.

Feel free to close this and apologies for the bother :)

kalapi avatar Dec 23 '22 05:12 kalapi

Of course I cannot speak for @arrowtype, but if I were the author of this repo, I would not find anything about this thread that was a bother or needs apology. I think many people would be genuinely curious and interested to see if there is something that can be done about the unexpectedly large size.

Edit:

I guess 4 GB including history is not so bad, judging by this document about building a different font:

Note that this repo has a 30+ GB commit history. If you only want the current files and future changes, you can avoid downloading so much by cloning the repo with a --depth limit.

It then goes on to give an example of a limited-depth clone command for that font. So, it would seem the size of the (main, official) Recursive repo is actually not all that unexpected or even unusually large. And maybe the normal thing to do for people who want to play around with a local clone is to not download all the history.

jkyeung avatar Feb 04 '23 18:02 jkyeung

For people who don't need the whole repo history (likely everyone other than @arrowtype), use a shallow clone.

Get only the last 20 commits across all branches:

git clone --depth=20 --no-single-branch https://github.com/arrowtype/recursive.git

Only the most recent commit and only the main branch:

git clone --depth=1 https://github.com/arrowtype/recursive.git

This shallowest clone is best for one-off or consume-only copies, because some git commands will not work later, e.g. switching to a branch that was created before you cloned.
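The effect of --depth can be checked locally without downloading anything. A minimal sketch, assuming a made-up three-commit repository in a temporary directory; the file:// URL is used because Git ignores --depth when cloning from a plain local path.

```shell
set -e
work=$(mktemp -d)

# Stand-in "remote": a small repository with three commits.
git init --quiet "$work/src"
for n in 1 2 3; do
  echo "revision $n" > "$work/src/file.txt"
  git -C "$work/src" add file.txt
  git -C "$work/src" -c user.name=Demo -c user.email=demo@example.com \
    commit --quiet -m "Commit $n"
done

# Shallow clone that keeps only the most recent commit.
git clone --quiet --depth=1 "file://$work/src" "$work/shallow"

full_count=$(git -C "$work/src" rev-list --count HEAD)
shallow_count=$(git -C "$work/shallow" rev-list --count HEAD)
is_shallow=$(git -C "$work/shallow" rev-parse --is-shallow-repository)
echo "source commits: $full_count, shallow clone commits: $shallow_count, shallow: $is_shallow"
```

If the full history turns out to be needed later, `git fetch --unshallow` in the shallow clone downloads the rest.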

Disclaimer: I'm not all that git proficient, but learned this trick out of necessity when working with a 20 year old project with very long clone times.

maphew avatar Mar 05 '23 04:03 maphew

Hey, thanks so much for the feedback and insights, @kalapi, @jkyeung, and @maphew! Sorry I've been slow to respond, but I truly appreciate it.

I will try to add the depth tip to the readme, for people that may see it.

I suspect the thing that might take the most time in downloading is that in this repo, I have been committing changes to "build prepped" UFOs, and I have done many, many builds. Each UFO is thousands of tiny files, so I think that probably stacks up a lot. I now put such prepped sources in the .gitignore file of font repos.
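As a rough illustration of that .gitignore approach, the sketch below ignores a folder of prepped UFOs in a scratch repository. The path sources/prepped/ and the glyph file are hypothetical, not the layout of the Recursive repo.

```shell
set -e
work=$(mktemp -d)
git init --quiet "$work/fontrepo"
cd "$work/fontrepo"

# Hypothetical build-prepped UFO with a single glyph file.
mkdir -p sources/prepped/Demo.ufo/glyphs
echo "<glyph/>" > sources/prepped/Demo.ufo/glyphs/a.glif

# Ignore the whole prepped folder.
echo "sources/prepped/" >> .gitignore

# check-ignore echoes the path back when an ignore rule matches it.
ignored=$(git check-ignore sources/prepped/Demo.ufo/glyphs/a.glif)

# "git add ." now stages .gitignore but skips the ignored folder.
git add .
staged=$(git ls-files)
echo "ignored: $ignored"
echo "staged: $staged"
```

A .gitignore entry only affects untracked files. Folders that were already committed stay tracked until they are removed from the index with `git rm -r --cached <path>`, and their past versions remain in history either way.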

I'll try to look further into the resources posted by @kalapi to test out pruning some of those unnecessary sources.

arrowtype avatar Mar 05 '23 13:03 arrowtype

@maphew you explained things well! Is it alright if I basically copy-paste your advice into the readme?

arrowtype avatar Mar 05 '23 13:03 arrowtype

@maphew you explained things well! Is it alright if I basically copy-paste your advice into the readme?

Oh, yes of course. Except, I have to recant, it's not good advice post 2020!

Use a partial clone instead:

git clone --filter=blob:none (url)

Shallow clones (using the --depth parameter) should only be used in "use and discard" scenarios such as Continuous Integration pipelines. Shallow is still the fastest cloning mechanism but unreliable for any later git work other than git pull.
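The difference between the two clone types can be compared side by side on a local stand-in repository. This is a sketch under stated assumptions: the three-commit repository is made up, and the two uploadpack settings are only there so that a local file:// "server" accepts --filter and on-demand blob requests (the linked GitHub post indicates GitHub supports partial clone without any client-side setup).

```shell
set -e
work=$(mktemp -d)

# Stand-in "remote" with three commits.
git init --quiet "$work/src"
for n in 1 2 3; do
  echo "revision $n" > "$work/src/file.txt"
  git -C "$work/src" add file.txt
  git -C "$work/src" -c user.name=Demo -c user.email=demo@example.com \
    commit --quiet -m "Commit $n"
done

# Assumption for the local demo: let the source repo serve filtered clones.
git -C "$work/src" config uploadpack.allowFilter true
git -C "$work/src" config uploadpack.allowAnySHA1InWant true

# Partial clone: all commits and trees, file contents fetched on demand.
git clone --quiet --filter=blob:none "file://$work/src" "$work/partial"
# Shallow clone for comparison: only the latest commit.
git clone --quiet --depth=1 "file://$work/src" "$work/shallow"

partial_commits=$(git -C "$work/partial" rev-list --count HEAD)
shallow_commits=$(git -C "$work/shallow" rev-list --count HEAD)
checked_out=$(cat "$work/partial/file.txt")
echo "partial: $partial_commits commits, shallow: $shallow_commits commit(s)"
echo "partial working tree has: $checked_out"
```

The partial clone keeps the complete commit history, so log, branch switching, and similar commands keep working, while older file contents are downloaded only when a command actually needs them.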

Further reading: https://github.blog/2020-12-21-get-up-to-speed-with-partial-clone-and-shallow-clone/

maphew avatar Mar 06 '23 05:03 maphew