
Download large repos faster using sparse checkouts with partial clones

Open hmarr opened this issue 3 years ago • 5 comments

Provide a detailed description of the proposed feature

Provide an option to perform a partial clone with a sparse checkout to limit time it takes to fetch content from large git repositories.

Specifically, we can use the following git commands to fetch and check out a subset of a repository:

git clone --no-checkout --filter=blob:none https://github.com/google/fonts
cd fonts
git config core.sparseCheckout true
echo ofl/ibmplexmono > .git/info/sparse-checkout
git checkout HEAD

This has been discussed before, and one of the issues that came up was that the git sparse-checkout command is relatively new and marked as experimental. However, sparse checkouts still work with older versions of git; they're just a little less pleasant to use. I've verified that the commands above work with git 2.20, which appears to be the version shipped with macOS 10.15 (the oldest version Homebrew supports).
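The commands above can be wrapped in a small reusable function. This is only a sketch: the function name `sparse_partial_clone` and its argument order are hypothetical, and it is not the `fetch.sh` script timed later in this issue, which is not shown here.

```shell
# Hypothetical helper: sparse_partial_clone <repo-url> <target-dir> <path-in-repo>
# Uses only the config-based sparse checkout, so it does not rely on the
# newer `git sparse-checkout` command.
sparse_partial_clone() {
  # Fetch commits and trees but no blobs, and leave the working tree empty.
  git clone --no-checkout --filter=blob:none "$1" "$2" || return 1
  # Turn on sparse checkout for this clone.
  git -C "$2" config core.sparseCheckout true || return 1
  # Restrict the working tree to the requested path.
  mkdir -p "$2/.git/info" || return 1
  printf '%s\n' "$3" > "$2/.git/info/sparse-checkout" || return 1
  # Populate the working tree; only blobs under the sparse path are fetched.
  git -C "$2" checkout HEAD
}
```

For the example in this issue, the call would be `sparse_partial_clone https://github.com/google/fonts fonts ofl/ibmplexmono`.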

I'm happy to try putting together a PR for this if it'd be something you'd be interested in accepting, but I figured I'd start with an issue so we could discuss appetite and approach.

What is the motivation for the feature?

Some formulae download content from large git repositories. The example I ran into recently was a font in the homebrew-cask-fonts repo. On my machine, cloning the full repo (google/fonts) took 1 minute 39 seconds and used 3.5 GB of disk space:

$ time git clone https://github.com/google/fonts
Cloning into 'fonts'...
remote: Enumerating objects: 58690, done.
remote: Total 58690 (delta 0), reused 0 (delta 0), pack-reused 58690
Receiving objects: 100% (58690/58690), 1.44 GiB | 19.82 MiB/s, done.
Resolving deltas: 100% (31536/31536), done.
Checking out files: 100% (12128/12128), done.

real	1m39.989s
user	1m34.537s
sys	0m37.132s

$ du -sm fonts
3451	fonts

However, the files needed by the cask are less than 2 MB:

$ du -sm fonts/ofl/ibmplexmono
2	fonts/ofl/ibmplexmono

Currently, this cask works around this problem by using the SVN download strategy, which in turn uses GitHub's subversion proxy. These days, macOS doesn't come with subversion installed by default, so you get prompted to install svn via homebrew. That seems like an unnecessary dependency, and some people have run into issues installing subversion on recent versions of macOS. Additionally, a subversion proxy isn't a standard feature of git hosts.

Using a partial clone with a sparse checkout, we can get the same benefit (fetching just the subset of the repository that we need) using vanilla git. On my machine this took 3 seconds and used 10 MB of disk space, which is a big improvement over fetching the full repository.

$ time bash fetch.sh
Cloning into 'fonts'...
remote: Enumerating objects: 24853, done.
remote: Counting objects: 100% (7/7), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 24853 (delta 0), reused 4 (delta 0), pack-reused 24846
Receiving objects: 100% (24853/24853), 5.38 MiB | 22.10 MiB/s, done.
Resolving deltas: 100% (15013/15013), done.
remote: Enumerating objects: 18, done.
remote: Counting objects: 100% (8/8), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 18 (delta 0), reused 0 (delta 0), pack-reused 10
Receiving objects: 100% (18/18), 732.11 KiB | 3.66 MiB/s, done.
remote: Enumerating objects: 1, done.
remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 1
Receiving objects: 100% (1/1), 72 bytes | 72.00 KiB/s, done.
Your branch is up to date with 'origin/main'.

real	0m3.073s
user	0m2.148s
sys	0m0.284s

$ du -sm fonts
10	fonts

$ ls fonts/ofl/ibmplexmono/
DESCRIPTION.en_us.html		  IBMPlexMono-Light.ttf		IBMPlexMono-SemiBoldItalic.ttf
IBMPlexMono-Bold.ttf		  IBMPlexMono-LightItalic.ttf	IBMPlexMono-Thin.ttf
IBMPlexMono-BoldItalic.ttf	  IBMPlexMono-Medium.ttf	IBMPlexMono-ThinItalic.ttf
IBMPlexMono-ExtraLight.ttf	  IBMPlexMono-MediumItalic.ttf	METADATA.pb
IBMPlexMono-ExtraLightItalic.ttf  IBMPlexMono-Regular.ttf	OFL.txt
IBMPlexMono-Italic.ttf		  IBMPlexMono-SemiBold.ttf	upstream.yaml
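
As a rough check on the scale of the improvement, the two sets of measurements in this issue can be compared directly. The function name `print_savings` is hypothetical; the figures are copied from the `time` and `du -sm` output in this issue, and will vary by machine and network.

```shell
# Compare the full clone against the sparse partial clone.
print_savings() {
  LC_ALL=C awk 'BEGIN {
    full_secs = 99.989;  sparse_secs = 3.073   # "real" times: 1m39.989s vs 0m3.073s
    full_mb = 3451;      sparse_mb = 10        # du -sm results
    printf "time: %.1fx faster\n", full_secs / sparse_secs
    printf "disk: %.1fx smaller\n", full_mb / sparse_mb
  }'
}

print_savings
# time: 32.5x faster
# disk: 345.1x smaller
```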

This would let us maintain the benefits of the current SVN download strategy.

How will the feature be relevant to at least 90% of Homebrew users?

It'll reduce the dependency on svn for fetching large repos. The fonts casks are probably the most notable example of using svn for this optimisation, so at the very least it should mean that all users who install fonts are no longer required to install svn.

What alternatives to the feature have been considered?

  • Sticking with svn (downsides described in the motivation section)
  • Fetching the full repo (downsides also described in the motivation section)
  • Using shallow clones to speed things up further (these are expensive for GitHub, so I avoided mentioning them)

hmarr, Apr 25 '22 15:04

> I'm happy to try putting together a PR for this if it'd be something you'd be interested in accepting, but I figured I'd start with an issue so we could discuss appetite and approach.

Yeh, for the scope of doing this to replace those that use a SVN download strategy: this makes sense to me!

MikeMcQuaid, Apr 26 '22 09:04

> macOS 10.15 (which is the oldest version homebrew supports).

Note that the download strategies should continue to work as far back as 10.10, though only system git back to 10.12 is actually used. That doesn't mean older versions need to get the benefits of sparse checkouts; they just should not break or error (we can use Utils::Git.version checks where needed).

On Linux (not a concern for Cask, but is if we change the global download strategy), we support Git 2.7.0 and later, so that's actually a bigger range than macOS.
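
The kind of version gate described above can be sketched in shell (Homebrew itself would use Ruby and `Utils::Git.version`; this is only an illustration of the logic). The names `version_ge`, `git_supports_sparse`, and `MIN_SPARSE_GIT` are hypothetical, and the 2.20.0 threshold is an assumption based on the version verified earlier in this issue, not a documented minimum.

```shell
# Assumed threshold: the git version reported as verified in this issue.
MIN_SPARSE_GIT="2.20.0"

# Exit status 0 when dotted version $1 >= $2, comparing the first three
# components numerically (so 2.7.0 < 2.20.0, unlike a string comparison).
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    split(a, x, "."); split(b, y, ".")
    for (i = 1; i <= 3; i++) {
      if (x[i] + 0 > y[i] + 0) exit 0
      if (x[i] + 0 < y[i] + 0) exit 1
    }
    exit 0
  }'
}

# Hypothetical gate: succeeds only when the installed git is new enough.
git_supports_sparse() {
  installed="$(git --version 2>/dev/null | awk '{print $3}')"
  [ -n "$installed" ] && version_ge "$installed" "$MIN_SPARSE_GIT"
}
```

A caller could fall back to a full clone whenever `git_supports_sparse` fails, so older systems keep working without the speed-up.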

Bo98, Apr 26 '22 13:04

Thanks both! I'll pick this up when I get a sec. And thanks for the pointer on the macOS versions @Bo98 – I hadn't spotted that.

hmarr, Apr 27 '22 16:04

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

github-actions[bot], May 19 '22 00:05

Easy there stalebot, I'll get the pull request finished soon!

hmarr, May 23 '22 14:05

A laudable effort!

mrienstra, Sep 25 '22 19:09

I addressed the changes requested and opened a new PR as the old one got (rightfully!) closed out by stalebot.

hmarr, Oct 22 '22 15:10

Closed by https://github.com/Homebrew/brew/pull/14035

apainintheneck, Dec 04 '22 22:12