Reduce Repository Size
Description of bug
The Substrate repository has gotten very big, so that other crates/repos depending on it need to download lots of data when building. It would be great to investigate how to reduce the size so clones can be faster.
Steps to reproduce
For example, clone https://github.com/open-web3-stack/open-runtime-module-library and run cargo test: it pulls in a patched version of Substrate, which is then downloaded in full.
The root problem is actually cargo, more specifically https://github.com/rust-lang/cargo/issues/1171
Because cargo doesn't support shallow clones, it downloads the whole repo instead. A shallow clone takes a few seconds; a full clone takes ~15 minutes for us in CI. As mentioned in that thread, the root issue is a libgit2 limitation, and gitoxide (a Git implementation in Rust) is not yet feature-rich enough to replace it.
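To make the shallow vs. full difference concrete, here is a minimal sketch using plain git (not cargo). It builds a small throwaway repository locally, so the paths, file names, and commit identity are all assumptions for illustration and say nothing about how cargo fetches internally.

```shell
#!/bin/sh
set -eu

# Throwaway working area; every path below is illustrative.
work=$(mktemp -d)

# Build a small stand-in "upstream" repository with three commits.
git init -q "$work/upstream"
for i in 1 2 3; do
    echo "revision $i" > "$work/upstream/file.txt"
    git -C "$work/upstream" add file.txt
    git -C "$work/upstream" -c user.name=demo -c user.email=demo@example.com \
        commit -q -m "commit $i"
done

# Full clone: the whole history is transferred.
git clone -q "file://$work/upstream" "$work/full"

# Shallow clone: only the tip commit is transferred.
# (file:// is needed here; --depth is ignored for plain local paths.)
git clone -q --depth 1 "file://$work/upstream" "$work/shallow"

full_count=$(git -C "$work/full" rev-list --count HEAD)
shallow_count=$(git -C "$work/shallow" rev-list --count HEAD)
echo "full clone commits:    $full_count"
echo "shallow clone commits: $shallow_count"
```

On a repository with tens of thousands of commits and large historical blobs, that gap is what separates the few-second clone from the ~15 minute one.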
Sure, the root problem is cargo. But since that might take a while to get fixed, it would IMO be good to investigate which mitigations can be applied to the repo size. E.g. I remember talk that the substrate and polkadot repos got a lot heavier because the docs live in branches that keep growing.
See git-sizer run here (the repo is 43GB :warning: ):
Processing blobs: 1004964
Processing trees: 248480
Processing commits: 24336
Matching commits to trees: 24336
Processing annotated tags: 38
Processing references: 1270
| Name | Value | Level of concern |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size | | |
| * Blobs | | |
| * Total size | 43.1 GiB | **** |
| | | |
| Biggest objects | | |
| * Trees | | |
| * Maximum entries [1] | 5.08 k | ***** |
| * Blobs | | |
| * Maximum size [2] | 60.5 MiB | ****** |
| | | |
| Biggest checkouts | | |
| * Number of directories [3] | 32.7 k | **************** |
| * Maximum path depth [4] | 12 | * |
| * Maximum path length [4] | 189 B | * |
| * Number of files [5] | 592 k | *********** |
| * Total size of files [6] | 9.14 GiB | ********* |
[1] e20c999e130527d0e60a15629a5997d4dc95cc68 (refs/remotes/origin/gh-pages:crate-docs/libc)
[2] b53d9a26bc65fb3465cb78639f0f90d939fbda65 (902a8ccf81aa3543f2aa6b455360efdcd9a790a2:substrate/state-machine/core)
[3] 4ac0009a28b5eb2894e8b7f1704e99b559faf230 (edb401e7f56a58ec9c62274e34763e8d7ec54d6a^{tree})
[4] 5ea3249b2c8491566eafc36a25984bc51ea5bc5a (refs/remotes/origin/gh-pages^{tree})
[5] af51f47d134a1eda17e4ad651528acba659ecd81 (68c6deea68f58da7f63686f28bf1609e68dcfd44^{tree})
[6] 5e00508c9e93c0444107cee65715d3f8a9c86af9 (542dc2f477a11a2c45a396ec6a10bf8a80f2cad3^{tree})
CC @TriplEight
gh-pages should be moved to another repo, or at least changed so that each publish overwrites a single commit instead of pushing a new one each time. I tried changing the script to do so a while ago, but could not test it on my machine. Some of the generated doc file paths differ only by case, which makes that branch impossible to work with on a case-insensitive file system such as APFS.
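A rough sketch of the "overwrite a single commit" idea, for illustration only: this is not the actual publishing script. The local bare repository standing in for the remote, the index.html file, and the publish function are all hypothetical. Each publish builds a parentless commit from the current files and force-pushes it over gh-pages, so the branch never accumulates history.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
git init -q --bare "$work/remote.git"   # stand-in for the real remote
git init -q "$work/docs"                # stand-in for the docs build checkout
cd "$work/docs"
git remote add origin "$work/remote.git"

# Hypothetical publish step: snapshot the working tree as ONE parentless
# commit and force-push it, replacing whatever gh-pages pointed to before.
publish() {
    echo "docs build $1" > index.html   # pretend the docs were regenerated
    git add -A
    tree=$(git write-tree)
    commit=$(git -c user.name=demo -c user.email=demo@example.com \
        commit-tree "$tree" -m "docs build $1")
    git push -q --force origin "$commit:refs/heads/gh-pages"
}

publish 1
publish 2

# However many times we publish, the branch holds exactly one commit.
pages_commits=$(git -C "$work/remote.git" rev-list --count gh-pages)
echo "commits on gh-pages: $pages_commits"
```

The replaced commits become unreachable rather than vanishing immediately; whether and when the hosting side garbage-collects them is outside the script's control.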
Ack, thanks. I've seen the same problem with the Polkadot repo. A simple git clone downloads 1.6 GB at the moment. I'll see what I can do.
We really need a solution to this. I am running cargo check, it is downloading repos, and I have already finished my coffee while it is still downloading.
What I ended up doing is just using our fork of the Substrate repo with most branches (including gh-pages) removed. This helped a lot with build times.
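For anyone wanting to try the same workaround, pruning a branch from a fork might look roughly like this. It is a sketch against a local bare repository standing in for the fork; the branch layout and file names are made up, and it is not necessarily how the fork above was prepared.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
git init -q --bare "$work/fork.git"     # stand-in for the fork's remote
git init -q "$work/clone"
cd "$work/clone"
git remote add origin "$work/fork.git"

# Seed the fork with a code branch and a docs branch.
echo "code" > lib.rs
git add lib.rs
git -c user.name=demo -c user.email=demo@example.com commit -q -m "code"
git branch gh-pages
git push -q origin HEAD:refs/heads/master gh-pages

# Drop the heavy branch from the fork.
git push -q origin --delete gh-pages

# Only master should be left on the remote.
heads=$(git ls-remote --heads origin | awk '{ print $2 }')
echo "$heads"
```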
Any news on this?
@paritytech/ci could anyone look at this?
Linking https://github.com/github/git-sizer
It has some good suggestions on how to go about reducing repo size.
Recreated gh-pages from scratch as a quickfix. Need to investigate what else can be removed.
It seems that recreating gh-pages reduced size to 8 GB, so now it looks like this:
git-sizer output
Processing blobs: 235799
Processing trees: 462208
Processing commits: 72647
Matching commits to trees: 72647
Processing annotated tags: 46
Processing references: 9164
| Name | Value | Level of concern |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size | | |
| * Commits | | |
| * Count | 72.6 k | |
| * Total size | 42.7 MiB | |
| * Trees | | |
| * Count | 462 k | |
| * Total size | 192 MiB | |
| * Total tree entries | 5.42 M | |
| * Blobs | | |
| * Count | 236 k | |
| * Total size | 8.25 GiB | |
| * Annotated tags | | |
| * Count | 46 | |
| * References | | |
| * Count | 9.16 k | |
| * Branches | 553 | |
| * Tags | 117 | |
| * Remote-tracking refs | 3 | |
| * Pull request refs | 8.49 k | |
| * Other | 1 | |
| | | |
| Biggest objects | | |
| * Commits | | |
| * Maximum size [1] | 154 KiB | *** |
| * Maximum parents [2] | 3 | |
| * Trees | | |
| * Maximum entries [3] | 307 | |
| * Blobs | | |
| * Maximum size [4] | 60.5 MiB | ****** |
| | | |
| History structure | | |
| * Maximum history depth | 6.87 k | |
| * Maximum tag depth [5] | 1 | |
| | | |
| Biggest checkouts | | |
| * Number of directories [6] | 4.00 k | ** |
| * Maximum path depth [7] | 12 | * |
| * Maximum path length [8] | 181 B | * |
| * Number of files [6] | 23.3 k | |
| * Total size of files [6] | 1.02 GiB | * |
| * Number of symlinks [9] | 5 | |
| * Number of submodules [10] | 1 | |
[1] 12b306d0c9b641d99ddf8024940a5687c284ae6d (refs/pull/7044/head)
[2] 342e03514ee6029d501cefacc487decda00af5ea
[3] 319d09c0fb9d7b951ffd5daaf07db93fb5e8beb8 (refs/heads/gh-pages:crate-docs/kitchensink_runtime)
[4] b53d9a26bc65fb3465cb78639f0f90d939fbda65 (902a8ccf81aa3543f2aa6b455360efdcd9a790a2:substrate/state-machine/core)
[5] aa730731c075a93eaed64fe3c8057a509c8de6a8 (refs/tags/ci-release-2.0.0-alpha.5+3)
[6] be86c137eb44f354247d250feca9be16d02a67ef (refs/heads/gh-pages^{tree})
[7] 92c0ffd01f7d2dac7e3328ff7be84d4d765dc18d (08de8b323232821b7df6e830e203ee8102ba3437^{tree})
[8] 615cdf042f9a186776ff8dabd9b24178403d7ef1 (0132128ed55a44210f0431bdc15275b2b06470fb^{tree})
[9] 78787c6427e8a298fd5814a1ae94a08005b450c2 (refs/pull/1002/head^{tree})
[10] f5696c5b02f9d0b1c320b5106e93bfdce2553121 (refs/pull/9847/head:frame)
How I calculated it:
mkdir substrate
git clone --mirror https://github.com/paritytech/substrate.git substrate/.git
cd substrate
git config --unset core.bare
git checkout master
for branch in $(git for-each-ref --format='%(refname:short)' refs/heads); do git checkout "$branch"; git checkout master; done
git-sizer --verbose
Other heavy objects need investigation
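As a possible starting point for that investigation, here is a sketch of a common pipeline for listing the largest blobs anywhere in history. It runs against a tiny throwaway repository; the file names and sizes are invented for the demo. On the real repo only the pipeline in the middle would be needed.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"

# Two demo files: one tiny, one 100000 bytes.
printf 'small\n' > small.txt
head -c 100000 /dev/zero > big.bin
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "add files"

# List every object reachable from any ref, ask git for its type and size,
# keep only blobs, and sort by size, largest first.
# Note: the awk step keeps only the first word of a path containing spaces.
largest=$(git rev-list --objects --all \
    | git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' \
    | awk '$1 == "blob" { print $2, $3 }' \
    | sort -rn \
    | head -n 10)
echo "$largest"
```

The sizes reported are uncompressed object sizes, so they indicate where to look rather than the exact on-disk or on-the-wire cost.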