Reduce the size of the repo

Open tmcclintock opened this issue 6 years ago • 12 comments

Currently, a fresh clone of CCL yields an 86 MB repository. This is far too large for a code base that only contains theory code. The src, include and pyccl directories are 1.2 MB in total. doc is 25 MB, and pulling down other branches can inflate CCL to well over 100 MB. Since we are developing the paper, can we get rid of the ccl_note? Additionally, it might be worth getting rid of the pre-compiled doxygen outputs and just including directions on how to build them.

Along these same lines, there are a ton of dead branches. Can we delete all branches that haven't been updated in, say, a year?

tmcclintock avatar May 25 '18 00:05 tmcclintock

Let's start with the old branches and any figures we don't need anymore (I'm sure there are old ones there). I think the note should stay: it has different info from the paper and we often go back to it. Let's see where we stand afterwards, and if needed we'll get rid of the doxygen outputs.

elisachisari avatar May 25 '18 13:05 elisachisari

How are we doing on this these days? Are we happy with the current size of the repo? I can do another dead-branch-purge if we want.

c-d-leonard avatar Mar 20 '20 13:03 c-d-leonard

I think we should live with the current size

damonge avatar Feb 01 '21 18:02 damonge

Will close this (since we didn't see any opposition >1 year ago).

damonge avatar Jun 15 '22 11:06 damonge

I just had to clone the repo a couple of times and with the .git directory being ~180 MB, this is getting a bit of a hassle.

tilmantroester avatar Jun 15 '22 12:06 tilmantroester

But has it actually grown much recently? We stopped storing the note pdf, which as far as I know was the thing that was increasing its size significantly.

(I had to clone class earlier today, and it was a ~600 MB ordeal)

damonge avatar Jun 15 '22 12:06 damonge

That I don't know. CCL itself is about 60 MB, most of it in the benchmark and doc directories. How much of the 177 MB in .git is the master history and how much of it is stale branches might be good to find out.

tilmantroester avatar Jun 15 '22 13:06 tilmantroester

OK, let's reopen this with the aim of cleaning up the stale branches. Could the following devs check if the branches below are stale and should be deleted? If we don't receive an answer within the next ~month we will purge them.

@c-d-leonard : https://github.com/LSSTDESC/CCL/tree/ccl_paper_archived, https://github.com/LSSTDESC/CCL/tree/pr/825

@beckermr : https://github.com/LSSTDESC/CCL/tree/readthedocs_development, https://github.com/LSSTDESC/CCL/tree/cluster-count

@jablazek : https://github.com/LSSTDESC/CCL/tree/fastpt, https://github.com/LSSTDESC/CCL/tree/pt_debug

@matthewkirby : https://github.com/LSSTDESC/CCL/tree/concentration-mass-child

@jeremyneveu : https://github.com/LSSTDESC/CCL/tree/angpow4ccl

@sukhdeep2 : https://github.com/LSSTDESC/CCL/tree/correlation_issue743

@mishakb : https://github.com/LSSTDESC/CCL/tree/debug_benchmark

@anicola : https://github.com/LSSTDESC/CCL/tree/eff_hm

@t-ferreira : https://github.com/LSSTDESC/CCL/tree/omega_k_check

@vitenti : https://github.com/LSSTDESC/CCL/tree/add_gha_support

damonge avatar Jun 15 '22 13:06 damonge

Cleaning up branches won't really help. Those are pointers to commits. You'll need to do more to remove old junk.

beckermr avatar Jun 15 '22 13:06 beckermr

OK, I was naive. @c-d-leonard what did you mean by "dead-branch-purge" above?

(cleaning up the stale branches still seems like a good idea to me, so I think my previous message stands)

damonge avatar Jun 15 '22 13:06 damonge

See this Stack Overflow answer: https://stackoverflow.com/questions/45426297/how-can-i-know-if-git-gc-auto-has-done-something/64077241#64077241

Or see this: https://rtyley.github.io/bfg-repo-cleaner/

However, be super careful with these tools. You can permanently damage the repo this way, which is not good.

beckermr avatar Jun 15 '22 13:06 beckermr

Deleting stale branches won't affect the repo size, and for the most part neither will deleting files from HEAD alone. Branches are merely pointers to specific commits made at some point in history. Because the files existed in those commits, the only way to remove them is to rewrite the commit history of the repo.
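
To illustrate why, here is a toy model of the idea, not real git internals. The commit names, branch names and sizes are all made up: each commit records its parent and the size of what it introduced, and a branch is just a name pointing at a commit. Whatever is reachable from any remaining branch has to stay in storage.

```python
# Toy model: commits form a chain/graph, branches are names pointing at commits.
# All names and sizes here are hypothetical, chosen only for illustration.

# commit -> (parent commit or None, size in MB of the files it introduced)
commits = {
    "c1": (None, 1.0),
    "c2": ("c1", 25.0),  # e.g. a large PDF was added here
    "c3": ("c2", 0.5),   # the PDF is deleted from the tree, but c2 still holds it
    "c4": ("c2", 0.2),   # a side commit that was never merged
}

branches = {"master": "c3", "merged-feature": "c2", "abandoned": "c4"}


def reachable(refs):
    """Return the set of commits reachable from any of the given refs."""
    seen = set()
    for tip in refs.values():
        while tip is not None and tip not in seen:
            seen.add(tip)
            tip = commits[tip][0]
    return seen


def stored_size(refs):
    """Total size (MB) of everything that must be kept for these refs."""
    return sum(commits[c][1] for c in reachable(refs))


before = stored_size(branches)
# Drop a branch whose commits are already part of master: nothing is freed.
after_merged = stored_size(
    {name: tip for name, tip in branches.items() if name != "merged-feature"}
)
# Drop every branch except master: only the unmerged side commit goes away.
after_all = stored_size({"master": branches["master"]})

print(f"all branches:        {before:.1f} MB")
print(f"merged branch gone:  {after_merged:.1f} MB")
print(f"only master left:    {after_all:.1f} MB")
```

In this toy setup the 25 MB file stays in storage for as long as master's history includes the commit that added it, regardless of how many branches are deleted. In a real repository, unreachable objects are also only dropped once garbage collection runs.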

I analysed some of the big files that live (or have lived) in the repo, and below I show what we can probably get rid of. The current codebase is 62.5 MB but a full clone is 237.7 MB, which means 175.2 MB is taken up by history. I propose a way to drastically reduce the repo size, from the current 237.7M to a mere 30.3M (-87%), on the principle that we keep in the repo only what is actually useful.

  1. pyccl/doc/: Deleting just the .pdf from the history will only save 23M. However, if we can completely remove doc/ (since I don't think anyone ever navigates to it, let alone compiles it), we can shave off 121.4M, about half of the repo's current size.
  2. pyccl/examples/: These largely live in CCLX now, so they just waste space in the repo here. Removing them will shave off an additional 57M.
  3. pyccl/benchmarks/data/: Benchmarks can be very large in size, and it can only get worse the more we add in the (hopefully near) future. We can host them on an LSST website and make it so that when pytest is run they are automatically downloaded (via conftest.py - that's a simple thing to do). This can save us another 28.9M.
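
For the third point, a minimal sketch of the download-on-demand idea could look like the following. Everything named here is an assumption for illustration: the URL is a placeholder (no actual host has been chosen), and `fetch_benchmark`, `BENCHMARK_URL` and `CACHE_DIR` are hypothetical names, not existing CCL code.

```python
import urllib.request
from pathlib import Path

# Placeholder location; no real hosting site has been agreed on.
BENCHMARK_URL = "https://example.org/ccl-benchmarks"
# Local directory where downloaded benchmark files are cached.
CACHE_DIR = Path("benchmarks") / "data"


def fetch_benchmark(name, cache_dir=CACHE_DIR, base_url=BENCHMARK_URL,
                    download=urllib.request.urlretrieve):
    """Return the local path of a benchmark file, downloading it on first use.

    `download` is injectable so the logic can be exercised without a network.
    """
    target = Path(cache_dir) / name
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        download(f"{base_url}/{name}", str(target))
    return target


# In conftest.py this could be exposed to tests as a session-scoped fixture,
# for example (assuming pytest is installed):
#
#     @pytest.fixture(scope="session")
#     def benchmark_file():
#         return fetch_benchmark
```

A test would then ask for a file by its relative name and receive a local path, with the download happening only the first time. A checksum on each downloaded file would be a sensible addition, but is left out of this sketch.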

Here is the full breakdown of the sizes. Code refers to the codebase on HEAD, Bare refers to just the refs (git commit history), Full is code+bare, and Nobj is the total (code+bare) number of files associated with a particular path.

Breakdown                 Full      Bare     Code     Nobj
==========================================================
CCL/                    237.7M    175.2M    62.5M    28114
↪ `.../ccl_paper.pdf`    23.0M     22.8M     0.2M       62
↪ `./doc/`              121.4M     98.9M    22.4M     7268
↪ `./examples/`          57.1M     47.6M     9.5M      744
↪ `./benchmarks/data/`   28.9M      0.9M    28.0M      256
----------------------------------------------------------
CCL_reduced/             30.3M     27.8M     2.6M    19846
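
For anyone who wants to produce a similar per-directory breakdown on their own clone, here is one possible sketch. It is not necessarily how the numbers above were obtained. It relies on `git rev-list --objects --all` piped into `git cat-file --batch-check`, and sums the on-disk size of every blob in the history by top-level directory. The function names are made up for this example.

```python
import subprocess
from collections import defaultdict


def list_history_objects(repo="."):
    """List every object reachable from any ref as 'type size path' lines."""
    rev_list = subprocess.run(
        ["git", "-C", repo, "rev-list", "--objects", "--all"],
        check=True, capture_output=True, text=True,
    ).stdout
    # %(objectsize:disk) is the packed size on disk; %(rest) echoes the path
    # that rev-list printed after the object name.
    return subprocess.run(
        ["git", "-C", repo, "cat-file",
         "--batch-check=%(objecttype) %(objectsize:disk) %(rest)"],
        input=rev_list, check=True, capture_output=True, text=True,
    ).stdout.splitlines()


def size_by_top_level(lines):
    """Sum blob sizes (bytes) by top-level path component."""
    totals = defaultdict(int)
    for line in lines:
        parts = line.split(" ", 2)
        # Skip commits and trees; only blobs carry file contents.
        if len(parts) < 3 or parts[0] != "blob":
            continue
        size, path = int(parts[1]), parts[2]
        top = path.split("/", 1)[0] + ("/" if "/" in path else "")
        totals[top] += size
    return dict(totals)


if __name__ == "__main__":
    totals = size_by_top_level(list_history_objects("."))
    for top, size in sorted(totals.items(), key=lambda kv: -kv[1])[:10]:
        print(f"{size / 1e6:8.1f}M  {top}")
```

Each blob is counted once, under the first path at which `git rev-list` encounters it, so a file that was moved between directories is only attributed to one of them.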

I created an exact replica of CCL (all branches + tags + releases + PRs) in a private repo, and spent some time writing a script that very carefully makes those changes, using the bfg tool (plus some additional improvements). I then verified that everything in the repo still works as expected, and that users with an old clone could keep working on their PRs with no problem. The script lives here (cleanup.sh), with detailed instructions on how to use it. Should you want to test it yourself, reset.sh enables you to create your own CCL replica in another repo. The repo where I tested it is here.

I am adding this as a v3 milestone so we are able to ship a clean repo with the new release.

nikfilippas avatar Mar 29 '23 11:03 nikfilippas