Large repo size: unwanted .pack files

Open leecalcote opened this issue 2 years ago • 17 comments

Current Behavior The layer5 repo is over 2GB in size due to unwanted .pack files in .git/objects/pack.

Desired Situation A smaller repo size.


Contributor Resources

The layer5.io website uses Gatsby, React, and GitHub Pages. Site content is found under the master branch.

leecalcote avatar Sep 07 '21 12:09 leecalcote

hi @leecalcote can I take this on?

Jordan-Rob avatar Sep 07 '21 15:09 Jordan-Rob

sure @Jordan-Rob, sorry missed the comment.

warunicorn19 avatar Sep 15 '21 06:09 warunicorn19

@leecalcote @warunicorn19 I don't think those pack files are committed to the repo. They are required as part of Git's object database: https://git-scm.com/book/en/v2/Git-Internals-Git-Objects

Ref: https://stackoverflow.com/questions/49535201/pack-file-remove-it-in-git

adithyaakrishna avatar Sep 25 '21 07:09 adithyaakrishna

ohh okay, so a big NO NO on deleting the .git/objects/pack files.

warunicorn19 avatar Sep 25 '21 08:09 warunicorn19

Yep, and I don't think it would matter as .git is not committed to the repo

adithyaakrishna avatar Sep 25 '21 14:09 adithyaakrishna

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Nov 09 '21 18:11 stale[bot]

Hey @warunicorn19 @adithyaakrishna, can you please check out https://github.com/18F/C2/issues/439? It seems like we can reduce the unwanted .pack files.

Aju100 avatar Dec 05 '21 02:12 Aju100

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jan 20 '22 05:01 stale[bot]

@adithyaakrishna @warunicorn19 any insights on this? And on @Aju100's approach?

Chadha93 avatar Jan 20 '22 05:01 Chadha93

@leecalcote @Chadha93 I want to take up this issue if no one is working on it rn.

Abhijay007 avatar Jan 25 '22 08:01 Abhijay007

All yours @Abhijay007

Chadha93 avatar Jan 25 '22 12:01 Chadha93

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Mar 12 '22 23:03 stale[bot]

This issue is being automatically closed due to inactivity. However, you may choose to reopen this issue.

stale[bot] avatar Mar 16 '22 04:03 stale[bot]

After the site is built, node_modules and .cache take up some space, but both of these directories are .gitignored. The .pack files in the .git directory are the culprit.

--- /layer5 -------------------------------------------------------------------------
    2.4 GiB [##############] /.git
    1.0 GiB [######        ] /node_modules
  510.5 MiB [##            ] /.cache
  367.0 MiB [##            ] /src
  271.9 MiB [#             ] /public
   52.2 MiB [              ] /static
    1.8 MiB [              ]  package-lock.json
  428.0 KiB [              ] /.github
  324.0 KiB [              ] /.devcontainer
  196.0 KiB [              ] /content-learn
   20.0 KiB [              ]  gatsby-node.js
   20.0 KiB [              ]  CONTRIBUTING.md
   20.0 KiB [              ] /.vscode
   16.0 KiB [              ]  gatsby-config.js
   12.0 KiB [              ]  LICENSE
   12.0 KiB [              ]  README.md
   12.0 KiB [              ] /.husky
    8.0 KiB [              ]  .DS_Store
    4.0 KiB [              ]  package.json
    4.0 KiB [              ]  fonts.css
    4.0 KiB [              ]  .eslintrc.js
    4.0 KiB [              ]  GOVERNANCE.md
    4.0 KiB [              ]  .gitignore
    4.0 KiB [              ]  Makefile
    4.0 KiB [              ]  root-wrapper.js
    4.0 KiB [              ]  CODE_OF_CONDUCT.md
    4.0 KiB [              ]  Makefile.show-help.mk
    4.0 KiB [              ]  .babelrc
    4.0 KiB [              ]  .eslintignore
    4.0 KiB [              ]  gatsby-browser.js
    4.0 KiB [              ]  script.sh
    4.0 KiB [              ]  gatsby-ssr.js
    4.0 KiB [              ]  .gitattributes
    4.0 KiB [              ]  .env.development
    4.0 KiB [              ]  CODEOWNERS
    4.0 KiB [              ]  CNAME
 Total disk usage:   4.6 GiB  Apparent size:   4.2 GiB  Items: 139,693
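
To put a number on the pack files alone (rather than the whole .git directory), `git count-objects -v` reports the packed size in KiB on its `size-pack` line. A minimal sketch; `pack_size_kib` is a hypothetical helper name, not something from this repo:

```shell
# Hypothetical helper: read `git count-objects -v` output on stdin and
# print only the size-pack value (KiB).
pack_size_kib() {
  awk -F': ' '$1 == "size-pack" { print $2 }'
}

# Usage inside a clone:
#   git count-objects -v | pack_size_kib
#   git count-objects -vH    # same report with human-readable units
```

Comparing this value before and after any cleanup attempt gives a quick check on whether the packs actually shrank.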

leecalcote avatar Jul 30 '22 19:07 leecalcote

Hey @leecalcote, @Chadha93, can I look into this issue?

AaqilKrishna avatar Aug 27 '22 16:08 AaqilKrishna

@leecalcote I looked through the .pack files to identify exactly which blobs are taking up so much space. As far as I can tell, the cause is the assets and media files we use, such as .png/.jpg/.mp4 files. Running git gc --aggressive helps only a bit. The most commonly recommended way to shrink the .pack files is to remove the media entries from the repo and store them elsewhere.
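
One way this kind of inspection could be reproduced is to list every object reachable from any ref and rank the blobs by size. This is only a sketch of a common approach, not necessarily the exact method used above; `rank_blobs` and `largest_blobs` are hypothetical names:

```shell
# Hypothetical helper: keep only blob records and rank them by size in bytes,
# largest first. Input lines look like "<type> <sha> <size> <path>".
rank_blobs() {
  grep '^blob ' | cut -d' ' -f3- | sort -rn | head -n "${1:-10}"
}

# Hypothetical helper: walk all objects reachable from any ref and print
# the N largest blobs as "<size> <path>" (N defaults to 10).
largest_blobs() {
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    rank_blobs "${1:-10}"
}
```

Running `largest_blobs 20` in a clone prints the 20 largest blobs across history, including files that were later deleted from the working tree.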

XDRAGON2002 avatar Sep 21 '22 19:09 XDRAGON2002

Thanks for looking into this, @XDRAGON2002. Yes, I agree. The size of the .git directory in the comment above reinforces this fact.

leecalcote avatar Sep 21 '22 20:09 leecalcote

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Nov 12 '22 14:11 stale[bot]

This issue is being automatically closed due to inactivity. However, you may choose to reopen this issue.

stale[bot] avatar Dec 16 '22 05:12 stale[bot]

FYI @randychilau

leecalcote avatar May 16 '23 08:05 leecalcote

Hi @leecalcote,

Unfortunately the process to reduce/clean a repo does not seem straightforward or well documented for a public repo like Layer5.

I have outlined what I believe to be the phases required for this task. Please let me know if there are any items missing, issues overlooked, or questions. It also seems this will require a fair amount of coordination and proper scheduling to execute, especially for the later phases.

I only have a basic understanding of git, so it would be great to have more experienced users review the information below.

Please include whoever else should be in this discussion.

Cheers, Randy


Note:

  • All of the repo changes can be tested on a clone and uploaded to a new Layer5 test repo for testing/review.

  • Using Git LFS seems to be a best practice for assets (e.g. image, video, zip files).

  • The big question is whether to upload the filtered clone and overwrite the existing repo (complex), or create a new repo to upload to (simple). Also there are logistics required in either case (e.g. issues, pull requests, comments, etc).

  • If you wish to upload a filtered clone to the existing repo, there are many considerations involved as described in the “DISCUSSION” section (points 4, 5, 6) of the filter-repo user manual. Here is one of them:

“People who cloned from the original repo will have old history. When they fetch the new history you force pushed up, unless they do a git reset --hard @{u} on their branches or rebase their local work, git will think they have hundreds or thousands of commits with very similar commit messages as what exist upstream (but which include files you wanted excised from history), and allow the user to merge the two histories, resulting in what looks like two copies of each commit. If they then push this history back up, then everyone now has history with two copies of each commit and the bad files have returned. You’re more likely to succeed in forcing people to get rid of the old history if they have to clone a new URL.”

  • Here is a glimpse at the potential final result for repo size:

[image: before_after]
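
For the Git LFS point in the notes above, a rough sketch of how existing media files might be moved into LFS on a test clone. The extension list (png, jpg, mp4) comes from the file types mentioned earlier in this thread; `lfs_include_pattern` is a hypothetical helper, and the `git lfs migrate import` call rewrites history, so it is shown only as a commented usage line:

```shell
# Hypothetical helper: build a comma-separated glob list such as
# "*.png,*.jpg" from bare file extensions.
lfs_include_pattern() {
  pattern=""
  for ext in "$@"; do
    pattern="${pattern:+$pattern,}*.$ext"
  done
  printf '%s\n' "$pattern"
}

# Usage on a test clone only (this rewrites history):
#   git lfs install
#   git lfs migrate import --everything --include="$(lfs_include_pattern png jpg mp4)"
```

Selecting files by extension is just one option; the exact set of file types would need to be agreed with the maintainers.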


Phase 1: Create a test filtered clone

  1. Remove all unused packages (using tools like depcheck, with an IDE search to double-check) and any files from the assets and static folders that are no longer used (e.g. zip files; confirm with maintainers for the actual clone).

  2. Remove untracked files and directories using git clean

  3. Install and configure Git LFS.

  4. Move all large files and/or specified file types to Git LFS (two methods).

  5. Utilize the filter-repo script which:

“Rapidly rewrite entire repository history using user-specified filters. This is a destructive operation which should not be used lightly; it writes new commits, trees, tags, and blobs corresponding to (but filtered from) the original objects in the repository, then deletes the original history and leaves only the new.”

-- Use `git clone --bare` for a [copy of layer5](https://docs.github.com/en/repositories/creating-and-managing-repositories/duplicating-a-repository) and fetch LFS objects
-- Run filter-repo script with `--analyze` flag, sample:

[image: filter-repo2]

-- Run filter-repo script with `--invert-paths --paths-from-file ./filter-repo/analysis/path-deleted-sizes.txt`

  6. Upload test filtered clone to a created Layer5 test repo.
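
The filter-repo part of Phase 1 could be sketched roughly as below. Because the rewrite is destructive, this sketch only prints the commands unless `DRY_RUN=0` is set. The helpers `run`, `analyze_clone`, and `strip_paths` are hypothetical names, the URL and directory are placeholders, and the paths file is assumed to list one path per line:

```shell
# Hypothetical wrapper: echo commands by default; set DRY_RUN=0 to execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# $1: repository URL (placeholder), $2: directory for the bare clone.
analyze_clone() {
  run git clone --bare "$1" "$2" &&
  run git -C "$2" lfs fetch --all &&
  run git -C "$2" filter-repo --analyze
}

# $1: bare clone directory, $2: absolute path to a file listing the paths
# to remove from history, one per line (assumed to be prepared by hand
# after reviewing the analysis output).
strip_paths() {
  run git -C "$1" filter-repo --invert-paths --paths-from-file "$2"
}
```

Keeping analysis and removal as separate steps leaves room to review the analysis report and agree on the path list before anything is rewritten.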

Phase 2: Review test filtered clone for functionality, GitHub Actions, history, etc


If the following are approved and decided:

  1. Process for creating the test filtered clone

  2. Test filtered clone functionality and history

  3. Where to upload the final clone (existing or new repo)


Phase 3: Get the current repo in a finalized state to create filtered clone

  1. All open pull requests should be either closed or merged.

“The git filter-repo tool and the BFG Repo-Cleaner rewrite your repository's history, which changes the SHAs for existing commits that you alter and any dependent commits. Changed commit SHAs may affect open pull requests in your repository. We recommend merging or closing all open pull requests before removing files from your repository.” (src)

  2. Notify all current and potential contributors that the repo will be undergoing maintenance and there will be no access or activity to the repo while going through this process.

Phase 4: Create filtered clone and upload to the decided location

  1. Create backup of repository

  2. Go through the approved filtered clone creation process

  3. Upload clone to the decided location

  4. If this is a new location

    • transfer/migrate information (e.g. issues)
    • build site and make sure custom url is pointing to correct repo/branch and the site is live.

Phase 5: After upload, update contributors and relevant information

  1. Update CONTRIBUTING.md and related files and text to include any instruction changes (e.g. using LFS)
  2. Notify contributors on actions to take to reconcile with the new repo (e.g. create from new clone url).
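
For contributors who keep an existing clone after a force push, the filter-repo manual quoted earlier mentions `git reset --hard @{u}`. A minimal sketch of that reconcile step, assuming the remote is named `origin` and the current branch tracks an upstream branch; `resync_branch` is a hypothetical name:

```shell
# Hypothetical helper: discard local history on the current branch and match
# the rewritten upstream. WARNING: unpushed local commits and uncommitted
# changes on this branch are lost.
resync_branch() {
  git fetch origin &&
  git reset --hard '@{u}'
}
```

Per the same quoted passage, having contributors clone from a new URL is more likely to get rid of the old history than asking everyone to run a step like this.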

randychilau avatar May 17 '23 22:05 randychilau

@Nikhil-Ladha

leecalcote avatar May 18 '23 06:05 leecalcote

@randychilau FYI - https://discuss.layer5.io/t/looking-for-a-difficult-git-challenge/2996

leecalcote avatar Jul 01 '23 03:07 leecalcote