blowfish
⚙️ minimize repo size
Describe the bug
My CI script fetches the submodule (blowfish theme) every time the build process is run.
The repo size is over 700MB, mostly due to the folders `.git`, `exampleSite` and `public`.
To Reproduce
Steps to reproduce the behavior:
- Check out a fresh copy of the repository:
time git clone https://github.com/nunocoracao/blowfish blowfish-test
On my reasonably fast CI VM, the process takes around 21 seconds.
- Check the directory sizes:
du -h --max-depth 1 ./blowfish-test/
Total size: 736MB. Output:
32K ./blowfish-test/config
60K ./blowfish-test/.github
12K ./blowfish-test/archetypes
124K ./blowfish-test/static
221M ./blowfish-test/exampleSite
24K ./blowfish-test/data
326M ./blowfish-test/.git
4.7M ./blowfish-test/assets
169M ./blowfish-test/public
100K ./blowfish-test/i18n
16M ./blowfish-test/images
476K ./blowfish-test/layouts
736M ./blowfish-test/
The biggest folders are `.git` (326MB), `exampleSite` (221MB) and `public` (169MB).
- Check the directory sizes of `exampleSite`:
du -h --max-depth 1 ./blowfish-test/exampleSite/
Output:
121M ./blowfish-test/exampleSite/content
28K ./blowfish-test/exampleSite/config
8.0K ./blowfish-test/exampleSite/archetypes
84M ./blowfish-test/exampleSite/resources
16K ./blowfish-test/exampleSite/data
17M ./blowfish-test/exampleSite/assets
24K ./blowfish-test/exampleSite/layouts
221M ./blowfish-test/exampleSite/
The biggest folders are `content` (121MB), `resources` (84MB) and `assets` (17MB).
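The size check above is easier to read when the output is sorted. A minimal sketch, assuming GNU `du` and `sort` (`--max-depth` is a GNU flag; BSD/macOS `du` uses `-d 1` instead). The temporary directory and file names are made up for the demo and are not part of the Blowfish repository:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway tree standing in for a checkout (names are illustrative only).
demo=$(mktemp -d)
mkdir -p "$demo/exampleSite" "$demo/layouts"
head -c 3000000 /dev/urandom > "$demo/exampleSite/big.bin"   # ~3 MB
head -c 10000 /dev/urandom > "$demo/layouts/small.html"      # ~10 KB

# Sizes in KB, largest first; the first line is the total for the directory itself.
report=$(du -k --max-depth 1 "$demo" | sort -rn)
echo "$report"

# Path of the largest subdirectory (second line of the report).
largest=$(echo "$report" | sed -n '2p' | cut -f2)
echo "largest: ${largest##*/}"
```

For human-readable sizes, `du -h --max-depth 1 DIR | sort -rh` should give the same ordering on GNU systems.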
Expected behavior
The build performance for CI builds, and the general size of the repository, could be improved by one of the following points:
- Remove the `public` folder from the repository. It contains the generated web page from `exampleSite`, which could easily be regenerated by running `hugo`. (Removes 169MB; also remove the git history of this folder to shrink the `.git` folder.)
- Put the `exampleSite` folder in its own submodule, so a shallow checkout of the theme would be around half of the current size. (Removes 221MB; rewrite the git history to shrink the `.git` folder as well.)
- Exclude the folder `exampleSite/resources/_gen/images` from the repository. These files are generated by the Hugo build process if needed (size improvement is at least 84MB).
- Compress the image files in `exampleSite/assets/img`: converting `png` photographs to `jpg` would take up a lot less space.
Screenshots none.
Desktop (please complete the following information):
- OS: Linux
- Browser Chrome
- Version 117
Hugo & Blowfish versions Hugo 0.101.0, Blowfish latest commit from yesterday.
Additional context See recommended .gitignore file for hugo projects: Hugo.gitignore
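The suggestions about `public` and `exampleSite/resources/_gen` amount to a couple of ignore rules plus untracking the files that are already committed. A hedged sketch in a throwaway repository; the file names and commit identity are placeholders, not Blowfish's actual layout:

```shell
#!/usr/bin/env bash
set -euo pipefail

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git config user.name "Demo"
git config user.email "demo@example.com"

# Pretend the generated output was committed at some point.
mkdir -p public exampleSite/resources/_gen/images layouts
echo "<html></html>" > public/index.html
echo "resized" > exampleSite/resources/_gen/images/a.jpg
echo "{{ .Content }}" > layouts/single.html
git add .
git commit -q -m "initial import"

# Ignore Hugo's generated output from now on.
# The leading **/ makes the second rule match at any depth.
cat > .gitignore <<'EOF'
/public/
**/resources/_gen/
EOF

# Ignore rules do not affect files that are already tracked, so untrack them.
git rm -r -q --cached public exampleSite/resources/_gen
git add .gitignore
git commit -q -m "stop tracking generated files"

tracked=$(git ls-files)
echo "$tracked"
```

This only stops new copies from being committed. The old copies stay in the history, so the `.git` folder does not shrink as a result.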
I sent a note to this effect as well. Reducing the repo to JUST the needful results in a collection which consumes about 26MB.
What I propose is creating a new 'blowfish core' as an isolated repo minus the exampleSite and public dirs (and the related git history), and importing that. I'd proposed it as a thing that would coincide with the breaking change I proposed here: https://github.com/nunocoracao/blowfish/discussions/936. Seeing as implementing this would require a major version bump anyway, it felt like a "good" time to do something drastic like this. But yeah, >700MB to 20MB is substantially different performance-wise in the CI universe.
+1 for reducing the repo. It's really a pain: 700MB for a theme!
+1 for optimizing the repo's size, especially since the upstream version in Congo is <40 MB whereas Blowfish's is nearly 750MB.
Basically all of the bloat is due to this project's additions to `exampleSite` and `public`.
EDIT (2024-07): A sparse clone that excludes these extraneous files takes <2 MB of network activity, <1 second to clone/checkout, and only a few MB of disk usage.
git clone --filter=blob:none --no-checkout --depth=1 --sparse https://github.com/nunocoracao/blowfish.git
cd blowfish
printf '/*\n!exampleSite/*\n!images/*\n!assets/img/*\n!blowfish_logo.png' > .git/info/sparse-checkout
git checkout
It would be a better approach to transfer `exampleSite` and `public` to a new repo.
technically false… because the .git directory/history remains in the repo, and that’s >half the problem.
extracting the core to a new repo would be the least painful way to address that.
git repo surgery is a PITA.
it’s a breaking change almost however you cut it, as any downstream repo / clone will encounter problems with history being rewritten
Hence why I was proposing it coincide with my other breaking change of robustifying the authors construct..
Rewriting git history isn't too complicated, especially if you're removing whole directories, but yeah it would probably be better to move or break off the main theme files elsewhere to avoid breaking everyone else's copies.
You could even maintain stars and such by just keeping this one as the main repo, but with a note that you can clone another one for just the theme alone.
> keeping this one as the main repo, but with a note that you can clone another one for just the theme alone.
A 1:1 repo like `blowfish-lite`
that would require @nunocoracao to maintain two copies of the same codebase… more toil, greater chance for divergence.
what feels least bullshitty to me is having a core repo that has the theme content which this repo consumes…. but wth do i know? :)
> that would require @nunocoracao to maintain two copies of the same codebase… more toil, greater chance for divergence.
> what feels least bullshitty to me is having a core repo that has the theme content which this repo consumes
I was thinking something similar, but even if they were completely separate, you could lazily sync the theme into the main repo on every commit with a GitHub action.
implementation details :) lots of ways to accomplish it… each has its own bag of bullshit… i don’t wanna be too prescriptive of the how… just wanting to amplify the legitimacy of the request
Hey @fuse314 @wolfspyre @ragibson @chromer030 Thanks for all the feedback. I am definitely interested in improving this ASAP. From the thread I got a couple of actions:
- [x] Remove public folder from repo - this one should be done
- [ ] Clean up git history - I am a n00b on this topic. I would probably clean up everything, but I'm not sure about the impact of that. Can anyone help, or does anyone have opinions on how to reduce the .git size?
- [ ] Split the exampleSite from the main repo - I will have a think about this one. Maybe there is a way to drastically reduce the size of the content in that folder so that it can remain in, OR I will look at how to split it into its own separate repo just for documentation.
Small update - the last changes already reduced the repo size from 736M to 546M (25% reduction). I will keep exploring the other two options a little more before committing to a solution.
@nunocoracao if you wanna take a look at what I'm doing async:
I mirror this repo:
https://gitlab.wolfspyre.io/mirrored_repos/hugo_related/nunocoracao/blowfish
The scripts here rip the history around quite a bit... might be useful to play with: https://gitlab.wolfspyre.io/mirrored_repos/hugo_related/nunocoracao/blowfish-wrangler
The output: https://gitlab.wolfspyre.io/mirrored_repos/hugo_related/nunocoracao/blowfish-thin
/tmp$ git clone https://gitlab.wolfspyre.io/mirrored_repos/hugo_related/nunocoracao/blowfish-thin
Cloning into 'blowfish-thin'...
warning: redirecting to https://gitlab.wolfspyre.io/mirrored_repos/hugo_related/nunocoracao/blowfish-thin.git/
remote: Enumerating objects: 5704, done.
remote: Counting objects: 100% (2322/2322), done.
remote: Compressing objects: 100% (1489/1489), done.
remote: Total 5704 (delta 831), reused 2322 (delta 831), pack-reused 3382
Receiving objects: 100% (5704/5704), 36.85 MiB | 38.12 MiB/s, done.
Resolving deltas: 100% (2997/2997), done.
/tmp$ du -sh blowfish-thin/
42M blowfish-thin/
/tmp$ cd blowfish-thin
/tmp/blowfish-thin (wpl_main)$ ls
CODE_OF_CONDUCT.md README.md config.toml i18n package-lock.json tailwind.config.js
CONTRIBUTING.md archetypes data layouts package.json theme.toml
FUNDING.yml assets firebase.json lighthouserc.js processUsers.js
LICENSE config go.mod netlify.toml static
/tmp/blowfish-thin (wpl_main)$ rm -rf .git
/tmp/blowfish-thin$ du -sh .
4.4M .
versus:
/tmp$ git clone https://gitlab.wolfspyre.io/mirrored_repos/hugo_related/nunocoracao/blowfish
Cloning into 'blowfish'...
warning: redirecting to https://gitlab.wolfspyre.io/mirrored_repos/hugo_related/nunocoracao/blowfish.git/
remote: Enumerating objects: 23670, done.
remote: Total 23670 (delta 0), reused 0 (delta 0), pack-reused 23670
Receiving objects: 100% (23670/23670), 358.79 MiB | 40.04 MiB/s, done.
Resolving deltas: 100% (11428/11428), done.
Updating files: 100% (1864/1864), done.
/tmp$ du -sh blowfish/
815M blowfish/
/tmp$ cd blowfish
/tmp/blowfish
/tmp/blowfish (wpl_main)$ rm -rf .git
/tmp/blowfish$ du -sh .
445M .
I'm not gonna assert this is perfect :) it's kludgey... and a bit brittle... but it works for the moment.... and it might be helpful as an exploratory POC. Will reach out in email.
@wolfspyre I checked your solution, but it seems that it's not possible to change history and keep using the same git repo, as things become incompatible, right?
@nunocoracao it’s possible for sure, but every path comes with caveats… when you rewrite history, it invalidates others’ versions of it.. :)
so it’s something that needs to be done in a coordinated and clear fashion..
the least messy way forward (IMO) is likely a clean repo for ‘BLOWFISH CORE’ or something that blowfish imports… saves you the hassle of rewriting history … gives you a blank slate for tomorrow… keeps existing repo around… (shouldn’t ) make anyone adjust their existing tooling
but that then requires the plumbing which slurps blowfish core into blowfish anytime core changes…
alternatively,
- create ‘blowfish-legacy’ (or something)
- push current blowfish repo to that,
- prune stuff
- rewrite history
- force push to blowfish…
this would mean anyone consuming the repo would have to manually twiddle git, as history changed…
something like this would coincide well with a major version change…
people that don’t want to follow along can switch their upstream from blowfish to blowfish-legacy and be insulated from any breaking changes
there’s many ways forward, each comes with some nuance and sticky spots…
there’s a few REALLY BAD IDEA ways forward, but barring those, most of the options are viable for a given set of constraints… which makes the most sense depends on your unique needs/preferences as much as the technical requirements/limitations yknow?
@wolfspyre not really comfortable with messing with the git history. Meanwhile, trimmed it down again from 553mb to 460mb by reducing image sizes
Is it worth looking at using .webp image formats over .jpg and .png? I've started implementing this on my own site and it works great.
> Is it worth looking at using .webp image formats over .jpg and .png? I've started implementing this on my own site and it works great.
I'd note that the old versions of the images will still be stored in .git's bookkeeping, but that would presumably help on shallow clones.
@fuse314 @wolfspyre Speaking of which, in a CI environment you should probably be cloning with `--depth=1` to only include history truncated to the most recent commit. That'll cut the current repo size from ~470MB to ~145MB.
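To see the effect of `--depth=1` without touching the network, here is a small sketch against a local stand-in repository. The repository contents are invented for the demo; only the `git clone --depth=1` call corresponds to the advice above:

```shell
#!/usr/bin/env bash
set -euo pipefail

work=$(mktemp -d)
cd "$work"

# Stand-in upstream with three commits, each replacing a ~1 MB binary asset.
git init -q upstream
git -C upstream config user.name "Demo"
git -C upstream config user.email "demo@example.com"
for i in 1 2 3; do
  head -c 1000000 /dev/urandom > upstream/asset.bin
  git -C upstream add asset.bin
  git -C upstream commit -q -m "asset revision $i"
done

# file:// forces the regular transport; with a plain local path git ignores --depth.
git clone -q "file://$work/upstream" full
git clone -q --depth=1 "file://$work/upstream" shallow

full_commits=$(git -C full rev-list --count HEAD)
shallow_commits=$(git -C shallow rev-list --count HEAD)
full_kb=$(du -sk full/.git | cut -f1)
shallow_kb=$(du -sk shallow/.git | cut -f1)

echo "full:    $full_commits commits, ${full_kb} KB in .git"
echo "shallow: $shallow_commits commits, ${shallow_kb} KB in .git"
```

The shallow clone only carries the blobs reachable from the latest commit, which is why replaced or deleted assets stop counting against it.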
something to keep in mind.... git history keeps EVERYTHING that EVER was in the repo.... every time you convert an asset, you're increasing the repo size by that much...
This is part of the reason why having the public version of the site in the repo is problematic: due to the asset hashing/fingerprinting, every new version has almost a full copy of the public site and all its images...
/tmp$ git clone https://github.com/nunocoracao/blowfish.git
Cloning into 'blowfish'...
remote: Enumerating objects: 24053, done.
remote: Counting objects: 100% (24053/24053), done.
remote: Compressing objects: 100% (10846/10846), done.
remote: Total 24053 (delta 11552), reused 23709 (delta 11441), pack-reused 0
Receiving objects: 100% (24053/24053), 379.26 MiB | 20.45 MiB/s, done.
Resolving deltas: 100% (11552/11552), done.
/tmp$
/tmp$ du -sh blowfish/; cd blowfish
457M blowfish/
/tmp/blowfish (main)$
/tmp/blowfish (main)$ git-filter-repo --analyze
Processed 14043 blob sizes
Processed 1505 commits
Writing reports to .git/filter-repo/analysis...done.
/tmp/blowfish (main)$
/tmp/blowfish (main)$ head -13 .git/filter-repo/analysis/directories-all-sizes.txt
=== All directories by reverse size ===
Format: unpacked size, packed size, date deleted, directory name
1109829186 860493241 <present> <toplevel>
549038949 513229890 <present> exampleSite
279498292 268756950 <present> exampleSite/content
174077004 165795885 2023-10-15 public
162893073 155468174 2023-10-15 exampleSite/resources/_gen/images
162893073 155468174 2023-10-15 exampleSite/resources/_gen
162893073 155468174 2023-10-15 exampleSite/resources
131061836 124922519 <present> exampleSite/content/docs
113300504 110829169 2023-10-15 public/docs
103408455 102654615 2023-10-15 exampleSite/resources/_gen/images/docs
100530759 58280398 2022-10-02 docs
/tmp/blowfish (main)$
/tmp/blowfish (main)$ head -10 .git/filter-repo/analysis/directories-deleted-sizes.txt
=== Deleted directories by reverse size ===
Format: unpacked size, packed size, date deleted, directory name
174077004 165795885 2023-10-15 public
162893073 155468174 2023-10-15 exampleSite/resources/_gen/images
162893073 155468174 2023-10-15 exampleSite/resources/_gen
162893073 155468174 2023-10-15 exampleSite/resources
113300504 110829169 2023-10-15 public/docs
103408455 102654615 2023-10-15 exampleSite/resources/_gen/images/docs
100530759 58280398 2022-10-02 docs
46014364 35717600 2022-09-12 exampleSite/docs
/tmp/blowfish (main)$
/tmp/blowfish (main)$ rm -rf .git
/tmp/blowfish$ du -sh .
72M .
Yes, webp is marginally better compression-wise, but if stuff's already reasonably compressed /sized, the full replication of the asset in history likely outweighs any gains in asset size from compression...
now, for NEW assets, certainly worth exploring, but I leave that decision to @nunocoracao ;)
@ragibson
my ci already is doing so, (plus I haz local mirror of repo so its less of an issue (FOR ME) but that's beside the point of curbing the bloat, which @nunocoracao 's already substantially impacted ( <3 ) rolling forward... now it's a simple question of where else the juice is worth the squeeze ;)
Dumb question probably. Is there any way you "manipulate" the git info and still use this same repo?
> Dumb question probably. Is there any way you "manipulate" the git info and still use this same repo?
So, it's really more that rewriting git history will break anyone else's checkout/clone of the repo, though it is easy enough to fix on their end for an experienced user by pruning the git repo or simply recloning.
You could mess with history all you want and still use this repo IF that were an acceptable result (it's probably not).
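What "breaking everyone else's checkout" looks like in practice can be sketched with local repositories. This assumes the downstream user has no local commits worth keeping; the repository names, the `main` branch name and the commit identity are placeholders for the demo:

```shell
#!/usr/bin/env bash
set -euo pipefail

work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Local bare repository standing in for the hosted one.
git init -q --bare upstream.git
git -C upstream.git symbolic-ref HEAD refs/heads/main

# Maintainer publishes two commits.
git init -q maintainer
git -C maintainer symbolic-ref HEAD refs/heads/main
git -C maintainer remote add origin "$work/upstream.git"
echo "one" > maintainer/a.txt
git -C maintainer add a.txt
git -C maintainer commit -q -m "first"
echo "two" > maintainer/b.txt
git -C maintainer add b.txt
git -C maintainer commit -q -m "second"
git -C maintainer push -q origin main

# A user clones before the rewrite.
git clone -q "$work/upstream.git" downstream

# Maintainer rewrites the last commit and force-pushes.
git -C maintainer commit -q --amend -m "second (rewritten)"
git -C maintainer push -q --force origin main

# The user's ordinary update can no longer fast-forward.
if git -C downstream pull -q --ff-only >/dev/null 2>&1; then
  pull_result="fast-forwarded"
else
  pull_result="rejected"
fi
echo "pull --ff-only: $pull_result"

# Recovery when there is no local work to keep: take upstream as-is.
git -C downstream fetch -q origin
git -C downstream reset -q --hard origin/main

upstream_head=$(git -C maintainer rev-parse HEAD)
downstream_head=$(git -C downstream rev-parse HEAD)
echo "upstream:   $upstream_head"
echo "downstream: $downstream_head"
```

The `reset --hard` discards local changes, so anyone with their own commits on top would need to rebase them instead. Recloning has the same effect as the fetch-and-reset pair.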
One other strategy is to clean up the project and then use git replace to transparently ignore older files from the git history unless they are absolutely needed. That does not rewrite history and would reduce the size of git clones tremendously, but it is definitely a more advanced operation and comes with its own series of gotchas. See something like https://stackoverflow.com/a/17622991 for more details.
Thanks @ragibson super appreciate the help. And also sorry everyone, this is mainly due to me including a bunch of needed folders initially in the repo which were deleted several versions ago.
@ragibson is there a safe way for me to test these solutions in a separate repo - e.g. forking Blowfish and then trying out these git operations in it? One of my concerns is the risk involved in f-ing this up for everyone.
Sure -- I don't think GitHub will let you fork your own repo, but you can try
- making a new repo and adding some noticeably large files (or just clone this one and push it to a second repo for testing)
- remove those files in a commit
- notice that the repository is still large on a fresh clone/checkout
- play around with truly erasing the files from the git history, etc.
You're right that I wouldn't recommend experimenting on the production repository itself
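The experiment described above can be run entirely in a scratch directory. The sketch below uses `git filter-branch` only because it ships with git; git's own documentation recommends `git filter-repo` for real rewrites. The file names and sizes are invented for the demo:

```shell
#!/usr/bin/env bash
set -euo pipefail

work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.name "Demo"
git config user.email "demo@example.com"

# 1. Commit a small "theme" file plus ~2 MB of incompressible "generated" data.
echo "theme" > theme.txt
mkdir public
head -c 2000000 /dev/urandom > public/big.bin
git add .
git commit -q -m "add theme and generated site"

# 2. Remove the large directory in a normal commit. The blob stays in history.
git rm -r -q public
git commit -q -m "remove public"
git gc -q --prune=now
kb_after_delete=$(du -sk .git | cut -f1)

# 3. Erase the directory from every commit, then drop the backups git keeps.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f --prune-empty \
  --index-filter 'git rm -r -q --cached --ignore-unmatch public' \
  -- --all >/dev/null 2>&1
git for-each-ref --format='%(refname)' refs/original |
  while read -r ref; do git update-ref -d "$ref"; done
git reflog expire --expire=now --all
git gc -q --prune=now
kb_after_rewrite=$(du -sk .git | cut -f1)

echo ".git after deleting the folder: ${kb_after_delete} KB"
echo ".git after rewriting history:   ${kb_after_rewrite} KB"
```

Every commit hash changes in step 3, which is the part that invalidates existing clones.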
I think you can just make a new repo, let's call it "newtestrepo", on GitHub.
Clone this blowfish repo to a local folder "newtestrepo".
Change the git remote URL to point at newtestrepo: `git remote set-url origin https://github.com/user/newtestrepo.git`
I don't know if the `.github` folder does unexpected things, or if you have to set up any processes manually...
Push the repo to the new location with `git push origin`.
Then rewrite the history and push the changes to the test repo with `git push origin` (probably with the "force" option).
Freshly clone the test repo into another local folder and check the new folder size.
While this doesn't address the root issue, my current workaround for the CI pipeline is to simply download Blowfish's latest release archive. The archive does include `exampleSite/`, but not the `public/` dir and git history.
For the latest release (v2.44.0), it's a 67mb download and 72mb unarchived. The pipeline step only takes 3s on a default GitHub runner.
# tarball_url serves a gzipped tar archive, so name the file accordingly.
curl -L -o blowfish.tar.gz "$(curl -s https://api.github.com/repos/nunocoracao/blowfish/releases/latest | jq -r '.tarball_url')"
tar --one-top-level=themes/blowfish --strip-components=1 -xzf blowfish.tar.gz
EDIT: I just realized the downloaded archive probably gets included in the deployment to e.g. Firebase Hosting (unless you ignore it in `firebase.json`). So you'll need to either delete it after unarchiving, or better yet, here's a one-liner that doesn't write it to disk:
tar --one-top-level=themes/blowfish --strip-components=1 -xzf <(curl -Ls $(curl -s https://api.github.com/repos/nunocoracao/blowfish/releases/latest | jq -r '.tarball_url'))
I'm probably late to the discussion, but I'd suggest cloning blowfish as a shallow submodule rather than directly. Consider the following repo where I also use blowfish. You can clone it with the modules using `--recurse-submodules`:
git clone https://github.com/madoke/madoke.org.git blowfish-test --recurse-submodules
Cloning into 'blowfish-test'...
remote: Enumerating objects: 1461, done.
remote: Counting objects: 100% (713/713), done.
remote: Compressing objects: 100% (370/370), done.
remote: Total 1461 (delta 283), reused 676 (delta 252), pack-reused 748
Receiving objects: 100% (1461/1461), 36.68 MiB | 10.11 MiB/s, done.
Merge branch 'main' of github.com:madoke/madoke.org
Resolving deltas: 100% (413/413), done.
Submodule 'themes/blowfish' (https://github.com/madoke/blowfish) registered for path 'themes/blowfish'
Cloning into '/Users/madoke/work/blowfish-test/themes/blowfish'...
remote: Enumerating objects: 17536, done.
remote: Counting objects: 100% (1209/1209), done.
remote: Compressing objects: 100% (547/547), done.
remote: Total 17536 (delta 676), reused 1139 (delta 634), pack-reused 16327
Receiving objects: 100% (17536/17536), 373.44 MiB | 19.53 MiB/s, done.
Resolving deltas: 100% (9459/9459), done.
Submodule path 'themes/blowfish': checked out '96cbca1d4d2ce7dddbdae5ea940d749aa16929a6'
Checking the size reveals that the latest version of blowfish takes only 73M, which I guess is already significantly small due to previous efforts:
du -sh blowfish-test/themes/
73M blowfish-test/themes/
The key thing here is that the `.git` folder containing the history is not pulled entirely, as we can see this one takes 4K, while cloning blowfish directly will pull the entire history, which takes 400M+:
du -sh blowfish-test/themes/blowfish/.git
4.0K blowfish-test/themes/blowfish/.git
Hope this helps anyone !
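One caveat on the numbers above: inside a submodule, `.git` is normally a small pointer file, and the submodule's objects live under the parent repository's `.git/modules/` directory. The clone log above also shows 373.44 MiB being received for the submodule, so the 4.0K figure appears to measure that pointer file rather than a truncated history. To actually truncate it, `git clone` has a `--shallow-submodules` flag. A sketch with local stand-in repositories (all names and contents are invented; `protocol.file.allow=always` is only needed because the demo uses `file://` URLs):

```shell
#!/usr/bin/env bash
set -euo pipefail

work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Stand-in theme repository with three commits of history.
git init -q theme
for i in 1 2 3; do
  echo "theme v$i" > theme/theme.txt
  git -C theme add theme.txt
  git -C theme commit -q -m "theme v$i"
done

# Stand-in site that vendors the theme as a submodule.
git init -q site
git -C site -c protocol.file.allow=always \
  submodule --quiet add "file://$work/theme" themes/blowfish
git -C site commit -q -m "add theme submodule"

# CI-style clone: submodules are fetched with a depth of 1.
git -c protocol.file.allow=always clone -q \
  --recurse-submodules --shallow-submodules "file://$work/site" site-ci

theme_commits=$(git -C site-ci/themes/blowfish rev-list --count HEAD)
echo "theme commits fetched: $theme_commits"

# The submodule's .git is a pointer file; its objects sit under .git/modules.
cat site-ci/themes/blowfish/.git
du -sk site-ci/.git/modules/themes/blowfish
```

For an existing checkout, `git submodule update --init --depth 1` should have a similar effect, provided the pinned commit is reachable at that depth.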
Not sure if it's easier to just use hugo module instead, so we don't need to deal with the `.git`?
A small migration will be needed for users, of course.
current submodule:
--- /private/tmp/mynewsite ------------------------------
148.4 MiB [##################################] /themes
81.2 MiB [################## ] /.git
28.0 KiB [ ] /config
4.0 KiB [ ] /archetypes
4.0 KiB [ ] .gitmodules
4.0 KiB [ ] hugo.toml
(...omitted)
hugo mod (gathered with `$ hugo config | grep cachedir`)
--- /Users/<redacted>/Library/Caches/hugo_cache/modules/filecache/modules/pkg/mod/github.com/nunocoracao/blowfish/[email protected] -----
/..
93.3 MiB [##################################] /assets
47.2 MiB [################# ] /exampleSite
6.2 MiB [## ] /images
520.0 KiB [ ] blowfish_logo.png
480.0 KiB [ ] /layouts
(...omitted)
To reduce the size further:
1. Get rid of `exampleSite`, as mentioned by many others above.
2. `mermaid` in the `assets/lib/` takes 89.8 MiB. After removing the `*.js.map` source maps it comes down to 55.3 MiB. I haven't checked how the packages are pulled, but there should be room for improvement? (We can of course just leverage a CDN, but I personally prefer the self-contained way for assets.)
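Stripping the source maps mentioned above is a one-line `find`. A sketch on a throwaway directory; the file names are made up and do not reflect the actual contents of `assets/lib/mermaid`:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway directory mimicking a vendored library (file names are invented).
lib="$(mktemp -d)/assets/lib/mermaid"
mkdir -p "$lib"
head -c 200000 /dev/urandom > "$lib/mermaid.min.js"
head -c 500000 /dev/urandom > "$lib/mermaid.min.js.map"
head -c 300000 /dev/urandom > "$lib/chunk-demo.js.map"

kb_before=$(du -sk "$lib" | cut -f1)

# Print each source map, then delete it. Only *.js.map files are matched.
find "$lib" -type f -name '*.js.map' -print -delete

kb_after=$(du -sk "$lib" | cut -f1)
remaining_maps=$(find "$lib" -type f -name '*.js.map' | wc -l | tr -d ' ')
echo "before: ${kb_before} KB, after: ${kb_after} KB, *.js.map left: ${remaining_maps}"
```

Source maps are normally only read by debugging tools, so the scripts themselves should keep working. As with the other deletions in this thread, files that were already committed remain in the git history.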
> 2. `mermaid` in the `assets/lib/` takes 89.8 MiB.
I just noticed this same thing: `lib/mermaid` takes up ~96% of the entire assets folder. Was it bundled incorrectly?
> @fuse314 @wolfspyre Speaking of which, in a CI environment you should probably be cloning with `--depth=1` to only include history truncated to the most recent commit. That'll cut the current repo size from ~470MB to ~145MB.
Compared to my comment last October, the full repo size has increased to ~623 MB, with a `depth=1` clone being ~220 MB.