Lean git clone – move large files to LFS
Cloning the Playground repo can be slow without `--depth=1` (a shallow clone) because git will download the full history, including all historical .wasm builds and WordPress builds. One way to make it smaller is rewriting the history to move all the large files to LFS. Now, I don't like the idea of rewriting the history, but we could create a new branch off trunk and rewrite it, effectively keeping a historical, large trunk and a lean new main. We could then change the default branch to main.
Major kudos to @brandonpayton @JanJakes @zaerl @mho22 @akirk @fellyph for brainstorming these ideas
Here are some commands from GPT we may be able to use for this process:
LFS is controlled by .gitattributes, which lives in each branch. So one branch can track files with LFS while others don't.
What that means in real life:
- If you add `.gitattributes` on `main` and rewrite only `main` to move big files to LFS pointers, then set `main` as the only active branch (delete/hide the rest and the heavy tags), a default `git clone` becomes lean.
- If you keep old branches/tags that still reference big blobs, they remain reachable, and default clones still pull the weight – LFS-on-one-branch won't fix that.
How to do “branch-only LFS” cleanly:
- On `main`, add tracking rules:
git checkout main
git lfs install
git lfs track "*.png" "*.jpg" "*.zip" "*.pdf" "*.mp4"
git add .gitattributes
git commit -m "Track large files with LFS on main"
- Rewrite just `main` history to move existing large files to LFS:
git lfs migrate import --include="*.png,*.jpg,*.zip,*.pdf,*.mp4" --include-ref=refs/heads/main
git push --force origin main
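After the migration, a few sanity checks could confirm the rewrite actually worked. These are illustrative commands to run inside the rewritten clone, not part of GPT's original suggestion:

```shell
#!/bin/sh
# Illustrative post-migration checks (run inside the rewritten clone):
git lfs ls-files | head -n 5         # large files should now be LFS pointers
git count-objects -vH                # on-disk size of the local object store
git rev-list --objects main | wc -l  # object count reachable from main
```

Comparing `git count-objects -vH` before and after the rewrite is a quick way to see how much the default clone would shrink.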
- Make the clone small without user flags:
  - Delete stale branches and heavyweight tags you don’t need:
# example
git push origin :old-big-branch
git tag -d v0.1 v0.2; git push origin :refs/tags/v0.1 :refs/tags/v0.2
- If you must keep deep archaeology, move it to a separate “-archive” repo.
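The archive step could be as simple as a mirror push. A rough sketch follows; the repository names are hypothetical, and local throwaway directories stand in for the real remotes so the example is self-contained:

```shell
#!/bin/sh
set -e
# Sketch: preserve the full, heavy history in a separate "-archive" repo
# before rewriting. Local temp dirs stand in for the real remotes.
src=$(mktemp -d)
git init -q "$src"
git -C "$src" -c user.email=dev@example.com -c user.name=dev \
  commit -q --allow-empty -m "history worth archiving"
archive=$(mktemp -d)/playground-archive.git
git init -q --bare "$archive"
# --mirror copies every ref (all branches and tags), so nothing is lost
git clone -q --mirror "$src" "$src.mirror"
git -C "$src.mirror" push -q --mirror "$archive"
git -C "$archive" log --oneline -1
```

With the real repos, the last three commands would target the GitHub URLs instead of local paths.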
Here is the list of the 50 biggest blobs in the repo:
$ git rev-list --all --objects | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | sed -n 's/^blob //p' | sort --numeric-sort --key=2 -r | head -n 50 | awk '{ printf "%.2f MB\t%s\t%s\n", $2/1048576, $1, $3 }'
60.74 MB 319bae1b70c52c83c08ab4317425547d6d632aca packages/playground/website/playwright/deploy-e2e-mid-release.zip
59.36 MB 46bc032c2388082b25e70b7cee05107234c4e9a6 packages/playground/website/playwright/deploy-e2e-old-release.zip
59.36 MB da95ec8afd19d1bec162732eb331b58ad4c8cd8f public/php-web-wordpress.data
47.16 MB a4994db7a8732d2f4f1d66772ef7b9c5045b07cb public/php-web-wordpress.data
43.52 MB e325d4606154ac4d8b5676f7cfe25b2aed2accd0 dist-web/wp.data
35.23 MB ff34fd9ad43ce03075e5429eb16b4b4c4bbbd957 packages/php-wasm/node/asyncify/8_4_11/php_8_4.wasm
35.23 MB 5c4af57888c52efc934a6515170e186dd5d2a2ae packages/php-wasm/node/asyncify/8_4_11/php_8_4.wasm
35.23 MB e087fc134dcf84b463fbfe360972be2fe18c8a39 packages/php-wasm/node/asyncify/8_4_10/php_8_4.wasm
35.23 MB 56b80b5c4dfc0126b137a675d8e9d062e3f999e9 packages/php-wasm/node/asyncify/8_4_10/php_8_4.wasm
35.08 MB a7733707cae14bdcd0a562601ed388a120082d8f packages/php-wasm/node/asyncify/8_4_10/php_8_4.wasm
35.06 MB 3fd69ae787eb1d9e5cd74204d591802e67563603 packages/php-wasm/node/asyncify/8_4_0/php_8_4.wasm
35.06 MB 1a6d8e4c149cb11bc559d7acac882d474d3bee2e packages/php-wasm/node/asyncify/8_4_0/php_8_4.wasm
35.06 MB 5a2b1ae650f24a6e0c7717a2f9b7dfd667c6ca89 packages/php-wasm/node/asyncify/8_4_0/php_8_4.wasm
35.06 MB 36329c9a671937bacbc5625066548ab268dee214 packages/php-wasm/node/asyncify/8_4_0/php_8_4.wasm
35.04 MB 59c6eca519e3b39cb009c9dba6b9e56c9ec91a54 packages/php-wasm/node/asyncify/8_4_0/php_8_4.wasm
34.69 MB 99e4cfe8ac6434a90c36dec80040707b5e92e945 packages/php-wasm/node/jspi/8_4_11/php_8_4.wasm
34.69 MB e5071cf59da18b29d59d6319eb77348c72a86614 packages/php-wasm/node/jspi/8_4_10/php_8_4.wasm
34.69 MB 62b14f231be1f06855012a84385999cef3a54105 packages/php-wasm/node/jspi/8_4_10/php_8_4.wasm
34.53 MB e972a4e032d3d6088bb2893033a20054233e5a61 packages/php-wasm/node/jspi/8_4_0/php_8_4.wasm
34.53 MB 54558ec4eb08b83282887e9c0b89b20f9e370995 packages/php-wasm/node/jspi/8_4_0/php_8_4.wasm
34.52 MB 8df64949d7090f4ea5507bdc2c872def8506e33e packages/php-wasm/node/jspi/8_4_0/php_8_4.wasm
34.52 MB 609f6ac16dcab892440b076c8e7cd4d336b916af packages/php-wasm/node/jspi/8_4_0/php_8_4.wasm
34.52 MB 551312cebe09ca468e6baed2b26501eaf8dfcc4b packages/php-wasm/node/jspi/8_4_0/php_8_4.wasm
31.61 MB 90f8044bca3f7ddeb0d30efd75e2c880a84c1687 packages/php-wasm/node/asyncify/8_3_24/php_8_3.wasm
31.61 MB cc1220cfc821735a7ea7d499c542fc9a7754faaa packages/php-wasm/node/asyncify/8_3_23/php_8_3.wasm
31.61 MB 5447a59427e891ba4bd9bc332f43bac33de6f17f packages/php-wasm/node/asyncify/8_3_23/php_8_3.wasm
31.61 MB 849bfa3926b8324b2df7cac02d900691d9110bb1 packages/php-wasm/node/asyncify/8_3_24/php_8_3.wasm
31.46 MB e04b47ec547168d8fec15dc7deada2108fe303ee packages/php-wasm/node/asyncify/8_3_23/php_8_3.wasm
31.42 MB 22d39bf945071c5c489f565694e3158dfe084aa1 packages/php-wasm/node/asyncify/8_3_0/php_8_3.wasm
31.42 MB 0d686f46cc14b2ce4cbffbd4c892f07ac8e1d04f packages/php-wasm/node/asyncify/8_3_0/php_8_3.wasm
31.42 MB 0c716f86d110652a0c6060551755a4a8ebcf0898 packages/php-wasm/node/asyncify/8_3_0/php_8_3.wasm
31.42 MB 6b58c5f2931250f8a0629ddd209612dad1bfdd3f packages/php-wasm/node/asyncify/8_3_0/php_8_3.wasm
31.41 MB 40dce2b700f0494c2babfe46724076073c494b91 packages/php-wasm/node/asyncify/8_3_0/php_8_3.wasm
31.06 MB 76fd277e7da8a6b01215f5c6f79515f680163964 packages/php-wasm/node/jspi/8_3_24/php_8_3.wasm
31.06 MB cf6928345bb911c59d7a055b93fcd36448fc709a packages/php-wasm/node/jspi/8_3_23/php_8_3.wasm
31.06 MB 8d6e91e59c9fb43d9216b26e0f36575a0c2fdc23 packages/php-wasm/node/jspi/8_3_23/php_8_3.wasm
30.89 MB f2111b5b70b1415d9fb81a119a3a03a00adb6ee9 packages/php-wasm/node/jspi/8_3_0/php_8_3.wasm
30.89 MB bcab5513c495df82825548be6d86d6300889be20 packages/php-wasm/node/jspi/8_3_0/php_8_3.wasm
30.89 MB b9d44485d08e0652cdec2038a5c0d62fdffa2ebe packages/php-wasm/node/jspi/8_3_0/php_8_3.wasm
30.89 MB 96ba3b538ce989fe08b954a2cf18bfc1da079cd5 packages/php-wasm/node/jspi/8_3_0/php_8_3.wasm
30.89 MB 16ab71951ca0168d79ae3a94e7f1c46c762112e0 packages/php-wasm/node/jspi/8_3_0/php_8_3.wasm
30.56 MB b74fd68e7c3f5974e4a7ef4a1561d57852381be9 packages/php-wasm/node/asyncify/8_2_29/php_8_2.wasm
30.56 MB 9f482d567bf2b29f5542b74a6abe1e414cc68fc6 packages/php-wasm/node/asyncify/8_2_29/php_8_2.wasm
30.56 MB 819070471c90db93a49fc98ffbb1d69a98506325 packages/php-wasm/node/asyncify/8_2_29/php_8_2.wasm
30.56 MB 60f5af5ec0ed3affa337c600b14ee104876baa7d packages/php-wasm/node/asyncify/8_2_29/php_8_2.wasm
30.42 MB 3b615c7a0511f5396d470dd656bdb63ff988c3ed packages/php-wasm/node/asyncify/8_2_29/php_8_2.wasm
30.38 MB 5937a8036f5d72ae9fa012989293c3fee015988b packages/php-wasm/node/asyncify/8_2_10/php_8_2.wasm
30.38 MB a9a1e2cfe804f23b83b1c6c8df7dbc61b602ee9d packages/php-wasm/node/asyncify/8_2_10/php_8_2.wasm
30.38 MB af06dc89d6b44d06588d314e9ec2cc38a2ada1cb packages/php-wasm/node/asyncify/8_2_10/php_8_2.wasm
30.38 MB 20ca3bb0f8e0c4314d85356499308f15da6cfb94 packages/php-wasm/node/asyncify/8_2_10/php_8_2.wasm
and I think just adding them to git lfs won't cut it; rather, we need to prune them from the git history, which can be done with:
git filter-repo --path-glob '*.wasm' --path-glob '*.zip' --invert-paths
or by using the above proposed solution to migrate big files to LFS.
Also, what about not including build files in the git repo at all and attach them as the release assets?
Also, what about not including build files in the git repo at all and attach them as the release assets?
I was thinking about that, too! Unfortunately, that approach seems the less practical of the two:
- Creating and reviewing Pull Requests would get more difficult – where would the rebuilt wasm assets live? They need to be included in the PR as it's not reasonable to ask every reviewer to wait 10 hours for a full rebuild.
- Every wasm-affecting commit would require a release, otherwise `git checkout old-commit` wouldn't reliably reflect the old project state. Ditto for `git bisect`. This would limit the ability to work with the repository while offline.
- While the final bundle only ships `php.wasm` files, we also keep `openssl` and other libraries, so we'd need to put them in the release as well.
- We'd need a bunch of code to orchestrate this commit <-> release synchronization, download the artifacts in the local dev setup, CI, etc. It would require maintenance and likely break every now and then.
@adamziel My biggest concern about LFS is GitHub limits:
| Plan | Bandwidth | Storage |
|---|---|---|
| GitHub Free | 10 GiB | 10 GiB |
| GitHub Pro | 10 GiB | 10 GiB |
| GitHub Free for organizations | 10 GiB | 10 GiB |
| GitHub Team | 250 GiB | 250 GiB |
| GitHub Enterprise Cloud | 250 GiB | 250 GiB |
Even at the highest tiers, it seems like LFS bandwidth costs could add up quickly for a repo like ours with almost 4GB of .wasm files per revision.
Perhaps my imagination is failing me here, but I do not see a reasonable path forward with LFS due to these limits. We can explore self-hosting LFS, but non-maintainers won't be able to use this without explicit permissions, will they?
After discussing this with Claude a bit and reflecting, I tend to agree with @thelovekesh that it might be worth exploring managing Wasm builds using GitHub releases, even for in-progress PRs.
AFAIK, GitHub does not place limits on releases (though if we wanted to, we could proactively delete releases older than a year or two, guessing that they should no longer be useful for things like git bisect). We could manage their creation and download using Git hooks.
I think this could also help folks with their own Playground forks on GitHub. If we script this right with Git hooks, shouldn't creating GH releases in their personal forks just work? Then, if we want to test a PR from a fork, the Git hooks should just be able to pull the associated GitHub release(s) from their fork.
This is a bit hand-wavy at the moment, but I think it might be worth exploring further.
What do you think?
--
Here are some specific responses to the current conversation:
Also, what about not including build files in the git repo at all and attach them as the release assets?
I was thinking about that, too! Unfortunately, that approach seems the less practical of the two:
* Creating and reviewing Pull Requests would get more difficult – where would the rebuilt wasm assets live? They need to be included in the PR as it's not reasonable to ask every reviewer to wait 10 hours for a full rebuild.
@adamziel Maybe creating an associated GitHub release could be done via Git hook every time changes are pushed to a PR branch and Wasm builds have changed?
* Every wasm-affecting commit would require a release, otherwise `git checkout old-commit` wouldn't reliably reflect the old project state. Ditto for `git bisect`. This would limit the ability to work with the repository while offline.
Couldn't we make Git hooks that pull the right release after every checkout?
Regarding offline support, I think using LFS would already break that because large files are only downloaded on demand. Am I missing something?
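To make the hook idea a bit more concrete, here is a rough sketch of what such a `post-checkout` hook could look like. Everything repo-specific here is an assumption: the `wasm-<sha>` release tag scheme, the `packages/php-wasm/prebuilt` target directory, and the use of the `gh` CLI. The sketch installs the hook into a throwaway repo so it is self-contained:

```shell
#!/bin/sh
set -e
# Throwaway repo so the sketch is runnable anywhere; in practice the hook
# would live in the real clone's .git/hooks/ directory.
repo=$(mktemp -d)
git init -q "$repo"
hook="$repo/.git/hooks/post-checkout"
cat > "$hook" <<'EOF'
#!/bin/sh
# post-checkout args: $1=old HEAD, $2=new HEAD, $3=1 for a branch checkout
[ "$3" = "1" ] || exit 0
sha=$(git rev-parse --short HEAD)
# Hypothetical tag scheme: one release per wasm-affecting commit.
gh release download "wasm-$sha" --dir packages/php-wasm/prebuilt --clobber \
  2>/dev/null || echo "no prebuilt wasm release for $sha" >&2
EOF
chmod +x "$hook"
echo "installed $hook"
```

The same hook running in a fork would transparently pull releases from that fork's repository, since `gh` resolves the repo from the remote.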
* While the final bundle only ships `php.wasm` files, we also keep `openssl` and other libraries so we'd need to put them in the release as well.
Perhaps these could be managed as separate releases but in the same way as the .wasm prebuilds.
* We'd need a bunch of code to orchestrate this commit <-> release synchronization, download the artifacts in the local dev setup, CI etc. It would require maintenance and likely break every now and then.
Agreed. We'd need some code to coordinate this setup. I wonder how far we could get with scripting these updates via Git hooks. Maybe local dev and CI environments could rely on the same hooks.
@brandonpayton that could work if there was an easy way of working with it. We'd need some kind of version tag baked into the .js and .wasm files so Playground could refuse to boot on version mismatch.
Also, we'd need one release per commit range – how would we store and lookup the mapping?
Couldn't we make Git hooks that pull the right release after every checkout?
We could. Personally, I never let git hooks run because of how many times I've lost my work to a bug in hook stash/reset/lint/whatever flow, but perhaps most people would still benefit from that?
@adamziel may I know:
- how frequently do these wasm builds change?
- are there specific files on which wasm builds are updated? Like, does updating any file in `packages/php-wasm` cause a rebuild?
- any idea how much time it takes to build a wasm binary on a GitHub Actions runner? On my machine it only takes a few minutes.
@thelovekesh
how frequently do these wasm builds change?
A few times in most months (older history).
are there specific files on which wasm builds are updated? Like, does updating any file in packages/php-wasm cause a rebuild?
You need to run a rebuild manually; there is no CI automation for it. Updating wasm is typically not a by-product of updating some file but the intention behind the work. The specific files are typically php_wasm.c, Dockerfile, the emscripten library, C dependencies, and other similar files.
any idea how much time it takes to build a wasm binary on a GitHub Actions runner? On my machine it only takes a few minutes.
It used to take 6 hours and crash the CI worker, but that was before we separated all the dependent libraries into their own builds. I'm pretty curious how fast it would be on a CI box – would you be interested in exploring this?
@adamziel
Here are the results of building the WASM binaries on GitHub runners: https://github.com/trywpm/wordpress-playground/actions/runs/22171069317/job/64108877062
I also ran the same build on a self-hosted runner with 8 vCPUs. On the default runners, the build takes about 10 minutes, whereas on the self-hosted runner it completes in just 3 minutes and 30 seconds.
Here’s what I suggest we do:
- Remove the WASM binaries from the repository and include them only in npm releases so that the published packages remain self-contained.
- Automatically generate the WASM binaries in CI runners when a rebuild is required, based on file changes in `pull_request` or `push` events.
- Store the generated WASM binaries in an S3 bucket (or as CI artifacts) and retrieve them as needed. We can also add an npm command to handle syncing during local development.
This is similar to the workflow I use for wpm. For example, here’s how the workflow posts a PR comment with updated binaries for testing: https://github.com/trywpm/cli/pull/163#issuecomment-3812540920
And here’s how the binaries are uploaded to S3 in CI: https://github.com/trywpm/cli/blob/6a2f762a1e3c5fab57a05e90f9eeee2ef213cbf8/.github/workflows/ci.yml#L105-L110
Additionally, we could speed up the build stage by caching multi-stage Docker builds and potentially adopting docker bake to manage build operations more efficiently.
@thelovekesh @adamziel If we can run the builds on the CI now, then I think making WASM packages will serve us much better than LFS (especially considering the limits that @brandonpayton shared). I think that GitHub Packages could be the right storage for this, but I actually wonder if we could start even simpler. What if we:
Build WASM packages every time they are tagged and store them directly in the NPM packages.
This way, we could avoid implementing any mechanisms to detect "when a rebuild is required, based on file changes", and we would have no need for storage like GitHub packages or S3, as NPM would do it.
We could also decouple versioning of @php-wasm packages from @wp-playground, which would allow us to release them independently. What do you think?
@JanJakes there is a limit for GitHub Packages - https://docs.github.com/en/get-started/learning-about-github/githubs-plans - and I think that would be a constraint moving forward.
Build WASM packages every time they are tagged and store them directly in the NPM packages.
Yes this should be the expectation for releases marked for distribution.
This way, we could avoid implementing any mechanisms to detect "when a rebuild is required, based on file changes", and we would have no need for storage like GitHub packages or S3, as NPM would do it.
We also need binaries during active development, not just at the time of tagging/distribution. In that case, where will we store the wasm builds if needed?
@thelovekesh
@JanJakes there is a limit for GitHub Packages - https://docs.github.com/en/get-started/learning-about-github/githubs-plans - and I think that would be a constraint moving forward.
GitHub Packages usage is free for public packages (LFS is not).
We also need binaries while active development not just at the time of tagging/distribution. At that time, where we will store the wasm builds if needed?
Right, yeah, I see what I missed here. It's true that we can't avoid some change detection, be it for publishing a package or for caching. I do wonder how far caching could take us. That is, we would "build" the packages for every push, but the builds would be cached. If cache hits were reliable, then it would be a simple setup. But GitHub has limits on those too, and NX remote caching is another dependency that may not be worth it after all.
GitHub Packages usage is free for public packages (LFS is not).
Oh, thanks for pointing that out.
So with this you mean storing wasm binaries during development in GitHub Packages? In that case, the binaries will be saved as a tarball, and we'll need to download the package and unpack it.
I do wonder how far caching could take us. That is, we would "build" the packages for every push, but the builds would be cached
Not sure if it's a good idea. To store cache, you will need a deterministic cache key, and if that's not done properly, a wrong cache hit can occur. And as you mentioned, there are limits on cache storage, but we can overcome them using a workflow like https://github.com/ampproject/amp-wp/blob/develop/.github/workflows/cache-buster.yml
So with this you mean storing wasm binaries during development in GitHub Packages? In that case, the binaries will be saved as a tarball, and we'll need to download the package and unpack it.
I think it would be more like publishing a dev PHP-WASM NPM package in our GitHub NPM registry. Development packages would reference this one, while production would go to https://www.npmjs.com/. The advantage is this would work transparently everywhere, including dev machines, etc., but perhaps it adds unnecessary complexity. I don't know.
To store cache, you will need a deterministic cache key
This is equivalent to the problem of detecting whether a change occurred and a rebuild is needed, isn't it? We would need to solve the same problem for pushing to S3 or anywhere else as well.
This is equivalent to the problem of detecting whether a change occurred and a rebuild is needed, isn't it? We would need to solve the same problem for pushing to S3 or anywhere else as well.
I don't think so. It's relatively easy to trigger a workflow based on file paths - https://docs.github.com/en/actions/reference/workflows-and-actions/events-that-trigger-workflows#running-your-pull_request-workflow-based-on-files-changed-in-a-pull-request - rather than computing the sha of all files to create a cache key.
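For comparison, computing a deterministic content-based cache key is not much code either, equivalent in spirit to GitHub's `hashFiles()`. This is a hypothetical sketch, demonstrated on a throwaway directory rather than the real build inputs:

```shell
#!/bin/sh
set -e
# Hypothetical: hash the contents of all build-input files to get a
# deterministic cache key.
cache_key() {
  # sort -z makes the file ordering (and thus the key) deterministic
  find "$1" -type f -print0 | sort -z | xargs -0 sha256sum \
    | sha256sum | cut -d' ' -f1
}
# Demo on a throwaway directory standing in for the build inputs.
dir=$(mktemp -d)
echo 'int main(void){return 0;}' > "$dir/php_wasm.c"
echo "cache key: $(cache_key "$dir")"  # 64 hex chars, stable for same inputs
rm -rf "$dir"
```

The trade-off being discussed: this hashes every file's contents on each run, while path-based workflow triggers only need the list of changed files from the git diff.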
The thing that concerned me about GitHub Packages is the limits and at least, according to Claude, the limited ability to delete older packages.
https://gist.github.com/brandonpayton/bb684660fc3faf1d8a9de557a99c1a30#github-packages-limits-public-repos
Without the limits, I agree with @JanJakes that they sound ideal.
I wonder if we could work around the limits just by creating new package names per month or year.
After rereading the backscroll, I see that the GitHub Packages limits @thelovekesh linked to may be lower and more restrictive than I thought.
In considering all the options, I was hoping we could find a solution that would work generally for all forks on GitHub, but this is a tricky issue.
This is where I was hoping we might be able to use GitHub releases, along with the fact that NPM dependencies can be links to tarballs. If release storage is truly unlimited, then maybe we could create new releases every time we wanted to make a build and link to them in our package.json.
@brandonpayton
The thing that concerned me about GitHub Packages is the limits and at least, according to Claude, the limited ability to delete older packages.
You can delete any public package that doesn't have more than 5000 downloads. I think we'll never get anywhere close to that. There are GitHub Actions for package cleanups, including an official one.
After rereading the backscroll, I see that the GitHub Packages limits @thelovekesh linked to may be lower and more restrictive than I thought.
This doesn't apply to public packages.
The more I read about GitHub Packages, the more it seems to me to be the right tool for the job. With a cleanup job, we don't even need any workarounds. We could consider splitting them per minor version (7.4, 8.0, ...), but I think even that's not necessary. A cleanup job will do the trick.
I don't think so. It's relatively easy to trigger workflow based on file paths - https://docs.github.com/en/actions/reference/workflows-and-actions/events-that-trigger-workflows#running-your-pull_request-workflow-based-on-files-changed-in-a-pull-request, rather than computing the sha of all files to create a cache key.
@thelovekesh Good point. The hashFiles function may offer the same syntax (glob patterns), but I suspect the workflow triggers are much more efficient because they can simply read the changed files from git diff.
This doesn't apply to public packages.
@JanJakes ah! That's great. I was understanding the overall docs to mean "it is free, but there are limits on the free thing". This looks promising.
This can be assigned to me as I am writing a workflow to move wasm builds to github packages.
I have free time from dot-com in the upcoming days so that I can take care of this if needed.
I have free time from dot-com in the upcoming days so that I can take care of this if needed.
That would be amazing, @zaerl.
This can be assigned to me as I am writing a workflow to move wasm builds to github packages.
🤦 Sorry, @thelovekesh, I missed your comment before @zaerl's.
Do you have any sense about how much time you have to invest in this in the near term? I had just been talking with @zaerl about focusing on this full-time, but having your workflow for moving the builds to GitHub Packages would be great.
@brandonpayton @zaerl please move forward with your work. My workflow is nowhere near something working that I can share. Please feel free to ping me for code review or architecture discussions.
@brandonpayton @zaerl please move forward with your work. My workflow is nowhere near something working that I can share. Please feel free to ping me for code review or architecture discussions.
Thank you very much. 👍