ChezScheme
Move Boot Files to Releases?
In order for ChezScheme to compile, it needs (comparatively) large binary files to bootstrap the system first. Right now these files are included in the git repo itself and are updated frequently. Because of the way git handles binary files, this is causing the full git repo to balloon in size (at last count I saw the .git directory occupying almost a full gigabyte). As the repo is now hosted on GitHub, would it be feasible to move over to using GitHub's binary release system?
https://help.github.com/articles/distributing-large-binaries/
This would allow the git repo to remain binary-file-free while still making it feasible to compile the system from scratch.
If something like this is not done then the git repo runs the risk of becoming unmanageable in size as it would effectively contain a full backup of every binary for every system ever released.
This would indeed be useful.
It's worth looking into, because having boot files in the repo is definitely a problem, and we haven't yet come up with a different solution. That would force us to create new releases every time the uploaded boot files need to be rebuilt, which is less often than we currently rebuild them.
If this is pursued, I would recommend updating the build process so it can compile using the system-installed ChezScheme files (for people who have already built it once). Or at least make that the default if it is already possible (the last time I recompiled, I remember it still needing the boot files instead of using my already-installed version; mine could have been too out of date or something, though).
One caveat to this is that to my knowledge the release archives git creates do not include the git repo itself, as in the .git directory.
@akeep and I have discussed several potential solutions for this issue, none of them ideal, and none of them fully tested. Comments and corrections on these potential solutions and suggestions for others are welcome.
(1) Leave the full set of boot files in the main repository. We want to store the full set of boot files somewhere, and the main repository is the obvious place. Encourage people to clone with --depth 1 if they don't need the full repository.
PROS: This is the simplest solution. With --depth 1, the time to clone and the amount of space are both small.
CONS: The full repo is still large, and a full clone is slow and requires a lot of disk space, problems that will only get worse.
(2) Prune older boot files from the repository and archive them elsewhere. We'd probably keep older boot files for a specific machine type, say a6le, since a build for any machine type can be used to build boot files for the other machine types.
PROS: This would keep the repository size reasonably small and doesn't require people to use --depth 1 to get a reasonably quick and small clone.
CONS: This solution is more work for the committers and requires them to modify older commits, which seems unclean and dangerous. It complicates the build of older commits. We'd have to find somewhere else to store older boot files.
(3) Use github's large-file support for boot files.
PROS: This reduces the size of a clone.
CONS: This doesn't reduce overall repository size. It requires the use of a different tool chain, and a normal git clone will not produce a buildable clone.
(4) Create a separate project for the boot files, treat it as a submodule, and have the configure script use --depth 1 when it does the submodule init. (Thanks to @jamtaylo for this suggestion.)
PROS: This reduces the size of the main repo to the space required just for the source code history, and it reduces the size of a simple (non-recursive) clone.
CONS: This is a bit more work for the committers. A recursive clone of the main repo without --depth 1 will cause the entire boot-file repo to be cloned.
(5) As suggested by @ultimatespirit in this issue, create a release each time the boot files change, and upload boot files only as part of a release. Modify the build process either to require the appropriate release to be installed or to download and build the appropriate release as part of building the current version.
PROS: This reduces the size of the main repo to the space required just for the source code history, and it reduces the size of a clone. It reduces the chance that existing and possibly incompatible object files are used with a newer or older version.
CONS: This is more work for the committers and complicates the build scripts and process. It significantly increases the size of the repo as a whole, since the releases are required for build and are effectively part of the repo.
Aside from these ideas, we have come up with a couple of ways to reduce the number of boot files:
(A) Don't store boot files for the thread versions, since these can be created from the non-thread versions. This cuts the boot-file storage requirements in half but complicates the build process.
(B) Don't create new boot files as long as the existing set can still be used to compile the sources. At present, we typically create new boot files after any change in the s directory. This requires some cleverness on the part of the committers to recognize when a change is not important. For example, @akeep chose not to create new boot files for the fx+/carry and company fixes since these routines are not used in the compiler. This is error-prone since fx+/carry might be used by the compiler in a subsequent commit.
(C) Store only the boot file for one machine type, say a6le. This would effectively eliminate the boot-file storage problem but would require all cloners to have a Linux box or VM sitting around to build boot files. We did this before Chez Scheme was open-sourced, but only a handful of us were affected. (Don't bother commenting on this option, since it's not a serious suggestion, just mentioned for completeness.)
"(B) Don't create new boot files as long as the existing set can still be used to compile the sources."
For what it's worth, I agree with this.
"This requires some cleverness on the part of the committers"
I have been bitten by this many times. But now you can use Travis CI to build a test boot file using a new boot file at every repository push (it increases job execution time, but life is hard).
Is it possible to have non-binary boot files?
@xaengceilbiths
Yes, it is possible to have non-binary boot files. In fact, there are a couple of different ways to do that.
The simplest way to do this would be to take the existing binary files and encode them as text (imagine using something like uuencoding, though we would likely want to define our own encoding). The hope here is that the binary files are relatively stable and the encoding of the binaries is also relatively stable, so the changes, and hence the differences stored in the repository, are minimized. @dybvig and I experimented with this, but found, after trying a few encodings, that this style of text file was not helpful in reducing the size of the binary differences. In fact, if I recall correctly, it pretty much made things worse, which was a bit disappointing.
Another way to do this would be to have an intermediate representation of the compiled code that is represented as text, which Chez Scheme's run time can either interpret or finish the compilation of before running. This is a considerable amount of work and has some challenges. We would likely need to rework the machine-type-specific assemblers (which are written in scheme) so that the same type of work can be done without having the scheme binary around. If these files are machine-independent we would also need to capture things like the machine-specific foreign function interface code (also written in scheme) in the C run time. If these files are machine-dependent, then we would still need the same array of them and the text representation is going to be larger than the binary representation, so the working checkout would be larger, though hopefully the differences would be smaller.
So, I think a textual representation is a good idea, but getting there is a bit of a challenge.
Here's another alternative. You can store each boot image along with a binary diff that will update it to the latest image. The bsdiff tool (http://www.daemonology.net/bsdiff/) is really good at creating such diffs. I did an experiment:
$ git show 303921d8515:boot/a6le/petite.boot > petite-current.boot.gz
$ git show c9c45641cc5:boot/a6le/petite.boot > petite-previous.boot.gz
$ gunzip petite-*.boot.gz
$ bsdiff petite-previous.boot petite-current.boot petite.boot.bsdiff
$ ls -lh petite.boot.bsdiff
-rw-r--r-- 1 weinholt weinholt 6.9K Nov 29 19:48 petite.boot.bsdiff
The diff between the current and previous images is a mere 6.9K, which is easily stored in git. I'm not sure if this is a representative result, but it looks promising.
The problem isn't storing binary files as such, but rather that successive versions of a binary file can look incredibly different to diff tools. It seems bsdiff is designed to be very efficient at diffing binary files, so the diff between two versions would be small. However, the question then becomes: what is the diff between two diffs like? That is, given N boot images you would have to store N-1 diffs; if the diffs differ significantly from one another, you end up storing each diff individually in the repo instead of the deltas between them. Have you tested the difference between multiple bsdiffs?
@ultimatespirit Do you mean doing a bsdiff of two bsdiffs? I'm guessing it will not give good results, since the bsdiffs are compressed. One would indeed have to store multiple diffs in some way, perhaps with a reset of the base boot image every now and then. You'd probably store (bsdiff base base+1), ..., (bsdiff base+N-1 base+N). With luck, if the 6.9K result is typical, then with bsdiff you'd be storing a diff which is in the same order of magnitude as the code changes themselves (although one per machine type).
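The chained layout described above (a base image plus one bsdiff per step) can be sketched roughly as follows. This is only an illustration under stated assumptions: the files boot.0 through boot.3 are hypothetical random stand-ins rather than real boot images, each "release" is simulated by a tiny append, and the script skips itself when bsdiff and bspatch are not installed. It says nothing about how large real boot-file patches would be.

```shell
set -eu
result=skipped
if command -v bsdiff >/dev/null 2>&1 && command -v bspatch >/dev/null 2>&1; then
  work=$(mktemp -d)
  cd "$work"

  # Hypothetical stand-in for a base boot image (random bytes, not a real boot file).
  head -c 100000 /dev/urandom > boot.0

  # Simulate three later images and store only the patch for each step:
  # (bsdiff base base+1), (bsdiff base+1 base+2), ...
  for i in 1 2 3; do
    prev=$((i - 1))
    cp "boot.$prev" "boot.$i"
    printf 'change %s\n' "$i" >> "boot.$i"
    bsdiff "boot.$prev" "boot.$i" "patch.$i"
  done

  # Rebuild the newest image from the base plus the chain of patches.
  cp boot.0 rebuilt
  for i in 1 2 3; do
    bspatch rebuilt rebuilt.next "patch.$i"
    mv rebuilt.next rebuilt
  done

  if cmp -s rebuilt boot.3; then result=ok; else result=mismatch; fi
  ls -l patch.1 patch.2 patch.3
fi
echo "bsdiff chain round trip: $result"
```

Only boot.0 and the patch files would need to be stored in this scheme; resetting the base every now and then, as suggested above, bounds the number of patches that have to be applied.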
Why not create a scheme->c translator for chez scheme? (just joking, or not) then there would not be need for boot files :)
Another possibility, albeit highly unlikely to be practical, would be to convert the compiler code / a micro compiler to LLVM IR and distribute that to be bootstrapped into the first scheme compiler to finish the rest of the compilation with.
A big pro to this would be that LLVM IR is platform independent so we would not need to have multiple platform binary versions (so long as LLVM itself supports the platform).
Of course this would rely on a hope that the LLVM IR would change little between versions or at least be significantly smaller than the compiled binaries, otherwise it would only replace the problem with something else.
Out of curiosity, is it known what minimal feature set of (chez) scheme would be required to bootstrap the compiler? If it isn't too large a c / llvm microcompiler to begin bootstrapping from may not be too far fetched. If it's large perhaps boot strapping from older versions up would work (though not very practical I know).
Unfortunately LLVM IR is not platform independent or portable.
I installed Git Large File Storage, and it integrated seamlessly with git.
I then ran git lfs migrate import --include="*.boot" --include-ref=refs/heads/master to rewrite the master branch using LFS.
I pushed the result to https://github.com/burgerrg/ChezScheme.
Running git clone on this repository downloaded a mere 40 MB and still allows me to go back in time and retrieve the boot files.
@akeep, does TravisCI work with GitHub's large file support? It looks as though it's supported by default on Linux machines, but requires an install for macOS. I followed the example and got it to work on macOS, but it broke all the Linux builds because they don't have Homebrew.
It looks like it is semi-supported. It is now enabled by default on the Linux images, but has to be installed via Homebrew on the macOS images, so we'll need to update the travis-ci config to set this up on the Mac builds before we build. There is some mention of authentication here, but I'm not sure what it refers to; we might need to experiment with it a bit to get it working.
I like the idea of a solution like this, as long as installing git lfs isn't deemed to be too big a bar to get over. Certainly, I like the idea of this over building our own support for something like this.
-andy:)
I figured out how to get it to work by updating .travis.yml with the following:
before_install:
- if [ $TRAVIS_OS_NAME = osx ]; then brew install git-lfs; fi
- if [ $TRAVIS_OS_NAME = osx ]; then git lfs install; fi
before_script:
- if [ $TRAVIS_OS_NAME = osx ]; then git lfs pull; fi
I didn't do anything about GitHub authentication, and it seems to be working.
Installing git-lfs is easy on Linux, macOS, and Windows. The documentation online is easy to follow.
What other things should we investigate to determine if we want to use git-lfs for Chez Scheme?
I deleted my copy because it exceeded the 1 GB GitHub LFS free limit.
The real problem with git-lfs is that the way GitHub has built their infrastructure and accounting makes it unsuitable for public repos. To wit: "Pushing large files to forks of a repository count against the parent repository's bandwidth and storage quotas, rather than the quotas of the fork owner." [source] I don't know who thought that would ever be an ok thing for a public repository that anyone on GitHub can fork.
For what it's worth, cloning with --depth=1 is working out well for me so far. IMO, the biggest problem with it is that you have to know to do it before you clone. We could add a prominent notice to that effect at the top of the README until / unless we decide to do something with the boot files besides maintain the status quo.
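To make the effect of --depth=1 concrete, here is a minimal local sketch. It does not touch the real ChezScheme repository: the "upstream" repo, the file name fake.boot, and the demo identity are all made up for illustration, and random bytes stand in for boot files because they compress about as badly.

```shell
set -eu
work=$(mktemp -d)

# Throwaway "upstream" repo whose history holds five revisions of an
# incompressible binary file (a hypothetical stand-in for a boot file).
git init -q "$work/upstream"
cd "$work/upstream"
for i in 1 2 3 4 5; do
  head -c 200000 /dev/urandom > fake.boot
  git add fake.boot
  git -c user.name=demo -c user.email=demo@example.invalid \
      commit -q -m "boot files, revision $i"
done
cd "$work"

# A file:// URL makes git use its normal transport, so --depth is honored
# (for a plain local path git ignores --depth and warns).
git clone -q "file://$work/upstream" full
git clone -q --depth 1 "file://$work/upstream" shallow

full_commits=$(git -C full rev-list --count HEAD)
shallow_commits=$(git -C shallow rev-list --count HEAD)
full_kb=$(du -sk full/.git | cut -f1)
shallow_kb=$(du -sk shallow/.git | cut -f1)
echo "full:    $full_commits commits, $full_kb KB in .git"
echo "shallow: $shallow_commits commits, $shallow_kb KB in .git"
```

The shallow clone carries only the newest revision of the binary, which is why its .git directory stays small; the trade-off, as noted above, is that you have to ask for it at clone time.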
I may be going insane, but I think it would be great if Chez can be bootstrapped from a smaller Scheme interpreter implementing, say, R5RS + R6RS library + syntax-case so that we can bootstrap without bootfiles.
Realistically, porting pb (portable bytecode) from Racket's fork seems reasonable.
Closing, since pb is merged, and since we revisited the options for that merge and stayed with pb files as checked in.
FWIW, I find that --filter=blob:none for a partial clone works much better than --depth 1 for a shallow clone. Partialness seems to be stickier and works better when pulling and switching branches. The README now recommends --filter=blob:none.
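A rough local sketch of what a partial clone keeps and what it leaves behind, under the same kind of assumptions as any toy example: the "upstream" repo, fake.boot, and the demo identity are hypothetical, and the two uploadpack settings are only needed because this stand-in server is a plain local repository that has to opt in to filtered fetches.

```shell
set -eu
work=$(mktemp -d)

# Throwaway "upstream" repo with five revisions of a binary file
# (a hypothetical stand-in for a boot file).
git init -q "$work/upstream"
cd "$work/upstream"
for i in 1 2 3 4 5; do
  head -c 200000 /dev/urandom > fake.boot
  git add fake.boot
  git -c user.name=demo -c user.email=demo@example.invalid \
      commit -q -m "boot files, revision $i"
done
# Let this local "server" accept filtered fetches and on-demand blob requests.
git config uploadpack.allowFilter true
git config uploadpack.allowAnySHA1InWant true
cd "$work"

# Partial clone: all commits and trees, but blobs are fetched only when needed
# (here, the ones required to check out HEAD).
git clone -q --filter=blob:none "file://$work/upstream" partial

partial_commits=$(git -C partial rev-list --count HEAD)
# Objects reachable from HEAD that were not downloaded are printed with a leading "?".
missing_blobs=$(git -C partial rev-list --objects --missing=print HEAD | grep -c '^?' || true)
echo "partial: $partial_commits commits, $missing_blobs blobs not yet downloaded"
```

Unlike a shallow clone, the full commit history is present, so log, pull, and branch switching behave normally; older boot-file blobs are simply fetched later if a checkout ever needs them.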