# RFC: Source distribution
I think we need to reconsider our model for the distribution of input tarballs/distfiles into live-bootstrap.
## State of play
We have three "distinct"-ish sections of the bootstrap in this area, each of which has been treated with somewhat different requirements.
- pre-networking. Before networking is available, all distfiles must be pre-loaded onto the system.
- pre-SSL. Once networking is available, we immediately build curl, so we have the option (if `--external-sources` is off) to download sources within the bootstrapped system. However, we cannot access HTTPS sites at this point, as we don't have SSL support. Therefore, all distfiles in this stage must be available over HTTP (non-SSL).
- post-SSL. At this point, we have curl with SSL support, so we can get distfiles over HTTPS.
And we are currently using two ways to get distfiles:
- HTTP
- HTTPS
Note: some distfiles are effectively an endpoint running on-demand, or serving a cached, git archive.
Here are some "de-facto" rules we have been using:
- HTTP and HTTPS are allowed in the pre-networking and post-SSL stages.
- Only HTTP is allowed in the pre-SSL stage.
- A non-#bootstrappable/bootstrappable.world source must be available for each distfile.
  - This has proved particularly challenging in the pre-SSL stage, where there are often few HTTP-only sites available, and for Git snapshots, which are quite unreliable (currently, Gnulib is a problem).
## Ideas/Questions/Proposals
Proposal: Do not require an HTTP-only, non-#bootstrappable source for each distfile in the pre-SSL stage.
Currently: We need an upstream source, a mirror, or archive.org, hosted over HTTP, for every distfile in the pre-SSL stage.

Suggestion: Host them ourselves on an HTTP-enabled server. This is OK, because the file will have the same checksum as the upstream anyway. Furthermore, once SSL is available, it is easy to check that the file from upstream also matches the checksum.
Problems:
- We control both the checksum and the distfile, so malicious changes could be easily slipped in.
- Mitigation: It is easy to check that the distfile is equivalent, using checksums.
- Mitigation: See proposal below regarding mirror network.
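Once SSL is available, the upstream cross-check is cheap to script. A minimal sketch (helper names are hypothetical, not actual live-bootstrap code) of streaming a distfile through SHA-256 and comparing it against the recorded checksum:

```python
import hashlib

def sha256_of(path, bufsize=1 << 16):
    """Stream a file through SHA-256 so large tarballs need not fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def verify_distfile(path, expected_sha256):
    """True when a (mirror-served) distfile is byte-identical to the file
    whose checksum was recorded -- recompression or tampering both fail."""
    return sha256_of(path) == expected_sha256
```

The same helper works whether the file came from upstream or a mirror, which is exactly why hosting copies ourselves does not change the trust picture as long as checksums are pinned independently.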
Proposal: Create git snapshots ourselves using git archive and distribute them ourselves, instead of using Git snapshots from cgit/gitweb/GitHub/similar.
Currently: If we need a particular Git commit, we download a snapshot of it from something like cgit, gitweb or GitHub. These tend to be unreliable or just randomly disappear (see Gnulib). Further, no-one is checking that the files are the same in the Git repository as they are in the generated snapshot.
Suggestion: git archives are created in a scripted manner, and distributed by us. Also, investigate building Git in the bootstrap process, then we can just git clone directly.
Problems:
- We control the distfile, so malicious changes could be easily slipped in.
- Mitigation: Create it using a script, so anyone can validate the work, as `git archive` is reproducible.
- Mitigation: If `--external-sources` is used, `git clone` the repository instead and create the tarball as a part of `rootfs.py`.
- Mitigation: See proposal below regarding mirror network.
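As a sketch of what the scripted snapshot creation could look like (the helper and layout are hypothetical, not the actual live-bootstrap tooling): `git archive` takes its timestamps from the commit rather than the checkout, so with a fixed `--prefix` the same git version yields byte-identical output that anyone can regenerate and compare.

```python
def snapshot_cmd(repo_dir, commit, name):
    """Build the argv for a reproducible `git archive` snapshot.

    The archive's timestamps come from the commit, not from checkout time,
    so (for a given git version) anyone running this command against the
    same commit gets a tarball with the same checksum.
    """
    return [
        "git", "-C", repo_dir, "archive",
        "--format=tar.gz",
        f"--prefix={name}/",          # fixed top-level directory name
        f"--output={name}.tar.gz",
        commit,
    ]
```

A validator would run the same command against a fresh clone and compare checksums with the distributed tarball.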
Proposal: Begin a mirror network.
Currently: We use almost exclusively upstream sources for distfiles.
Suggestion: Pull (somewhat?) randomly from a global mirror network for distfiles, with each mirror controlled by different people. Each mirror would not mirror a #bootstrappable-controlled server, but would mirror upstream files. For the previous Git proposal, each mirror would generate its own git archive snapshots. This makes it nearly impossible for a single internal bad actor to change both a distfile and its related checksum within live-bootstrap.
Questions:
- How do we bootstrap the (ever-changing) mirror list?
- Suppose that for a particular distfile, an upstream source is sufficient (e.g. we are in the post-SSL stage, and are downloading a HTTPS-hosted distfile). Do we prefer the upstream source, or mirrors?
- Benefits of upstream source: Trust? Consistency? Puts less load on the mirror network?
- Benefits of mirrors: Puts less load on the upstream source?
Note that I still plan to eliminate the "pre-SSL, post-networking" stage, and switch to exclusively HTTPS downloads - my expectation is still that ISPs won't allow non-SSL traffic to pass through their networks for too long. Expect random RSTs injected into plain HTTP streams, or straight up blocking port 80, similar to how almost all ISPs block port 25 inbound, and many also outbound.
Also, we need to support a mode where rootfs.py gathers a copy of all files locally, and then spawns its own server for the bootstrap machine to download from. This is so that the bootstrap machine can be isolated from the Internet, and not get exposed to packets sent by untrusted sources, which might try to exploit some kernel-level vulnerability to compromise the bootstrap.
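A hedged sketch of what such a mode could look like in rootfs.py's language, using only the Python standard library (names are hypothetical; the real integration would differ):

```python
import http.server
import threading
from functools import partial

def serve_distfiles(directory, port=0):
    """Serve a pre-gathered distfiles directory to the (otherwise
    network-isolated) bootstrap machine. port=0 picks a free port;
    the chosen port is available as srv.server_address[1]."""
    handler = partial(http.server.SimpleHTTPRequestHandler,
                      directory=directory)
    srv = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    return srv
```

Binding to a host-only interface keeps the bootstrap machine off the Internet entirely while preserving the in-bootstrap download flow.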
> Also, we need to support a mode where rootfs.py gathers a copy of all files locally, and then spawns its own server for the bootstrap machine to download from. This is so that the bootstrap machine can be isolated from the Internet, and not get exposed to packets sent by untrusted sources, which might try to exploit some kernel-level vulnerability to compromise the bootstrap.
What is the benefit of this over, say, an --external-sources mode? Let's suppose that we had support for splitting the set of distfiles across multiple disks if that was a problem.
> - How do we bootstrap the (ever-changing) mirror list?
> - Suppose that for a particular distfile, an upstream source is sufficient (e.g. we are in the post-SSL stage, and are downloading a HTTPS-hosted distfile). Do we prefer the upstream source, or mirrors?
I guess from mirrors, so that there's less strain on upstream. It's all checksummed anyway.
But this whole distributed mirror network sounds a lot like reinventing DHT from Bittorrent. Hence the question arises, can you reuse that?
> But this whole distributed mirror network sounds a lot like reinventing DHT from Bittorrent. Hence the question arises, can you reuse that?
Hmm, I am not familiar with that, but it seems promising! More research required...
> What is the benefit of this over, say, an `--external-sources` mode? Let's suppose that we had support for splitting the set of distfiles across multiple disks if that was a problem.
If we can make --external-sources reliable in terms of picking the correct disk for running the bootstrap on, vs. reading sources from (or even do it on a single disk with partitioning), then it may be OK too.
My opinions, as a bootstrap enthusiast but someone who hasn't yet contributed all that much:
### Mirrors are necessary
Upstream sources, particularly for older software, are not reliable. Every now and then I try turning off substitutes in my Nix and Guix configs, and I am always disappointed by files that have gone 404 without anyone noticing, or have changed hash, or whatever else. I don't think there's any reproducible bootstrap without us mirroring upstream sources.
### HTTPS is inevitable
I agree with @Googulator that HTTPS is inevitable. I think we are in that world already, and it didn't even need ISP shenanigans: I just reported a bootstrap failure to #477 where libtool-2.4.7 failed to download under Linux because the request was unconditionally redirected to HTTPS before we had a curl that could handle it.
### Seed files are suspect
How do we know that we've put a true copy of builder-hex0 or any other seed file onto the boot device, given that it's come from an untrusted machine? Ideally, we should be building a checksumming program very early in the bootstrap, using it to check all the files used in the bootstrap thus far, and then using it to check any files from the root before we touch them.
This creates a tension: @Googulator proposes to minimise the HTTP-but-not-HTTPS phase of the bootstrap. This probably means shifting more source tarballs to pre-networking seed files. In addition to the hashing considerations above, I worry that relying too heavily on HTTPS introduces time bombs into the bootstrap process. If cipher suites change and the ones we can easily bootstrap into are deprecated and removed, the bootstrap is sunk. This is not theoretical: Guix has a similar issue where it currently cannot complete a bootstrap without substituters because of certificate time bombs in openssl-1.1.1l.
As for @fosslinux's proposals:
### Re: "Do not require a HTTP-only, non-#bootstrappable source"
I think this is regrettable, but necessary. Historic mirrors are simply too unreliable. Whatever scripts we use to acquire sources should cross-check against upstream if available.
### Re: "Create git snapshots ourselves using git archive"
Is git archive deterministic? (As in, does git archive from a checkout of a given ref always result in byte-for-byte identical output?) It may be necessary to extract objects from a git snapshot in a particular order so that the archives are built the same way each time, and two people on different machines (and possibly even OSes) can build the same source snapshot with the same checksum.
This heuristic (requiring a deterministic process from a repo snapshot) also looks applicable to other DVCSes.
### Re: "Begin a mirror network"
As above: regrettable but likely necessary. We reduce the risk of the #bootstrappable project becoming a single point of compromise (of itself) if we encourage and support the various upstreams to participate in existing mirror networks, instead of a #bootstrappable network of mirrors. Many ISPs, universities, etc, mirror FOSS projects.
If upstreams are not sympathetic, then taking on the mirroring ourselves will be necessary.
### On Bootstrapping (ha) the Mirror List
Could we just keep a file up-to-date in a git repo? If the download script fetched 1/N of each file from N mirrors (using byte-range fetches, and choosing the N mirrors from the list at random), and the resultant file passed the checksum check, you have a good guarantee that the mirror is faithfully replicating the content.
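To make the piecewise-fetch idea concrete, here is a sketch (hypothetical helper; real code would then issue an HTTP `Range: bytes=start-end` request per assignment). Because the reassembled file must still pass the whole-file checksum, a mirror serving any corrupted piece is detected:

```python
import random

def assign_ranges(size, mirrors, n):
    """Split a size-byte file into n contiguous byte ranges and assign each
    to a distinct, randomly chosen mirror (requires n <= len(mirrors)).
    Returns (mirror, start, end) tuples with inclusive ends, matching the
    HTTP Range header convention."""
    step = -(-size // n)  # ceiling division
    picks = random.sample(mirrors, n)
    return [
        (picks[i], i * step, min((i + 1) * step, size) - 1)
        for i in range(n)
    ]
```

Any single dishonest mirror can only corrupt its own piece, which breaks the final checksum and identifies the faulty range.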
### On BitTorrent
It seems like BitTorrent will do a lot of what we want: it would let people join and leave the mirroring of bootstrap seeds, as well as let them find other sources for their bootstrap seeds. One obvious concern: we'd have to think about how often to re-issue the .torrent file.
BEP 39 (BEP = BitTorrent Enhancement Proposal) provides support for updatable torrents, which might help seeders keep up to date with the latest versions of the bootstrap seeds.
There are also two BEPs that allow torrent clients to use HTTP sources if other peers are not enough:
- BEP 17 ("HTTP Seeding") documents an HTTP endpoint for requesting individual pieces of a torrent; and
- BEP 19 ("Web Seeds") documents how to encode within a `.torrent` file that its contents are available over HTTP. Note that this essentially requires a web seed to contain the entire torrent's contents as a directory on the server; it won't let us define a torrent that identifies distinct HTTP servers for individual files.
It doesn't appear that you could use existing BT clients to verify that what's on the HTTP seeds matches what's in the swarm. That's something you'd have to build yourself.
If we do provide our own HTTP mirrors (which we probably should, even if we can get upstreams interested in mirroring), coming back later and publishing a torrent with web seeds should not be difficult. It could even be regenerated whenever there's a "major release" of the bootstrap, so that seeders are kept up-to-date.
> Is git archive deterministic? (As in, does git archive from a checkout of a given ref always result in byte-for-byte identical output?) It may be necessary to extract objects from a git snapshot in a particular order so that the archives are built the same way each time, and two people on different machines (and possibly even OSes) can build the same source snapshot with the same checksum.
Yes, for the same git version. (https://github.blog/changelog/2023-01-30-git-archive-checksums-may-change/) (https://github.com/git/git/commit/4f4be00d302bc52d0d9d5a3d4738bb525066c710)
> Could we just keep a file up-to-date in a git repo? If the download script fetched 1/N of each file from N mirrors (using byte-range fetches, and choosing the N mirrors from the list at random), and the resultant file passed the checksum check, you have a good guarantee that the mirror is faithfully replicating the content.
This is the easiest solution I think, but my greatest concern is increasing your aforementioned "risk of the #bootstrappable project becoming a single point of compromise (of itself)"
> How do we know that we've put a true copy of builder-hex0 or any other seed file onto the boot device, given that it's come from an untrusted machine? Ideally, we should be building a checksumming program very early in the bootstrap, using it to check all the files used in the bootstrap thus far, and then using it to check any files from the root before we touch them.
We already checksum every single tarball, in a bootstrapped fashion. (For more details, see parts.rst; but we have builder-hex0 + stage0-posix, and part of stage0-posix is mescc-tools-extra, which contains a checksumming program. Thus from that point onward we are safe in that regard.)
> We reduce the risk of the #bootstrappable project becoming a single point of compromise (of itself) if we encourage and support the various upstreams to participate in existing mirror networks, instead of a #bootstrappable network of mirrors. Many ISPs, universities, etc, mirror FOSS projects.

> Whatever scripts we use to acquire sources should cross-check against upstream if available.
Yeah, agreed on both these counts. To clarify my thoughts a bit more on the second point:
- for mirrors acquiring sources: they should only get their sources from upstreams, not from other mirrors
- for end users of live-bootstrap acquiring sources: generally, default to upstream, and use mirrors where it is infeasible/impossible
> Is `git archive` deterministic?
>
> Yes, for the same git version.
That should be enough to get started as a way to snapshot upstream. It would be ideal to have a tool that tried extremely hard to be deterministic here, but that seems fine to defer until future work.
> We already checksum every single tarball, in a bootstrapped fashion.
Very cool. It seems pretty hard to slip something in before mescc-tools-extra.
> - for mirrors acquiring sources: they should only get their sources from upstreams, not from other mirrors
> - for end users of live-bootstrap acquiring sources: generally, default to upstream, and use mirrors where it is infeasible/impossible
Agree with these points, but it would be cool to strengthen the second: users could have the ability to fetch from mirrors and cross-check against (ideally) upstream or (if not) other mirrors. Enabling this by default would completely defeat the purpose of a mirror network (since every user would hit upstream), but you could request a random fraction of each file from another source to ensure they remained in sync.
Yeah, I would like to use mirror network more too (assuming we don't trust stuff from it before checksumming in some way). Perhaps a configurable option but I would prefer mirrors to be default.
Ok, fair points. We can make mirrors default.
Another point of discussion: I didn't really think of this when it came up originally, but now seems like a good time to revisit it. For packages such as bash-2.05b and bc-1.07.1, we have been using what appear to me to be recompressed upstream tarballs distributed by third parties, such as Fedora and Slackware, to save disk space in the early bootstrap.
I'm not totally sure the tradeoff is worth it. Obviously, recompression changes the checksum, and it is a layer of indirection from upstream that is pretty unverifiable.
I'd be for going back to upstream tarballs for those. At minimum, I would want our mirror network to do the recompression, rather than blindly trusting Fedora/Slackware there.
Unless you have very hard space requirements, I'd say that storage and RAM are both cheap enough to not worry. If you were bumping up against the addressing limits of 32-bit machines or something then we'd need to think more carefully but we already know that the bootstrap doesn't run on 2GB machines. Matching upstream is so much more important. Even if you do have stringent storage requirements, you'd need a deterministic and possibly bootstrappable compression program to crunch things down. RAM requirements would be more affected by uncompressed package sizes, anyway?
> Unless you have very hard space requirements, I'd say that storage and RAM are both cheap enough to not worry. If you were bumping up against the addressing limits of 32-bit machines or something then we'd need to think more carefully but we already know that the bootstrap doesn't run on 2GB machines. Matching upstream is so much more important. Even if you do have stringent storage requirements, you'd need a deterministic and possibly bootstrappable compression program to crunch things down. RAM requirements would be more affected by uncompressed package sizes, anyway?
@Googulator wanted to be able to fit initial bootstrap sources into 256 MiB (basically the largest existing chips that can be programmed manually without software)... But yeah, it should probably be a separate patch on top of live-bootstrap for doing something like that...
Here's a dumb but mayhaps worthwhile suggestion: Can a crude, non-validating, insecure implementation of TLS be bolted onto curl before openssl and the certificates are available? I think there isn't a whole lot to secure when everything obtained is checksummed anyway.
As a more serious suggestion, I think that supporting anything but --external-sources-like behavior isn't very worthwhile. Unless you have a completely reliable internet connection, letting the success of the build depend on network connectivity and external servers which may hiccup, for the hours that the build may take, is really annoying. I've personally had to pray for the wifi to come back on before the curl retries exhaust and the build stops. It's not fun. Get everything up-front, cache everything so it only needs to be obtained once, and maybe provide the option of running a local mirror (to cover the "my device doesn't have enough disk space" use case), but depending on the internet isn't a good idea.
> Here's a dumb but mayhaps worthwhile suggestion: Can a crude, non-validating, insecure implementation of TLS be bolted onto curl before openssl and the certificates are available? I think there isn't a whole lot to secure when everything obtained is checksummed anyway.
That's not nearly as dumb as it sounds... I'd be very open to considering that, but I am unfamiliar enough with the TLS protocol to be fairly unsure whether this is feasible, as we still need to do some kind of TLS negotiation + encryption. I feel like the only thing we'd avoid dealing with is validation, but I don't know that. There could be some "loophole" in the protocol that would let a client do that.
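For what it's worth, mainstream TLS stacks already expose an "encrypt but don't validate" mode (curl's `-k`/`--insecure`, or in Python terms below), so the protocol itself permits skipping validation; the open question is only how small a TLS implementation with acceptable cipher suites can be. A sketch of the idea, not a proposal for the actual bootstrap code:

```python
import ssl

def unverified_context():
    """TLS with a full handshake and encryption, but no certificate
    validation; integrity would still come from the distfile checksums."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx

# usage (hypothetical):
#   urllib.request.urlopen(url, context=unverified_context())
```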
> depending on the internet isn't a good idea
I find this difficult to justify; we need to obtain the sources somehow. The question is only whether we depend on the Internet within the bootstrap environment or outside of it, corresponding to non-`--external-sources` and `--external-sources` respectively. Currently the dichotomy is also "downloads are spread out" vs "all at once", but we could change that (see below).
> Unless you have a completely reliable internet connection, letting the success of the build depend on network connectivity and external servers which may hiccup, for the hours that the build may take, is really annoying.
Let's ignore transient problems that retries solve.
> I've personally had to pray for the wifi to come back on before the curl retries exhaust and the build stops
For problems like this, we could do downloads all at once in the bootstrap environment. For "external servers", the mirror network idea solves this.
The primary benefit of non-`--external-sources` mode is that over time, we should be able to reduce the bootstrap seed down to be very small, and source distribution would also be effectively bootstrapped. (Whether this is necessary is another discussion in itself.)
Some potential complications we have in freedesktop-sdk due to the reliance on git archive, if anyone is interested in this perspective: https://gitlab.com/freedesktop-sdk/freedesktop-sdk-binary-seed/-/issues/18#note_2374544756
Two high level points:
- I found it surprising that I needed to pass a flag to get a bootstrap which didn't access the internet and thus download things which can't easily be audited, given that the entire point of bothering to do this is being able to audit sources as part of the process. Granted, I didn't bother checking the sources that were downloaded before the bootstrap, but it was obvious that I could have done so.
- I find it surprising that, even after the huge git archive checksum fiasco, no one seems to checksum the tar itself and simply compress it in different ways (sure, checksum the compressed file too if that's convenient). It would be trivial to add a checksum of the tar to the list of available checksums in a release, and while tar isn't designed to be checksummed either, at least it's trivial to maintain a deterministic implementation and output over time.
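The "checksum the tar, not the compression" point can be made concrete: hashing the decompressed stream gives a checksum that survives recompression. A sketch, assuming gzip input (helper name is hypothetical):

```python
import gzip
import hashlib

def tar_sha256(path):
    """SHA-256 of the decompressed tar stream inside a .tar.gz. Recompressing
    the same tar (different compression level, tool, or gzip mtime) leaves
    this checksum unchanged, even though the .tar.gz bytes differ."""
    h = hashlib.sha256()
    with gzip.open(path, "rb") as f:
        while chunk := f.read(1 << 16):
            h.update(chunk)
    return h.hexdigest()
```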
In the long term, I think live-bootstrap should distribute patched and cleansed sources. With that in mind, local mirroring makes more sense right now.
Having archives with binary or generated files inside the build is undesirable. Also, unused code is undesirable (irrelevant stuff that needs to be audited because it's included in the build somehow).
Ideally, we should split the regenerations into two steps:
- The cleaning step. Removing things we don't want or don't use in the build process, patching, re-packaging.
- The regeneration step. This happens within the steps.
This separation also makes things easier for the builder-hex0 -> fiwix transition (fewer files, less memory, less chance of bumping into limits).
The local mirroring could evolve into step 1 gradually. Downloading everything locally, performing the cleaning step, re-compressing the clean sources and only then start building.
I can see the appeal of stripping out blobs and files which we must regenerate, but once we do that, our source packages deviate from the published upstream and their published hashes. Once we've done that, how can people trust them?
Hi @endgame, thanks for considering my idea!
It's a tradeoff, but I think it's the best one. We're already removing stuff from upstream precisely because we don't trust it.
In that scenario, anyone that only trusts archives from upstream will have to make arrangements to run the cleaning for themselves (it's fine, they just need to use our cleaning step).
The other way around is way murkier. We can't really prove that a binary file inside an archive is never used. Those files will be lurking there, offering a potential attack surface. For unmaintained versions (e.g. coreutils 5.0) that we partially build, I would go as far as removing unused files as well.
The other way around is way murkier. We can't really prove that a binary file inside an archive is never used.
But how could that binary file get silently used if the first step after unpacking is deleting it?
@stikonas we delete stuff from the build folder, not the archive. This behavior offers an opportunity for them to leak out.
For example, tcc-0.9.27 pass1.kaem re-extracts the contents of mes-0.2.7.tar.gz into its /build tree without regenerating the ~~nyacc~~ psyntax files. It's fine, I don't think they're used, but they're there, extracted and available.
Of course, we can inspect each step more carefully, avoid cross-step references, be tidier. However, nothing beats removing them from the build input entirely.
Edit: My example was referring to the incorrect regenerations, I meant psyntax, not nyacc.
I am hesitant to have live-bootstrap use modified distfiles for the reasons @endgame outlined. Furthermore, even if we used modified distfiles, I would want the process for creating those modified distfiles to be reproducible. At which point, why don't we just create the modified distfiles within the live-bootstrap environment?
At which point, why don't we just create the modified distfiles within the live-bootstrap environment?
Absolutely yes! That's the idea.
Today we have a main loop that iterates through the steps and does prepare, configure, compile, install for each.
We could setup this to have two main loops:
- Loop 1: Iterate over each step, do `prepare`, then recompress each source back.
- Loop 2: Iterate over each step and do `configure`, `compile`, `install`.
This offers some technical advantages:
- We can settle on a uniform decompression tool, reducing complexity during the build process.
- Reduction in fiwix.ext2 size by removing unused files.
- More opportunities for CI caching (both our own CI and people who use live-bootstrap in their builds somehow).
There's a licensing angle to all of this as well. We need to make our modifications available in source code form. Today, we distribute an automated tool that makes those modifications, which maybe is good enough, but maybe it isn't.
Some distros do this thing by setting up forks and their own repositories, but I'm not talking about doing that. It could be much simpler and based off what we already do.
> Loop 1: Iterate over each step, do `prepare`, then recompress each source back.
So we are on the same page, is your suggestion to do prepare outside of live-bootstrap, using tools external to those built by live-bootstrap? That seems to be implied based on what you're saying. Otherwise how is this possible?
My initial thought is that we must ensure that the only thing that is done is deletion. Justification:
a) The original source file is interchangeable (sans checksums) with the modified source file.
b) Any addition or modification requires a much greater level of trust in the tools performing the operation than deletion does.
c) In particular, regenerations of files (such as gnulib, bison, etc.) must not occur using tools outside those built by live-bootstrap.
My intuition is that would be okay (probably as an option) if integrated with the existing mirror system, for the reason that it is a simple, reproducible, highly auditable transformation of existing data. I do not see an attack vector if implemented as I suggest, but I would like other people to chime in on that.
Practically, this would not look the way you suggest, we would still need to have prepare, for things like patching, regenerating deleted files, etc, but there would be a previous step that occurs outside the live-bootstrap environment that deletes files from tarballs.
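A deletion-only transform is also easy to express and audit mechanically. A sketch (hypothetical helper, uncompressed tar for simplicity; auditing reduces to diffing member lists):

```python
import tarfile

def strip_members(src, dst, unwanted):
    """Copy a tar archive, dropping members whose names are in `unwanted`
    and changing nothing else. Because no bytes are added or modified,
    the result can be audited against the upstream archive by comparing
    member lists and per-member contents."""
    with tarfile.open(src) as tin, tarfile.open(dst, "w") as tout:
        for member in tin:
            if member.name in unwanted:
                continue
            fobj = tin.extractfile(member) if member.isreg() else None
            tout.addfile(member, fobj)
```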
> There's a licensing angle to all of this as well. We need to make our modifications available in source code form. Today, we distribute an automated tool that makes those modifications, which maybe is good enough, but maybe it isn't.
Well, I assume here you are referring to viral licenses, such as the GPL. They require source code to be distributed with binary distributions. We would be distributing either:
- modified source code, which is not a problem, because all OSI licenses permit this
- (programmatic) instructions to modify the source code, which is very clearly not a problem
We only "distribute" binaries in CI artifacts, which also include the mirror last time I checked, but I will double check this.
> So we are on the same page, is your suggestion to do prepare outside of live-bootstrap, using tools external to those built by live-bootstrap? That seems to be implied based on what you're saying. Otherwise how is this possible?
I'm thinking definitely inside the live-bootstrap repository, with tools built by live-bootstrap. One first build with unclean sources is necessary to set up the builds with clean sources. I think this first build can be partial.
This is within the expectations of current tooling. In order to trust python or curl, live-bootstrap needs a partial bootstrap build. The source cleaning would appear in this self-hosting step (bootstrapping the userland). Users who don't want to bootstrap userland can use their own tools.
> a) the original source file is interchangeable (sans checksums) with the modified source file. b) any addition or modification requires a much greater level of trust in the tools performing the operation than deletion.
If we agree that rm and patch are not unlike python in the sense that they are userland tools, then there's no reason to mistrust them.
> c) Particularly, regenerations of files (such as gnulib, bison, etc) must not occur using tools outside those built by live-bootstrap.
I agree!
> Practically, this would not look the way you suggest, we would still need to have prepare, for things like patching, regenerating deleted files, etc, but there would be a previous step that occurs outside the live-bootstrap environment that deletes files from tarballs.
To me, there's no such thing as outside the live-bootstrap environment! We always need a working system to prepare an image. Ideally, this working system was set up by live-bootstrap itself. Within this dependency loop, all the tooling I'm proposing to use (rm, patch, glob, tar, gzip) comes alive in the bash-2.05b step, very early.
Honestly, as I mentioned, I think of this as a long-term thing and more geared towards build systems than individual bare metal users. There are probably a lot of rough edges and details I'm missing.
Hmm, I need to clarify a few things (thanks for helping me think about what I need to include in developer documentation 😅)
When I say "within the live-bootstrap environment", I mean within the environment built up by live-bootstrap from the minimal binary seed (which could be as small as <1KB on bare metal for everything currently in scope). In other words, the environment built by the instructions in the seed/ and steps/ directories.
I think you might be making an incorrect assumption here:
> In order to trust python or curl, live-bootstrap needs a partial bootstrap build.

> If we agree that rm and patch are not unlike python in the sense that they are userland tools, then there's no reason to mistrust them.
Yes, we can trust rm, patch and python that are built by live-bootstrap. I have no problems with that.
Our threat model means we must not require intrinsic trust in `rm`, `patch`, `python`, `curl`, or any other tool from the system used to prepare live-bootstrap. This may be achieved by either diversification or intrinsic distrust of the tool.
I'll go through each instance that may be perceived as trust:
- `rootfs.py`. This is an optional helper/development tool that is used to make live-bootstrap faster to develop and possibly deploy, if the user's threat model permits trust of the host system. There are (possibly outdated) instructions in the README on how to set up live-bootstrap to be run WITHOUT `rootfs.py`. There are further early-stage developments being made to minimise the complexity of, or diversify the ways, to prepare a disk to be used with live-bootstrap. At any rate, we do not intrinsically rely on python from the host system. Preparing the disk is probably where we have the least diversification currently, but it is possible to increase this.
- `curl` to download source files. We have no trust in `curl`, or any other tool used to obtain or copy source files. Source files are protected by checksums within the live-bootstrap environment.
- The new mirror system, which uses `git archive` to generate tarballs from Git repositories. Presently, this does require some amount of trust in the `git archive` command. However, there is no trust of one individual, because it can be replicated by anyone, and is often replicated by upstream themselves. I imagine in the future we can avoid this by working closer to the git repository level (e.g. git bundles, and providing a git implementation early in the bootstrap).
> We always need a working system to prepare an image.
Currently, practically, this is true. But the direction of live-bootstrap is to minimise the use of, and trust in, the working system as much as possible, rather than make use of it as much as possible.
> Ideally, this working system was setup by live-bootstrap itself.
We must not ever assume this though, because that would kill the whole point of live-bootstrap; sure this provides more trust, but this is not an assumption I ever want to make.
The model of live-bootstrap's processing is something as follows
Now we also need to consider if the sources themselves are "trustworthy". There are untrustworthy sources, namely in live-bootstrap land, those that are pregenerated/binaries.
The challenge is how to remove those untrustworthy sources, and particularly, how to remove them in a trustworthy manner. My concern with your proposal is that we are increasing the untrustworthy component of processing. So is it worth increasing untrustworthy processing for the benefit of being more certain that no untrustworthy sources are provided into the inner rectangle in the diagram? I'm not convinced.
> Honestly, as I mentioned, I think of this as a long-term thing and more geared towards build systems than individual bare metal users. There are probably a lot of rough edges and details I'm missing.
Agreed!
Thank you very much for your thoughts. These are very helpful to me and to live-bootstrap generally.
A few things to add now that I think of it:
- all source code is trusted
- all binaries are untrusted, apart from the minimum binary seeds
- all binaries built within the live-bootstrap environment are trusted
- this is why everything within the live-bootstrap environment is trustworthy, because it is built up from source code only
- and why everything outside the live-bootstrap environment is not inherently trustworthy, because it is intrinsically using untrusted binaries
I am going to write up a document on this in the next month or so hopefully. Stay tuned
@fosslinux
> There are further early-stage developments being made to minimise the complexity of, or diversify the ways, to prepare a disk to be used with live-bootstrap
I currently have an automated solution for preparing the image with very simple tools. It depends only on sh, dd, wc, cat, find and mkdir. Tested on qemu but not real hardware.
However, if I try to include all distfiles in the image beforehand, it gets stuck on make_fiwix_initrd, due to what I believe are size limits in builder-hex0 (I could be wrong; if you know how to fix that, it would help me a lot!). I then noticed that the python launcher caps the distfiles to those before get_network (I wasn't doing that).
To fix that, I've been working on a tiny local webserver based on nc and mkfifo to act similarly to the SimpleMirror package (mixed results for now: the server starts and the VM sees it, but I get some corrupted downloads). Honestly, though, I would prefer having all distfiles included in the very first image.
Let me know if you'd be interested in that. I can open a draft PR with my progress.
> My concern with your proposal is that we are increasing the untrustworthy component of processing.
We could theoretically make it a part of "a live-bootstrap run" if we accept a reboot and self-modifying image between builder-hex0 and fiwix. I also have a working prototype for that, but it's very early stage. There are no external tools or manipulations between the reboots, the image does it all by itself.
> Thank you very much for your thoughts. These are very helpful to me and to live-bootstrap generally.
No problem! This is a very interesting project and it's me that should be thanking you folks for the hard work!