bootc-image-builder
Requires running under rootful podman
A paper cut we hit today is that podman desktop defaults to rootless, and bib doesn't work with that because we need loopback. The core problem is we need to write Linux filesystems. The important Linux filesystems like XFS/ext4 in general really want to be only written by code from the Linux kernel.
Running the Linux kernel is either done by reusing the host kernel (privileged), or running a VM. But in the podman machine case we're already in a VM, which gets us into nested virt, and on Mac at least that's going to involve full emulation, which usually mostly works but isn't considered a production scenario and definitely hits weird random bugs.
My inclination because we're already running this container with --privileged is just to behind the scenes reuse the fact that podman machine uses FCOS today and the core user has passwordless sudo enabled and basically reuse that to re-execute ourselves with real root privileges. Yes, this would not really be "rootless" but I personally don't care about that and I don't think users would really in general either.
We're also investigating if we can do at least (some) of the filesystem work with libguestfs.
libguestfs is just a way to run VMs, so the nested virt concerns above apply.
Right, so I was just reading about the internals and yeah libguestfs uses qemu to boot a kernel and sets up an "appliance" to talk to it. :|
The 3rd option (beyond host kernel and virt) is https://github.com/lkl/linux which is relatively new and specifically cptofs is about this problem but...I really don't think it's worth trying to scope this in right now.
libguestfs doesn't require KVM: https://libguestfs.org/guestfs-faq.1.html I guess it just falls back to emulation if there's no KVM. The question is how fast it is.
Mounting directly uses FUSE and is pretty poor, but supposedly using the shell can be quite good. We can benchmark of course.
FTR, this works on rootless podman machine on macOS:
test.sh
#!/usr/bin/env bash
set -euo pipefail
fname="${1}"
truncate -s 100M "${fname}"
mkfs.ext4 "${fname}"
guestfish --rw -a "${fname}" << EOF
run
list-filesystems
mount /dev/sda /
copy-in test.sh /
cat /test.sh
quit
EOF
echo "DONE"
rm "${fname}"
Containerfile
FROM fedora:39
RUN dnf -y install libguestfs
ENV LIBGUESTFS_BACKEND=direct
COPY test.sh /test.sh
ENTRYPOINT ["/test.sh"]
Note that https://github.com/cgwalters/osbuildbootc/ doesn't use libguestfs, but it does use the underlying tool (supermin) to construct a VM root filesystem out of the container rootfs and works unprivileged today.
Honestly I think that code and approach there is much simpler than the "higher level" libguestfs approach because we have the ability to drive things at a low level.
So if we go down this path I think it'd make sense to look at merging that code.
(The other thing osbuildbootc does is defer all the heavy lifting to bootc install to-disk, which is https://github.com/osbuild/bootc-image-builder/issues/18 )
> the underlying tool (supermin) to construct a VM root filesystem out of the container rootfs
That said, what would make much more sense in modern times is to use virtiofs as the root filesystem instead; it probably wouldn't be too hard. I just haven't dug into it.
> Honestly I think that code and approach there is much simpler than the "higher level" libguestfs approach because we have the ability to drive things at a low level.
For example, forcing indirection through libguestfs's high level APIs reintroduces the same problems that osbuild creates today and that motivate https://github.com/ostreedev/ostree/pull/3094 - what we're doing often wants to do quite low level filesystem and block device things. libguestfs is just high level sugar for executing arbitrary code in a transient VM, and we can construct a transient VM without it.
I'm worried that doing the whole build under supermin might be extremely slow if KVM is not there. Whereas if we just offload the final copying part, it might be fine. I know that @achilleas-k is working on some benchmarks.
Also, full QEMU emulation isn't supported on RHEL. I wonder if guestfs has an exception....
libguestfs doesn't have an exception; its main use case just targets being used from Linux hosts.
I am currently catching up on https://github.com/containers/podman-desktop-extension-bootc/issues/93. What's the current status of this issue? The root requirement can be documented (as pointed out in https://github.com/containers/podman-desktop-extension-bootc/issues/93) but I want to have a better understanding.
I doubt we're going to do anything major here soon, I think we should just document switching or initializing with --rootful.
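For anyone landing here, a minimal sketch of what "switching or initializing with --rootful" looks like. `podman machine init --rootful` covers a new machine; for an existing one the function below uses `podman machine set --rootful`, stopping the machine first as the conservative sequence. The function name and the `PODMAN` override are hypothetical, added only so the steps can be previewed without touching a real machine.

```shell
#!/usr/bin/env bash
# Switch an existing (default) podman machine to rootful mode.
# PODMAN is a hypothetical override: PODMAN=echo prints the steps
# instead of running them.
switch_machine_to_rootful() {
    local podman="${PODMAN:-podman}"
    "$podman" machine stop
    "$podman" machine set --rootful
    "$podman" machine start
}

# For a brand-new machine, the one-step equivalent would be:
#   podman machine init --rootful
```

Previewing with `PODMAN=echo switch_machine_to_rootful` prints the three subcommands in order.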
I don't think we have any ways to fix it. bootc-image-builder is meant to run in environments (Mac) without KVM support. libguestfs is utterly slow without KVM. mkfs.xfs protofiles don't work well with the bootc install model (unless bootc gets support for it).
EDIT: Just to clarify, the issue is that we need to mount the disk file so we can write the files into it. That can be done only by a root in the top-level user namespace. Root in a rootless container simply cannot do it.
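To make the "root in the top-level user namespace" distinction concrete, here is a rough heuristic sketch (not something bib does today, as far as this thread says). It reads the first line of `/proc/self/uid_map`: in the initial user namespace that line is the identity mapping `0 0 4294967295`, whereas a rootless container shows root mapped to an unprivileged host UID. The function name and the optional file argument are hypothetical, and a nested namespace with a full identity mapping would fool it, so treat it as a hint only.

```shell
#!/usr/bin/env bash
# Heuristic: succeed if the UID map looks like the initial user namespace.
# $1 is an optional path to a uid_map file (assumption, for testing);
# it defaults to the calling process's own map.
in_initial_userns() {
    local map_file="${1:-/proc/self/uid_map}"
    local inside="" outside="" count=""
    # Only the first mapping line matters for this check.
    read -r inside outside count < "$map_file" || [ -n "$inside" ] || return 1
    [ "$inside" = "0" ] && [ "$outside" = "0" ] && [ "$count" = "4294967295" ]
}
```

A caller could combine it with `id -u`: being UID 0 while this check fails is exactly the "root in a rootless container" case described above.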
> mkfs.xfs protofiles don't work well with the bootc install model (unless bootc gets support for it).
Right, to elaborate on that slightly it would create wildly distinct mechanisms for "day 1" versus "day 2". It's not impossible...but would be extremely hard to maintain over time.
I think this should actually mostly land on the bootc side; making partitions unprivileged is easy. So moving to https://github.com/containers/bootc/issues/859
> I think this should actually mostly land on the bootc side; making partitions unprivileged is easy. So moving to bootc-dev/bootc#859
Can we reopen this? Since for the short to mid-term b-i-b is meant to be user-facing, it feels like something we should track here, even if the longer term fix ends up being elsewhere.
For the shorter term... I want to second using supermin here. Basically if we just do:
- if root, then run as is
- if non-root but /dev/kvm is present, then do a recursive call to rerun the invocation within supermin
- if non-root and /dev/kvm is not present, error out
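The three cases above can be sketched as a small dispatch function. This is only an illustration of the proposal, not existing bib code: the function name and the `direct`/`supermin` labels are hypothetical, and "present" is interpreted here as "readable and writable by the current user", which is an assumption on top of what the list says.

```shell
#!/usr/bin/env bash
# Decide how to run the build.
# $1: UID to evaluate (defaults to the current user)
# $2: KVM device path (defaults to /dev/kvm)
# Prints "direct" or "supermin"; returns non-zero if neither is possible.
choose_build_mode() {
    local uid="${1:-$(id -u)}"
    local kvm_dev="${2:-/dev/kvm}"
    if [ "$uid" -eq 0 ]; then
        # Root: run as is.
        echo "direct"
    elif [ -r "$kvm_dev" ] && [ -w "$kvm_dev" ]; then
        # Non-root with usable KVM: re-run the invocation inside a VM.
        echo "supermin"
    else
        echo "error: not root and $kvm_dev is not usable" >&2
        return 1
    fi
}
```

The actual re-execution inside supermin is left out on purpose; the point is only that the choice can be made up front and fail early with a clear message.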
I think that'd go a long way. I may be wrong, but my intuition is that in the non-Linux case where you're actually using podman machine, you likely don't care as much that it's running rootful because it's in a VM anyway that exists purely for running containers and not much else. Whereas if you're already on Linux working on your primary system, you likely care very much what you run as root.
And this helps for running this in pipelines. E.g. all CoreOS artifacts today are built in pipelines that have no root access, only /dev/kvm. As a guest user in a managed cluster, /dev/kvm access is a much lower barrier than having true root access and fiddling with loopback devices etc... (See also https://github.com/coreos/fedora-coreos-tracker/issues/1906)
Sorry that I missed this issue. Supermin looks indeed very interesting and I did a quick experiment based on the (excellent) work/examples from https://github.com/cgwalters/osbuildbootc/ (thanks!) to build an image from a supermin version of osbuild and it seems quite feasible and might be a nice option for both image-builder-cli and bootc-image-builder.
Thanks @mvo5 for looking at this. Indeed I do think it makes a lot of sense to do because it's already written code (what I did there to be clear is a 70% fork, 30% rewrite of the code in coreos-assembler) which has been doing this for years successfully.
However, one important bit of background on this is that when we started this thing out we had a constraint to work on macOS too...and going from supporting 1 OS (Linux) to 2 OSes doesn't make things twice as hard, it's at least 8 times harder...
We can't do the KVM thing there; it's actually fantastically more complicated because we would need to run code on the host to spawn new VMs. I tried to sketch this out a bit in https://gitlab.com/fedora/bootc/tracker/-/issues/2#note_1941758767
But anyways, yeah let's not care about that for now and just make Linux better.
Nice, thanks @mvo5!
I'd suggest taking https://github.com/coreos/coreos-assembler/blob/main/src/supermin-init-prelude.sh as a base. A few tiny tweaks went into it after @cgwalters's fork that I think are relevant here (notably https://github.com/coreos/coreos-assembler/commit/e6aa66a55b770ae20b2dc555c48bde40b24530a5).
I opened https://github.com/osbuild/image-builder-cli/pull/189 now. It needs some small tweaks and I need to double check it with bib, but it should work there in the same way.
Oh, thanks for this discussion. The root privilege requirement is indeed very awkward -- effectively, there is no privilege separation at all, and this is impossible to run in e.g. kubernetes or CI environments which don't give you root access on the host.
libguestfs, supermin etc. are all very expensive. It may be more useful and much more efficient to take a page out of mkosi's book -- it can create full OS images entirely without root privileges. The core trick is to never mount the target file system in the first place, which eliminates all the root/host privileges/"kernel needs to write it" requirements. And instead leave the population of file systems to the tools that know the file system by definition -- mkfs. E.g. mkfs.ext4 has this option:
-d root-directory|tarball Copy the contents of the given directory or tarball into the root directory of the file system. Tarball input is only available if mke2fs was compiled with libarchive support enabled and if the libarchive shared library is available at run-time. The special value "-" will read a tarball from standard input.
mkfs.btrfs has an equivalent -r|--rootdir <rootdir> option. I haven't checked mkfs.xfs closely, but even if that doesn't support it, being able to run rootless with --rootfs=btrfs|ext4 would be a big win.
I haven't looked into the details (it seems systemd-repart may also come into play here), but this feels very promising.
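As a rough sketch of the "never mount it" idea, the function below populates a filesystem image straight from a directory using the two options mentioned above (`mkfs.ext4 -d` and `mkfs.btrfs --rootdir`). The function name and the `DRY_RUN` switch are hypothetical, and the image file is assumed to have been pre-created at the desired size (e.g. with `truncate`, as in the test.sh earlier in this thread).

```shell
#!/usr/bin/env bash
# Create a filesystem in IMAGE and fill it from ROOTDIR without mounting.
# $1: filesystem type (ext4 or btrfs)
# $2: directory tree to copy in
# $3: pre-sized image file
# Set DRY_RUN=1 to print the command instead of running it.
mkfs_from_rootdir() {
    local fstype="$1" rootdir="$2" image="$3"
    case "$fstype" in
        ext4)
            ${DRY_RUN:+echo} mkfs.ext4 -d "$rootdir" "$image"
            ;;
        btrfs)
            ${DRY_RUN:+echo} mkfs.btrfs --rootdir "$rootdir" "$image"
            ;;
        *)
            echo "error: no populate-at-mkfs support assumed for $fstype" >&2
            return 1
            ;;
    esac
}
```

For example, `truncate -s 1G root.img && mkfs_from_rootdir ext4 ./rootfs root.img` would need neither root nor a loop device.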
Yes, however that can't do all supported file systems (as not all filesystems can be constructed from a root tree) and depends on systemd-repart doing the resizing/partitioning at boot, which doesn't fit (some) use cases, for example non-GPT systems. This has been partially looked into before and I think @ondrejbudai might have more details because I think we did experiment with it :)
mkosi/systemd-repart comes with a lot of limitations:
- it only supports GPT partition tables (this is becoming a niche over time, the biggest blocker in the hobbyist scene is raspberry pi 3 with older firmwares)
- it doesn't support anything based on device-mapper, so no LVM, LUKS, nor dm-verity
- mkfs.xfs protofiles don't support extended attributes, rendering them unusable for SELinux-enabled systems
Yes, you can create perfectly valid images with ext4/btrfs and GPT, but the cost would be pretty high for us, because we would basically have to create a second partitioner with a lot of limitations, with very little hope of ever actually closing these gaps (there's some work on fixing the xattr situation on xfs, but the device-mapper situation is pretty dire).
My opinion is that it's much better to invest into a micro-vm based (supermin) solution. CoreOS and automotive folks already have their wrappers around osbuild to perform rootless builds. The rise of OpenShift Virtualization showed us that having k8s with /dev/kvm available is becoming more and more popular. Finally, this would allow us to have a full feature parity between rootless and rootful builds: it doesn't matter how you run your image building tool, it just works. We have basically a fully working PoC for this by @mvo5 in https://github.com/osbuild/image-builder-cli/pull/189.
To be clear though some of this is underlying tool limitations, especially protofile ones. We also hit the same for fsverity in https://github.com/tytso/e2fsprogs/pull/203/
I do think there's some use case for fleshing this out more eventually, but:
> My opinion is that it's much better to invest into a micro-vm based (supermin) solution.
Yeah. Though I think the really strongest argument is that almost always when building operating systems one wants to test them, especially sanity checks, and for that you really want a full VM. That's what I always had as the rationale for coreos-assembler and I think it's proven itself in having an opinionated tool that bundles building and testing.
Aside from the above, having a (micro) VM is also the only way to prevent host things from leaking into the build environment and to avoid host kernel limitations on the buildroot. This is an often recurring issue in mock (where a (micro) VM approach has also been suggested).
Mock tracker: https://github.com/rpm-software-management/mock/issues/1559
@ondrejbudai
> it doesn't support anything based on device-mapper, so no LVM, LUKS, nor dm-verity
This is only partially correct: systemd-repart supports unprivileged creation of LUKS and dm-verity just fine without loop devices. LVM is indeed unsupported because I don't think the LVM tooling supports unprivileged operation.
Just pointing out, but this issue should probably be the highest priority on this project. Outside of a dedicated and relatively minimal Fedora VM, it's unlikely that any use case of this tool currently works. My experience is definitely that it fails 100% of the time on every system.
It's been extremely well known and documented for more than a decade that the approach of using loop devices directly on an arbitrary host system will rarely ever work successfully. All kinds of things on a system try to automatically involve themselves in the processes related to disk partitioning, mounting, file system creation, and file system population, especially when it comes to root partitions. You see effects anywhere from newly created partitions not being mountable due to race conditions, to file systems not being mountable/unmountable, to file system contents being corrupted. Additionally, you're restricted by the "host" OS's support/capabilities, including file system types you're trying to use, and SELinux/security xattr support. In the case of SELinux specifically, if you try to pre-label anything in the rootfs before deploying into the image, you run into host limitations on user/daemon SELinux permissions to create files with certain protected SELinux labels that appear in the target rootfs.
I suspect the reason these extensive issues haven't been more obvious is that most users are currently testing on a relatively minimal Fedora system, probably a VM of some sort, that they've manually created.
If you use a somewhat minimal Fedora VM, you've eliminated a ton of the tools/processes that normally involve themselves and cause conflicts because they get disabled when real hardware isn't present. Additionally you've matched your "host" OS to the target OS, so you have virtually guaranteed file system type availability and file system capabilities.
These are all hiding the fact that you won't have any of these guarantees under most use cases defined as relevant for these tools. The only way to make sure they're available is to provide them yourself, which is virtualization.