install: Spike on working unprivileged

Open cgwalters opened this issue 1 year ago • 6 comments

The need for install to-filesystem|to-disk to operate privileged has come up in a few contexts, most recently in https://github.com/osbuild/bootc-image-builder/issues/98#issuecomment-1989213198

The mkfs.ext4|xfs|etc tools support a -d <root> to create filesystems unprivileged. However...the annoying problem here is that handling things like uid/gid and selinux labels unprivileged gets hard.
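To make the `-d <root>` path concrete, here is a minimal sketch for ext4 only. The directory and image names are made up for illustration, and the snippet skips the mkfs step if `mkfs.ext4` is not on the PATH. As far as I understand, files are recorded with whatever uid/gid they have in the source tree, which is exactly the limitation described above.

```shell
#!/bin/sh
# Sketch: build an ext4 image from a directory tree without root and without mounting.
# All paths here are illustrative, not taken from bootc.
set -eu

workdir=$(mktemp -d)
mkdir -p "$workdir/rootfs/etc"
echo "demo" > "$workdir/rootfs/etc/hostname"

# Sparse 200M backing file for the filesystem.
truncate -s 200M "$workdir/disk.img"

if command -v mkfs.ext4 >/dev/null 2>&1; then
    # -d copies the contents of the directory into the new filesystem;
    # -F avoids the prompt about the target not being a block device.
    mkfs.ext4 -q -F -d "$workdir/rootfs" "$workdir/disk.img"
    status="created"
else
    status="skipped (mkfs.ext4 not installed)"
fi
echo "ext4 image: $status"
```

Run as a regular user, the files end up owned by that user's uid inside the image, so something still has to supply the intended ownership and SELinux labels.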

One hack I was thinking of here is...maybe we could experiment with something like using fuse to create a mocked-up root. IIRC OpenEmbedded has an LD_PRELOAD thing to intercept syscalls, which is pretty hacky but probably works.

What'd obviously be nicer is if these tools all took something like a composefs-style dumpfile as input. But I bet the fuse thing would work.

cgwalters avatar Oct 30 '24 20:10 cgwalters

One thing this will also help is avoiding the need for the host kernel to support a specific filesystem type (e.g. rhel kernels don't include btrfs).

cgwalters avatar Nov 04 '24 22:11 cgwalters

One thing that came up related to this in a side chat is that while tooling exists to do "simple partition" setup and basic filesystem population, more complex storage (such as LVM) isn't yet ready to be initialized in this way.

I think at a basic level if we show the value in this outside of LVM, that provides motivation to make it work there.

The second thing is: I believe that for truly complex storage it makes sense to split it up into two parts:

  • bootstrap: the stuff that must be in the partition table or filesystem to boot
  • firstboot: Things we can initialize from inside the running OS on firstboot or optionally subsequent boots

This is a quite important topic because whether partitioning/filesystem setup is defined in the OS or external to it has implications for e.g. factory reset.

I think for LVM for example, if the user wants something like / to be a 100G VG, and /var/lib/postgres to be a 1T VG, tools that are generating disk images should actually just set up the basic / VG, and synthesize systemd units that on firstboot initialize the /var/lib/postgres VG. (This whole topic of course snowballs fast into "declarative state" and not defining things via imperative mutation, such as Ansible playbooks which aim to do that for LVM, etc.)
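As a rough sketch of what "synthesize systemd units" could look like, an image builder might drop a oneshot unit like the one below into the image. Everything specific here is an assumption for illustration: the unit name, the `vg_postgres` VG name, the `data` LV name and the `/dev/disk/by-partlabel/postgres` device are all hypothetical, and the mount unit for /var/lib/postgres is left out.

```shell
#!/bin/sh
# Sketch: generate a firstboot unit that initializes a secondary VG.
# Unit name, VG/LV names and the partition label are hypothetical.
set -eu

outdir=${outdir:-$(mktemp -d)}

cat > "$outdir/init-postgres-vg.service" <<'EOF'
[Unit]
Description=Initialize the postgres volume group on first boot (illustrative)
# Only runs when systemd considers this the first boot of the system.
ConditionFirstBoot=yes

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/sbin/vgcreate vg_postgres /dev/disk/by-partlabel/postgres
ExecStart=/usr/sbin/lvcreate -l 100%FREE -n data vg_postgres
ExecStart=/usr/sbin/mkfs.xfs /dev/vg_postgres/data

[Install]
WantedBy=multi-user.target
EOF

echo "wrote $outdir/init-postgres-vg.service"
```

This is still imperative under the hood, so it only illustrates the bootstrap/firstboot split, not a declarative model.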

cgwalters avatar Nov 05 '24 19:11 cgwalters

Seeing sudo podman run ... --privileged ... --security-opt label=type:unconfined_t in the README.md of https://github.com/osbuild/bootc-image-builder has been quite a big hurdle in my exploration of bootc and how it could work as an ecosystem for developers. Maybe not from a technical point of view, but from a mental point of view and in terms of the insight into the technologies it requires.

If I need more-or-less full root to convert the bootc image to a disk format, then using virt-manager with qemu:///session and then explicitly calling necessary commands as root in that VM feels more manageable than the seemingly monolithic quay.io/centos-bootc/bootc-image-builder:latest sudo podman approach.

Ideally I'd like to be able to produce the disk format with unprivileged tools or rootless podman containers, perhaps with some limited privileged preparation steps like modprobe fuse. I'm on Fedora / Linux, and if not all options or filesystems are supported in the first iteration, that's fine.

I tried that with

$ podman run --rm -ti --device=/dev/fuse --cap-add=SYS_ADMIN registry.fedoraproject.org/fedora

I can then run

container# dnf install -y util-linux e2fsprogs fuse3
container# dd if=/dev/zero of=/mnt/disk1.img bs=1 count=0 seek=200M
container# mkfs -t ext4 /mnt/disk1.img
container# mkdir /mnt/mount
container# fuse2fs /mnt/disk1.img /mnt/mount
container# ls -la /mnt/mount/
container# touch /mnt/mount/file
container# chown 934:934 /mnt/mount/file

in that rootless container and things work. So it seems that some part of the conversion to the disk image, at least for the specific filesystem type, could be achieved unprivileged.

I do get

container# chcon -t etc_t /mnt/mount/file
chcon: failed to change context of '/mnt/mount/file' to ‘system_u:object_r:etc_t:s0’: Operation not supported

but that then gets us to the question of whether a /.autorelabel firstboot relabel might be a viable approach.
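For reference, requesting that relabel is just a matter of creating the flag file in the root of the target filesystem. A minimal sketch, where MOUNTPOINT is a hypothetical variable standing in for the fuse2fs mount (/mnt/mount above) and defaults to a scratch directory so the snippet runs anywhere:

```shell
#!/bin/sh
# Sketch: ask for a full SELinux relabel on the next boot instead of labeling at build time.
# MOUNTPOINT is hypothetical; point it at the mounted image root.
set -eu

MOUNTPOINT=${MOUNTPOINT:-$(mktemp -d)}

# On Fedora-style systems the presence of /.autorelabel triggers a relabel at boot.
touch "$MOUNTPOINT/.autorelabel"
echo "relabel requested in $MOUNTPOINT"
```

Whether this is enough for a bootc image, where much of the root is read-only and labels normally come with the image content, is something I have not verified.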

Am I at least partially looking at the problem from the side that matches yours? Could you elaborate on why https://github.com/osbuild/bootc-image-builder/issues/98 was not the right repo / place to discuss this or work on this?

adelton avatar Dec 13 '24 21:12 adelton

> container# fuse2fs /mnt/disk1.img /mnt/mount

This is the key part - yes, we could use fuse2fs to generate ext4 filesystems, but no equivalent to that exists to my knowledge for xfs or btrfs for example. I think fuse2fs is also not kept up to date with ext4 features very much since not many people use it.

The most well-maintained path for this (as such) is the "protofile" approach implemented by various mkfs.* tools that basically support passing a moral equivalent of a tarball.

cgwalters avatar Mar 26 '25 21:03 cgwalters

The mkfs.* protofile path also has an issue with xfs: I think that mkfs.xfs doesn't support xattrs (so selinux doesn't work properly).

What about libguestfs? My team recently discussed various options for rootless image builds, and this seems like the most viable path. You are basically trading a somewhat slower build (depends on KVM availability) for rootless functionality. I'm just not sure what the situation is with ppc64le on EL9+, I know RH dropped some virtualization options there.

ondrejbudai avatar Mar 28 '25 11:03 ondrejbudai

> The mkfs.* protofile path also has an issue with xfs: I think that mkfs.xfs doesn't support xattrs (so selinux doesn't work properly).

Yeah honestly it's a mess; also we hit on mkfs.ext4 not supporting fsverity.

If I could bikeshed here I would probably push all the filesystems to support something like the composefs dumpfile format, which has everything we want.

> What about libguestfs?

Definitely, this is now covered in https://github.com/osbuild/bootc-image-builder/issues/98#issuecomment-2755749338 (though I would s/libguestfs/supermin/ personally for reasons I outlined there)

> I'm just not sure what the situation is with ppc64le on EL9+, I know RH dropped some virtualization options there.

Yeah that one was heading to be a nightmare (ref https://github.com/coreos/coreos-assembler/issues/2782 - and notice there how long some of these conversations have gone on in different forms...)

But I think as of current CentOS Stream 10 the status quo is that ppc64le + KVM is back, though I'm not 100% certain.

cgwalters avatar Mar 28 '25 12:03 cgwalters

For now, what landed in https://github.com/bootc-dev/bcvk/ solves the core problem here.

It does still make sense I think to have a protofile-based install path at some point, but we can track that separately.

cgwalters avatar Oct 08 '25 13:10 cgwalters