
rootfs: make pivot_root(2) dance handle initramfs case

Open cyphar opened this issue 1 year ago • 11 comments

While pivot_root(2) normally refuses to pivot a mount if you are running with / as initramfs (because initramfs doesn't have a parent mount), you can create a bind-mount of / and make that your new root to work around this problem. This does use chroot(2), but this is only done temporarily to set current->fs->root to the new mount. Once pivot_root(2) finishes, the chroot(2) and / are gone.

Variants of this hack are fairly well known and are used all over the place (see 1, 2), but until now we have forced users into a far less secure configuration with --no-pivot. This is a slightly modified version that uses the container rootfs as the temporary spot for the / clone, which allows runc to continue working with read-only image-based OS images.

Signed-off-by: Aleksa Sarai [email protected]

cyphar avatar Oct 10 '24 00:10 cyphar

Okay, I managed to test this and this version definitely works. The setup is not too complicated, but I'm not sure if we could practically test this within our CI (what would be a nice way of verifying the container ran inside the VM?).

Script used to create the initramfs (openSUSE)
#!/bin/bash

set -Eeuo pipefail

#sudo zypper in -y busybox syslinux skopeo umoci

sudo rm -rf boot-img/

# A freshly downloaded binary is not executable, so mark it as such.
[ -e runc ] || { curl -sSL "https://github.com/opencontainers/runc/releases/download/v1.2.0/runc.amd64" -o runc && chmod +x runc; }

[ -d opensuse ] || skopeo copy docker://opensuse/tumbleweed:latest oci:opensuse:tumbleweed
[ -d bundle ] || sudo umoci unpack --image opensuse:tumbleweed ./bundle

mkdir -vp boot-img/

pushd boot-img

mkdir -p ./usr/bin
ln -sv usr/bin ./bin

# Hard-link runc into the image.
ln ../runc ./usr/bin/runc
# Copy the rootfs bundle.
mkdir -p run
sudo cp -aR ../bundle ./run/bundle

# openSUSE makes /usr/bin/busybox non-static, and you can't ask busybox.install
# to install the static version. So install busybox using symlinks and then
# replace busybox with busybox-static.
busybox.install . --symlinks
cp -v /usr/bin/busybox-static ./usr/bin/busybox

# Boot into a shell.
cat >./init <<EOF
#!/bin/sh

echo "HELLO WORLD"

mkdir -p /proc
mount -t proc proc /proc

mkdir -p /sys
mount -t sysfs sysfs /sys

mkdir -p /sys/fs/cgroup
mount -t cgroup2 cgroup2 /sys/fs/cgroup

mkdir -p /tmp
mount -t tmpfs tmpfs /tmp

mkdir -p /dev
mount -t devtmpfs devtmpfs /dev
mkdir -p /dev/pts
mount -t devpts -o newinstance devpts /dev/pts
mkdir -p /dev/shm
mount --bind /tmp /dev/shm

/bin/sh
EOF
chmod +x ./init

# Build our init.cpio.
sudo find . | sudo cpio -o -H newc > ../init.cpio
popd

You can then boot a VM with this setup by running qemu-system-x86_64 -kernel /boot/vmlinuz -initrd ./init.cpio -m 2G -nographic -append console=ttyS0. To verify that the new version works, run runc run -b /run/bundle ctr inside the VM.
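Before running runc inside the VM, it can be worth confirming that / really is the initramfs. The following is a minimal sketch, assuming the kernel reports the initramfs root with filesystem type rootfs in /proc/self/mountinfo. The helper name root_fstype is hypothetical and not part of runc.

```shell
# Hypothetical helper (not part of runc): print the filesystem type of the
# mount at "/" according to a mountinfo-format file.
root_fstype() {
    # Default to the calling process's own mount table.
    mountinfo="${1:-/proc/self/mountinfo}"
    awk '
        $5 == "/" {
            # Fields 7+ are optional and end at a lone "-"; the filesystem
            # type is the field right after that separator.
            for (i = 7; i <= NF; i++) {
                if ($i == "-") { fstype = $(i + 1); break }
            }
        }
        # If several mounts share "/", keep the last one listed, which is
        # normally the one on top.
        END { if (fstype != "") print fstype; else exit 1 }
    ' "$mountinfo"
}
```

Calling root_fstype with no arguments from the /bin/sh started by init should report rootfs if the assumption above holds, whereas a normal disk-based host would report its real root filesystem type.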

cyphar avatar Oct 25 '24 07:10 cyphar

@kolyshkin @AkihiroSuda Do you want me to try to come up with a CI test for this case? I'm not really sure if there is a nice way of testing the output of qemu (doesn't GitHub block nested VMs as well?). Then again, maybe we could make the init script run runc and just parse the -nographic -append console=ttyS0 output?
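The console-parsing idea could look roughly like the sketch below. It assumes the init script is changed to print a sentinel line once runc exits. The marker RUNC_INITRAMFS_TEST and the function check_console_log are hypothetical names, not anything that exists in runc's CI.

```shell
# Hypothetical: the guest /init would end with something like
#   runc run -b /run/bundle ctr; echo "RUNC_INITRAMFS_TEST exit=$?"
# and the host would capture the serial console into a log file.

# Succeed only if the log contains the sentinel reporting exit status 0.
check_console_log() {
    log="$1"
    # Serial consoles often emit CRLF line endings, so strip carriage
    # returns before matching. Use the last sentinel if there are several.
    status="$(tr -d '\r' < "$log" \
        | sed -n 's/^RUNC_INITRAMFS_TEST exit=\([0-9][0-9]*\)$/\1/p' \
        | tail -n 1)"
    [ -n "$status" ] && [ "$status" -eq 0 ]
}
```

On the host, the qemu command from the earlier comment could be wrapped in timeout with its output redirected to a file such as console.log, followed by check_console_log console.log. Because the script above writes init through an unquoted heredoc, the $? in the sentinel line would have to be written as \$? there so that it is expanded in the guest rather than at build time.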

cyphar avatar Oct 27 '24 13:10 cyphar

I'm afraid it's going to be tough. In case qemu is not working in GHA (last time I checked it was only working on Mac OS X instances, but that was ~2y ago), try cirrus-ci. It uses GCP, and with this instance it's possible to use nested virt (which we do to run vagrant-libvirt):

https://github.com/opencontainers/runc/blob/4ad9f7fd36bdbc931422d2ee446c68311145f519/.cirrus.yml#L22-L29

kolyshkin avatar Oct 28 '24 01:10 kolyshkin

The default Linux instances of GHA now support nested virt. (Used in the CI of containerd, Lima, etc.)

AkihiroSuda avatar Oct 28 '24 01:10 AkihiroSuda

Okay, I managed to get this working on AlmaLinux 9 so now it's ready for review again @kolyshkin @AkihiroSuda.

cyphar avatar Oct 29 '24 06:10 cyphar

I'm assuming we won't include this in 1.2.1 as this is still a draft. We can include it later, though.

rata avatar Nov 01 '24 16:11 rata

> I'm assuming we won't include this in 1.2.1 as this is still a draft. We can include it later, though.

This looks more like a "new feature" than a "bug fix" to me, and so it is probably 1.3 material.

kolyshkin avatar Nov 02 '24 00:11 kolyshkin

Yeah, 1.2.1 was the wrong thing to label it. (Especially since there are some subtleties I need to look into...)

cyphar avatar Nov 02 '24 00:11 cyphar

> This looks more like a "new feature" than a "bug fix" for me, and so it is probably 1.3 material.

Removed the backport/1.2-todo label.

@cyphar is this still a draft?

kolyshkin avatar Nov 13 '24 19:11 kolyshkin

@kolyshkin Yes, I still need to figure out how to fix the issue that @lifubang found. With the right setup you can end up with the wrong root being mounted in the container, and it's not obvious looking at the source of pivot_root why that's happening...

cyphar avatar Nov 14 '24 02:11 cyphar

This won't be ready for 1.3.0. I haven't managed to debug the issue that @lifubang found yet.

cyphar avatar Feb 26 '25 05:02 cyphar