runc
rootfs: make pivot_root(2) dance handle initramfs case
While pivot_root(2) normally refuses to pivot a mount if you are running
with / as initramfs (because initramfs doesn't have a parent mount), you
can create a bind-mount of / and make that your new root to work around
this problem. This does use chroot(2), but this is only done temporarily
to set current->fs->root to the new mount. Once pivot_root(2) finishes,
the chroot(2) and / are gone.
Variants of this hack are fairly well known and are used all over the
place (see 1, 2) but until now we have forced users to have a far less
secure configuration with --no-pivot. This is a slightly modified
version that uses the container rootfs as the temporary spot for the /
clone -- this allows runc to continue working with read-only image-based
OS images.
Signed-off-by: Aleksa Sarai [email protected]
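For anyone trying to reproduce the failure, a rough way to check whether a machine is in the "/ is initramfs" situation described above is to look at the mount table. This is only a sketch: the helper name root_is_initramfs is hypothetical, it is not what the patch itself does, and it assumes the kernel reports the initial root with filesystem type "rootfs" in /proc/mounts.

```shell
# Hypothetical helper (not part of runc): succeed if the mount table in $1
# (default: /proc/mounts) says that / is still the kernel's initial rootfs.
# Assumption: the initramfs root shows up with filesystem type "rootfs".
root_is_initramfs() {
    mounts_file="${1:-/proc/mounts}"
    # Later entries for "/" shadow earlier ones, so keep the last fstype seen.
    awk '$2 == "/" { fstype = $3 } END { exit !(fstype == "rootfs") }' "$mounts_file"
}
```

Running root_is_initramfs inside the VM from the reproduction setup should succeed, while on a normally booted host (where a real filesystem is mounted over the initial root) it should fail.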
Okay, I managed to test this and this version definitely works. The setup is not too complicated, but I'm not sure if we could practically test this within our CI (what would be a nice way of verifying the container ran inside the VM?).
Script used to create the initramfs (openSUSE)
#!/bin/bash
set -Eeuo pipefail
#sudo zypper in -y busybox syslinux skopeo umoci
sudo rm -rf boot-img/
[ -e runc ] || curl -sSL "https://github.com/opencontainers/runc/releases/download/v1.2.0/runc.amd64" -o runc
# curl does not set the executable bit on the downloaded binary.
chmod +x runc
[ -d opensuse ] || skopeo copy docker://opensuse/tumbleweed:latest oci:opensuse:tumbleweed
[ -d bundle ] || sudo umoci unpack --image opensuse:tumbleweed ./bundle
mkdir -vp boot-img/
pushd boot-img
mkdir -p ./usr/bin
ln -sv usr/bin ./bin
# Copy runc.
ln ../runc ./usr/bin/runc
# Copy the rootfs bundle.
mkdir -p run
sudo cp -aR ../bundle ./run/bundle
# openSUSE makes /usr/bin/busybox non-static, and you can't ask busybox.install
# to install the static version. So install busybox using symlinks and then
# replace busybox with busybox-static.
busybox.install . --symlinks
cp -v /usr/bin/busybox-static ./usr/bin/busybox
# Boot into a shell.
cat >./init <<EOF
#!/bin/sh
echo "HELLO WORLD"
mkdir -p /proc
mount -t proc proc /proc
mkdir -p /sys
mount -t sysfs sysfs /sys
mkdir -p /sys/fs/cgroup
mount -t cgroup2 cgroup2 /sys/fs/cgroup
mkdir -p /tmp
mount -t tmpfs tmpfs /tmp
mkdir -p /dev
mount -t devtmpfs devtmpfs /dev
mkdir -p /dev/pts
mount -t devpts -o newinstance devpts /dev/pts
mkdir -p /dev/shm
mount --bind /tmp /dev/shm
/bin/sh
EOF
chmod +x ./init
# Build our init.cpio.
sudo find . | sudo cpio -o -H newc > ../init.cpio
popd
You can then run a VM with this setup using qemu-system-x86_64 -kernel /boot/vmlinuz -initrd ./init.cpio -m 2G -nographic -append console=ttyS0. To verify that this new version works, run runc run -b /run/bundle ctr inside the VM.
@kolyshkin @AkihiroSuda Do you want me to try to come up with a CI test for this case? I'm not really sure if there is a nice way of testing the output of qemu (doesn't GitHub block nested VMs as well?). Then again, maybe we could make the init script run runc and just parse the -nographic -append console=ttyS0 output?
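The "parse the console output" idea could look roughly like the following. This is a sketch under assumptions: the marker string RUNC-TEST-OK and the function name console_log_passed are made up for illustration, and the init script would have to be changed to print the marker on its own line after runc exits successfully.

```shell
# Hypothetical marker that a modified init script would print on success.
MARKER="${MARKER:-RUNC-TEST-OK}"

# Succeed if the captured serial console log in $1 contains the marker
# as a complete line.
console_log_passed() {
    # Serial consoles typically emit CRLF line endings, so strip \r first.
    tr -d '\r' < "$1" | grep -xF "$MARKER" > /dev/null
}
```

A CI job would redirect the output of the qemu-system-x86_64 command into a file (with some timeout around it, since the VM does not exit on its own in the current init script) and then call console_log_passed on that file.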
I'm afraid it's going to be tough. In case qemu is not working in GHA (last time I checked it was only working on macOS instances, but that was ~2 years ago), try cirrus-ci. It uses GCP, and with this instance it's possible to use nested virt (which we do to run vagrant-libvirt):
https://github.com/opencontainers/runc/blob/4ad9f7fd36bdbc931422d2ee446c68311145f519/.cirrus.yml#L22-L29
The default Linux instances of GHA now support nested virt. (Used in the CI of containerd, Lima, etc.)
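Since nested virt may or may not be available depending on the runner, a test script could pick the QEMU accelerator at runtime. A minimal sketch, assuming that a readable and writable /dev/kvm means KVM is usable (the function name qemu_accel_args is hypothetical):

```shell
# Print the QEMU acceleration flag to use, given a KVM device node
# (default: /dev/kvm). Falls back to software emulation (TCG) without KVM.
qemu_accel_args() {
    kvm_dev="${1:-/dev/kvm}"
    if [ -r "$kvm_dev" ] && [ -w "$kvm_dev" ]; then
        echo "-accel kvm"
    else
        echo "-accel tcg"
    fi
}
```

The output can be spliced into the qemu-system-x86_64 command line unquoted, e.g. $(qemu_accel_args). With TCG the boot is noticeably slower, so any timeout around the VM would need to be more generous.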
Okay, I managed to get this working on AlmaLinux 9 so now it's ready for review again @kolyshkin @AkihiroSuda.
I'm assuming we won't include this in 1.2.1 as this is still a draft. We can include it later, though.
This looks more like a "new feature" than a "bug fix" for me, and so it is probably 1.3 material.
Yeah, 1.2.1 was the wrong thing to label it. (Especially since there are some subtleties I need to look into...)
Removed the backport/1.2-todo label.
@cyphar is this still a draft?
@kolyshkin Yes, I still need to figure out how to fix the issue that @lifubang found. With the right setup you can end up with the wrong root being mounted in the container, and it's not obvious looking at the source of pivot_root why that's happening...
This won't be ready for 1.3.0. I haven't managed to debug the issue that @lifubang found yet.