Do we always need to run buildah containers with BUILDAH_ISOLATION=chroot?
Hi,
I have a buildah container image (quay.io/buildah/stable:latest) running in Kubernetes with its default setting of BUILDAH_ISOLATION=chroot. However, I am wondering: is this really required when running buildah as a container?
Can someone please explain this, from https://github.com/containers/buildah/blob/main/docs/buildah-build.1.md: _"--isolation type
Controls what type of isolation is used for running processes as part of RUN instructions. Recognized types include oci (OCI-compatible runtime, the default), rootless (OCI-compatible runtime invoked using a modified configuration, with --no-new-keyring added to its create invocation, reusing the host's network and UTS namespaces, and creating private IPC, PID, mount, and user namespaces; the default for unprivileged users), and chroot (an internal wrapper that leans more toward chroot(1) than container technology, reusing the host's control group, network, IPC, and PID namespaces, and creating private mount and UTS namespaces, and creating user namespaces only when they're required for ID mapping).
Note: You can also override the default isolation type by setting the BUILDAH_ISOLATION environment variable. export BUILDAH_ISOLATION=oci"_
In many cases, a container that's run using the image will not be given enough privileges for buildah run or the handling of RUN instructions in Dockerfiles in buildah build to be able to launch a container using an actual runtime like crun or runc. The chroot-based method is severely limited in functionality compared to crun or runc, but in return it exercises fewer privileges than they might, so it works (or "works") in a number of cases where they might not. If your environment provides enough privileges to not have to use chroot, feel free to override it.
Thanks @nalind for your reply. The documentation says the default value for BUILDAH_ISOLATION is "oci", but in the Containerfile of the image quay.io/buildah/stable:latest it appears to be set to BUILDAH_ISOLATION=chroot: https://github.com/containers/image_build/blob/main/podman/Containerfile https://github.com/containers/image_build/blob/main/buildah/Containerfile
Yes, the container image has the environment variable set in it to override the compiled-in default.
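For example, a quick way to see and override that from inside such a container (a sketch; the Pod example further down in this thread does the same thing via the pod's `env:` instead):

```bash
# Inside a container started from quay.io/buildah/stable:
echo "$BUILDAH_ISOLATION"        # -> chroot, the value baked into the image
export BUILDAH_ISOLATION=oci     # override it for this shell, as the docs describe
# or override it per command:
buildah build --isolation=oci .  # hypothetical build using the current directory as context
```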
I have a similar need to run buildah in Kubernetes with better isolation.
> If your environment provides enough privileges to not have to use chroot, feel free to override it.
What privileges are those? How can I check if the environment provides them?
For handling RUN instructions, it's a combination of
- Being able to create a user namespace with multiple IDs mapped into it, or being started as UID 0 and having CAP_SYS_ADMIN, so that it doesn't need to do those things to set up a namespace where those things are true. If you're writing the pod spec, `hostUsers: false` may provide some of this.
- Being able to create bind and overlay mounts for the volumes that it provides (this generally requires CAP_SYS_ADMIN).
- Being able to chroot into the rootfs to make changes inside of it (CAP_SYS_CHROOT).
- Being able to configure networking for a namespace that it creates if "host" networking isn't specified. There's no reason to not use "host" networking when we're in a container, because from buildah's point of view, the container's network is the host network, but that's configurable, and the hard-coded defaults don't assume being run inside of a container.
- Being able to successfully execute a command using runc, or crun, or a comparable runtime that can be invoked similarly. That last part introduces some requirements of its own that we don't have control over.
Some of these operations can also be denied by the seccomp filter, or by the SELinux policy (or other mandatory access control rules), and it's entirely possible that I'm still forgetting some things. For me, it tends to be a trial-and-error process.
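To make that a bit more concrete, here are some rough probes you could run inside the (outer) container to see what you actually have; these only approximate the list above and are not exhaustive:

```bash
# Which capabilities did we end up with? (look for cap_sys_admin, cap_sys_chroot, cap_setfcap, ...)
grep -E '^Cap(Prm|Eff|Bnd)' /proc/self/status
capsh --decode="$(awk '/^CapEff/ {print $2}' /proc/self/status)" 2>/dev/null

# Can we create a user namespace with an ID mapping?
unshare --user --map-root-user true && echo "user namespaces: ok"

# Can we create a mount namespace and bind-mount inside it?
unshare --mount --map-root-user sh -c 'mount --bind /tmp /mnt' && echo "bind mounts: ok"

# What seccomp mode and SELinux context were we started with?
grep '^Seccomp' /proc/self/status
cat /proc/self/attr/current 2>/dev/null
```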
I've had some time to play with it. I ended up with a Pod definition that seemingly makes nested containerization possible with `BUILDAH_ISOLATION=oci`.

buildah-pod.yaml:
```yaml
apiVersion: v1
kind: Pod
metadata:
  generateName: buildah-
  labels:
    buildah-isolation-test: "true"
  annotations:
    # /dev/fuse fixes:
    #
    #   fuse: device not found, try 'modprobe fuse' first
    #
    # Wouldn't be needed with STORAGE_DRIVER=vfs
    io.kubernetes.cri-o.Devices: /dev/fuse
spec:
  restartPolicy: Never
  volumes:
    - name: workdir
      emptyDir: {}
  initContainers:
    - name: create-dockerfile
      image: quay.io/containers/buildah:v1.38.1
      volumeMounts:
        - name: workdir
          mountPath: /workdir
      workingDir: /workdir
      command: ["bash", "-c"]
      args:
        - |-
          cat << EOF > Dockerfile
          FROM docker.io/library/alpine:latest
          RUN echo "hello world"
          EOF
  containers:
    - name: buildah
      image: quay.io/containers/buildah:v1.38.1
      volumeMounts:
        - name: workdir
          mountPath: /workdir
      workingDir: /workdir
      env:
        - name: BUILDAH_ISOLATION
          value: oci
        - name: STORAGE_DRIVER
          value: overlay
      command: ["bash", "-c"]
      # unshare fixes:
      #
      #   error running container: from /usr/bin/crun ... opening file `/sys/fs/cgroup/cgroup.subtree_control` for writing: Read-only file system
      #
      # --mount fixes:
      #
      #   Error: mount /var/lib/containers/storage/overlay:/var/lib/containers/storage/overlay, flags: 0x1000: operation not permitted
      #
      # --map-root-user fixes:
      #
      #   unshare: unshare failed: Operation not permitted
      #
      # --net=host fixes:
      #
      #   error running container: from /usr/bin/crun ...: open `/proc/sys/net/ipv4/ping_group_range`: Read-only file system
      #
      # --pid=host fixes:
      #
      #   error running container: from /usr/bin/crun ...: mount `proc` to `proc`: Operation not permitted
      args:
        - |-
          # can also add --pid --fork to unshare
          unshare --map-root-user --mount -- buildah build --net=host --pid=host .
      securityContext:
        capabilities:
          add:
            # SETFCAP fixes:
            #
            #   unshare: write failed /proc/self/uid_map: Operation not permitted
            - SETFCAP
        seLinuxOptions:
          # container_runtime_t fixes:
          #
          #   error running container: from /usr/bin/crun ...: mount `devpts` to `dev/pts`: Permission denied
          type: container_runtime_t
```
Test with:

```bash
kubectl delete pod -l buildah-isolation-test=true
kubectl create -f buildah-pod.yaml
sleep 5
kubectl logs -l buildah-isolation-test=true --tail=-1 --follow
```
@nalind could you share your thoughts on the security implications of the settings I had to use:

- `--net=host --pid=host` for buildah
  - You mentioned `--net=host` would be OK to use, does the same apply for `--pid=host`?
  - Could also be combined with `unshare --pid --fork`, which may help mitigate potential implications of `--pid=host`?
- `SETFCAP` for the pod, to enable `unshare --map-root-user`
- the `container_runtime_t` SELinux label for the pod, to get around ``mount `devpts` to `dev/pts`: Permission denied`` from crun
When attempting to nest a container, the "host" namespaces are those being used by the container. If it runs, great. Aside: with kernel 5.11 or on RHEL 8.5 or later, you shouldn't need to bother with fuse-overlayfs. The kernel's overlay implementation is available and should be fine if storage is an emptyDir volume (or more specifically, not on an overlay filesystem, which is what the container rootfs is on), so you can probably add an emptyDir volume and drop anything that's there purely to make /dev/fuse available to the pod.
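Concretely, that suggestion could look something like the following change to the Pod above; a sketch, with a made-up volume name, assuming the node kernel is new enough for overlay-on-emptyDir as described:

```yaml
# Sketch: back buildah's storage with an emptyDir so overlay isn't nested on the
# container's own overlay rootfs, and drop the /dev/fuse bits entirely.
spec:
  volumes:
    - name: containers-storage        # hypothetical name
      emptyDir: {}
  containers:
    - name: buildah
      volumeMounts:
        - name: containers-storage
          mountPath: /var/lib/containers
# ...and remove the io.kubernetes.cri-o.Devices: /dev/fuse annotation.
```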
Here's what I'm using as a one-liner reference example that seems to work right now rootless (it will also work rootful but I think it needs more security for that, see below):
```bash
$ podman run -v /var/lib/containers --security-opt=label=disable --cap-add=all --rm -ti \
    quay.io/centos-bootc/centos-bootc:stream10 \
    podman run --rm -ti --net=host --cgroups=disabled \
    busybox echo hello
```
Of these, having to use `--security-opt=label=disable` on the outer container seems like a really important thing to fix. I find that surprising offhand; it looks like this must be something under dontaudit, as I don't see a corresponding AVC denial.
As far as security goes, I'd emphasize from my PoV that, as long as the outer container is invoked with a user namespace (as Nalin mentions, `hostUsers: false` in the pod spec), that provides a really key layer of protection. Using `unshare --map-root-user` inside the container is suboptimal in that it's trying to constrain subprocesses of the inner container from inside it.
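For reference, a minimal sketch of where that field lives in the pod spec (it only takes effect when the cluster enables the UserNamespacesSupport feature gate and the node runtime supports it, as comes up below):

```yaml
# Sketch: ask Kubernetes to run the pod in its own user namespace
# rather than the host's (cluster/node support required).
spec:
  hostUsers: false
```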
Thanks for the suggestions!
Adding `hostUsers: false` didn't break anything, added that to the Pod spec 👍 (also started a repo for this to track the changes better: https://github.com/chmeliik/buildah-isolation-test)
I tried removing the `unshare` command and adding `--cgroupns=host` to the buildah command, but that still failed on ``opening file `/sys/fs/cgroup/cgroup.subtree_control` for writing: Read-only file system``. So I'm keeping `unshare` for now.
Unfortunately, mounting `/var/lib/containers` doesn't seem to work for me, neither locally with `podman run` nor in Kubernetes. I still get this error despite the kernels seemingly being new enough (`5.14.0-284.100.1.el9_2.x86_64` on the Kubernetes node, `6.12.10-100.fc40.x86_64` locally):

```
... using mount program /usr/bin/fuse-overlayfs: unknown argument ignored: lazytime
fuse: device not found, try 'modprobe fuse' first
```
> Of these, having to use `--security-opt=label=disable` on the outer container seems like a really important thing to fix.
I found the labeling scary as well. It seems to work with `type: container_runtime_t` too, but I don't actually know what that means or how much better it is.
> ... using mount program /usr/bin/fuse-overlayfs: unknown argument ignored: lazytime
> fuse: device not found, try 'modprobe fuse' first

Wait, maybe that's just the config in the quay.io/containers/buildah image.
> Unfortunately, mounting `/var/lib/containers` doesn't seem to work for me, neither locally with `podman run` nor in Kubernetes. I still get this error despite the kernels seemingly being new enough (`5.14.0-284.100.1.el9_2.x86_64` on the Kubernetes node, `6.12.10-100.fc40.x86_64` locally):
>
> ... using mount program /usr/bin/fuse-overlayfs: unknown argument ignored: lazytime
> fuse: device not found, try 'modprobe fuse' first
The image contains an `/etc/containers/storage.conf` which is configured to use fuse-overlayfs. You'll need to comment out the `mount_program` setting and remove the `fsync=0` argument from the `mountopt` setting to update the configuration to not use fuse-overlayfs.
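If it helps, something along these lines applies both edits; this is a sketch that assumes the stock containers-storage.conf(5) key names (`mount_program` under `[storage.options.overlay]`, and `fsync=0` listed in `mountopt`):

```bash
# Sketch: stop routing overlay through fuse-overlayfs in /etc/containers/storage.conf.
sed -i \
    -e 's|^mount_program|#mount_program|' \
    -e 's|,*fsync=0,*||' \
    /etc/containers/storage.conf

# Double-check the result:
grep -E 'mount_program|mountopt' /etc/containers/storage.conf
```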
That helped! 🎉
Any thoughts on having to set `type: container_runtime_t`? I found it also works with `type: container_engine_t`; which one would be more appropriate?
Outside of a container, it's usually labeled container_runtime_exec_t, so I would expect container_runtime_t to be the preferred domain to be run in.
Yes, I think using `--security-opt=label=type:container_runtime_t` helps here. One thing that surprises me when I look is that the inner process is running with `spc_t` - that may be triggered by the `--cap-add=all`? Anyway, I think from a security point of view, by specifying just the type here we still get the level-based separation (I think) - the MCS categories at the end of the label. The useful property of the SELinux policy here is to ensure that two distinct containers can't touch each other's state (and to provide host protection too).
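A quick way to check which domain and MCS level the build processes actually end up in (just a probe; exact labels will vary by policy and runtime):

```bash
# Inspect SELinux contexts from inside the outer container:
cat /proc/self/attr/current      # e.g. system_u:system_r:container_runtime_t:s0:c123,c456
ps -eZ | grep -E 'buildah|crun'  # contexts of the nested build processes
```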
> Adding `hostUsers: false` didn't break anything, added that to the Pod spec 👍
Derp, I don't think it did anything at all. The cluster I was using to test probably doesn't enable the UserNamespacesSupport feature gate (https://kubernetes.io/docs/tasks/configure-pod-container/user-namespaces/#before-you-begin). And its nodes don't meet the requirements anyway.
I'll try to get a cluster with user namespaces actually enabled.
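One way to verify whether `hostUsers: false` actually took effect is to look at the pod's UID map; an identity mapping of the full range means the pod is still in the host's user namespace:

```bash
# Run inside the pod:
cat /proc/self/uid_map
#   "0 0 4294967295"                              -> no user namespace; hostUsers: false had no effect
#   a narrower mapping, e.g. "0 3222011904 65536" -> the pod has its own user namespace
```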
In any case, it seems it's possible to make `BUILDAH_ISOLATION=oci` work without user namespaces, but it requires:

- `unshare` usage
- `SETFCAP` on the Pod to enable `unshare`
- the `container_runtime_t` label on the Pod
That still seems preferable to using `BUILDAH_ISOLATION=chroot` without those things. What do you think?