multiarch/qemu-user-static not working on 1.23
This is a pretty strange one, but I wanted to raise it in case someone else was hitting it.
In a CI/CD context, we build multiarch images on x86-64 by running the multiarch/qemu-user-static image before building the arm64 and amd64 docker images.
We use this command to do this:
podman run --authfile /run/containers/0/auth-ecr.json --rm --privileged multiarch/qemu-user-static --reset -p yes
As far as I know this registers handlers for alternative executable formats in the kernel, so QEMU knows when to step in and emulate the commands. This had been working for quite a while, but when 1.23 was pushed out, this process broke.
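For anyone unfamiliar, here's a rough sketch of what that registration boils down to - the script writes entries to the kernel's binfmt_misc interface. The magic/mask values below are the standard aarch64 ELF ones; treat the paths and the single-architecture entry as illustrative, since the real script registers handlers for many architectures:

```shell
# Sketch of what "register --reset -p yes" does under the hood (illustrative;
# the real multiarch script loops over all supported architectures).
binfmt_register=/proc/sys/fs/binfmt_misc/register

# aarch64 ELF header magic and mask (e_machine = 0xb7 for AArch64).
magic='\x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\xb7\x00'
mask='\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff\xff'

# Entry format is :name:type:offset:magic:mask:interpreter:flags.
# The F (fix-binary) flag makes the kernel open the interpreter at
# registration time, so emulation keeps working inside containers that
# don't ship the QEMU binary themselves (that's the "-p yes" part).
entry=":qemu-aarch64:M::${magic}:${mask}:/usr/bin/qemu-aarch64-static:F"
printf '%s\n' "${entry}"

# On a real host (as root): printf '%s\n' "${entry}" > "${binfmt_register}"
```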
Expected results: image builds correctly
Actual results: image build fails with an exec format error
[2024-09-26T06:40:10.212Z] + buildah build --authfile /run/containers/0/auth-ecr.json --layers --format oci --pull --network=host --ulimit nofile=24000:24000 --squash --jobs 2 --platform=linux/amd64,linux/arm64 --manifest jobs-job-service:0.0.550-test-builds-gbucknel-b4-g2b0d41d .
[2024-09-26T06:40:23.636Z] process failed to start with error: fork/exec /bin/sh: exec format error
process exited with error: exec: not started
subprocess exited with status 1
[2024-09-26T06:40:30.904Z] Error: [linux/arm64]: building at STEP "RUN mkdir -p /usr/apps/service-config": exit status 1
script returned exit code 1
We were able to get everything working again by going back to 1.22.
Also, running the register script by hand in a sheltie session on the host seems to make arm64 builds work on 1.23 again, so I'd like to try doing the same thing in a bootstrap container.
Looking at the changelog, I'm a bit puzzled as to where the change could have been. I'll continue trying to figure it out; if anyone else has seen this, I'd be interested to hear.
I tried using a bootstrap container - it wasn't any different from executing it in a normal container. It looks like this is the issue:
https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.1.107
commit 53477032977930f459293b7c244c348d5667c574
Author: Christian Brauner <[email protected]>
Date: Thu Oct 28 12:31:13 2021 +0200
binfmt_misc: cleanup on filesystem umount
I guess it's because this is typically executed in a container - with this new kernel, once the container exits, the binfmt entries are unmounted, and so the arm64 commands I'm trying to run aren't recognized.
I guess if it was executed in some sort of non containerized bootstrap script, the entries would persist. Need to figure out if that would work.
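A quick way to confirm the symptom from a root shell on the host (e.g. via sheltie in the admin container) is to check whether the handler is still registered after the registration container has exited - this is just the obvious file check, nothing clever:

```shell
# Check whether the aarch64 binfmt handler survived the container exit.
handler=/proc/sys/fs/binfmt_misc/qemu-aarch64

if [ -f "${handler}" ]; then
  # When registration survived, this shows "enabled", the interpreter
  # path, and the magic/mask the kernel matches against.
  cat "${handler}"
else
  echo "qemu-aarch64 handler is not registered"
fi
```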
Nice work tracking this down!
> with this new kernel once the container exits, the binfmt entries are unmounted and thus the arm64 commands I'm trying to run aren't recognized.
This seems like a pretty clear regression, and one that breaks a couple of the popular "fire and forget" solutions for deploying binfmt support:
- https://github.com/tonistiigi/binfmt
- https://github.com/multiarch/qemu-user-static
I'll see about reporting this to LKML.
> I guess if it was executed in some sort of non containerized bootstrap script, the entries would persist. Need to figure out if that would work.
I'll also try patching host-ctr to pass in the host's /proc/sys/fs/binfmt_misc so that bootstrap containers can populate that.
> I'll also try patching host-ctr to pass in the host's /proc/sys/fs/binfmt_misc so that bootstrap containers can populate that.
This doesn't work, since it runs afoul of the checkProcMount safety check:
[ 74.666444] host-containers@admin[1381]: time="2024-09-27T17:33:30Z" level=fatal msg="failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting \"/proc/sys/fs/binfmt_misc\" to rootfs at \"/proc/sys/fs/binfmt_misc\": create mount destination for /proc/sys/fs/binfmt_misc mount: check proc-safety of /proc/sys/fs/binfmt_misc mount: \"/run/host-containerd/io.containerd.runtime.v2.task/default/admin/rootfs/proc/sys/fs/binfmt_misc\" cannot be mounted because it is inside /proc: unknown"
hi @bcressey, thanks for looking at this. I don't suppose there's a way to add a "non containerized" init script to a Bottlerocket node without building your own image? I had a look but couldn't find anything. It makes sense that there isn't another way to do it.
@gbucknel that's correct - containers are the only way to run custom code on Bottlerocket.
While poking at this, I noticed that the binfmt_misc filesystem isn't mounted on the host by default, because the binfmt feature is disabled. That's something that'd be good to fix, though I'm not sure if it would work around the new kernel's behavior on its own.
@gbucknel I was able to get the following bootstrap container working. The two challenges involved were that the host didn't have its own binfmt_misc mount already, and the SELinux labels for the qemu-*-static binaries were overly restrictive because they originated in a bootstrap container.
Dockerfile:
FROM multiarch/qemu-user-static
ADD binfmt-install ./
RUN chmod +x ./binfmt-install
ENTRYPOINT ["sh", "binfmt-install"]
binfmt-install:
#!/bin/bash
set -euxo pipefail
exec 1>&2
# Create the mount point on the host.
mkdir -p /.bottlerocket/rootfs/mnt/binfmt_misc
# Mount the binfmt_misc filesystem. It will propagate back to the host
# because this location is set up as an "rshared" mount.
mount binfmt_misc -t binfmt_misc /.bottlerocket/rootfs/mnt/binfmt_misc
# Bind mount the binfmt_misc filesystem to the expected location under
# /proc/sys/fs/binfmt_misc. Otherwise the QEMU script will mount a second
# copy.
mount --bind /.bottlerocket/rootfs/mnt/binfmt_misc /proc/sys/fs/binfmt_misc
# Because we're running as a bootstrap container, the QEMU binaries all
# have the "secret_t" SELinux label, which prevents unprivileged containers
# from mapping them into memory. Copy the binaries to a host path where they
# will have the "local_t" label instead, after removing any previous copies.
export QEMU_BIN_DIR=/.bottlerocket/rootfs/local/qemu/bin
mkdir -p "${QEMU_BIN_DIR}"
rm -f "${QEMU_BIN_DIR}"/qemu-*-static
cp /usr/bin/qemu-*-static "${QEMU_BIN_DIR}"
# Now run the registration script!
./register --reset -p yes
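If it helps, wiring this image up as a bootstrap container through user-data would look something like this (the container name and image URI are placeholders - push the image built from the Dockerfile above to your own registry):

```toml
[settings.bootstrap-containers.binfmt]
# Placeholder image URI.
source = "123456789012.dkr.ecr.us-east-1.amazonaws.com/binfmt-install:latest"
# Run on every boot so the handlers come back after reboots.
mode = "always"
essential = false
```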
@bcressey !!! That's so cool, I'll try it out, thank you!
Happy to help!
If it's easier to integrate, I expect it'd be possible to make this work in a k8s pod also, with a spec like this:
apiVersion: v1
kind: Pod
metadata:
  name: qemu-static
spec:
  volumes:
    - name: mnt-dir
      hostPath:
        path: /mnt
    - name: local-dir
      hostPath:
        path: /local
  containers:
    - name: qemu-static
      image: multiarch/qemu-user-static:latest
      command: [ "..." ]
      volumeMounts:
        # this provides an "rshared" mount to send mounts back to the host
        - mountPath: /.bottlerocket/rootfs/mnt
          name: mnt-dir
          mountPropagation: Bidirectional
        # this provides a mount with the "local_t" label for correct labeling
        - mountPath: /.bottlerocket/rootfs/local
          name: local-dir
      securityContext:
        privileged: true
@bcressey , thanks again, I've tested your container and it works well. Appreciate the pod yaml as well since I'm not sure if a bootstrap container is the right approach - I had quite a few nodes silently fail while playing with this today.
Was just wondering: if the binfmt feature were enabled in systemd for Bottlerocket, do you think this container wouldn't be required? Would that be a better solution? Perhaps not, because it isn't as secure?
I was testing with AL2023 yesterday just to try to reproduce the issue, and was surprised to find that it worked fine with the newer kernel and the multiarch container. Maybe that's because systemd is handling the state of the proc filesystem?
I'll close this issue if you think using the multiarch container is the right way forward here.
> Was just wondering, if the binfmt feature was enabled in systemd for bottlerocket, do you think this container wouldn't be required? Would that be a better solution? Perhaps not because it isn't as secure?
@gbucknel I'm planning to enable that on the Bottlerocket side. Not sure whether this would make it work in your environment or not, but it's worth a try. It sounds like it might help if AL2023 is working.
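For context, systemd's binfmt feature applies binfmt.d(5) fragments at boot via systemd-binfmt.service, using the same :name:type:offset:magic:mask:interpreter:flags format as the register interface. A fragment for arm64 would look roughly like this (values shown are the standard aarch64 ones; treat the exact entry and interpreter path as illustrative):

```
# /etc/binfmt.d/qemu-aarch64.conf (illustrative)
:qemu-aarch64:M::\x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\xb7\x00:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff\xff:/usr/bin/qemu-aarch64-static:F
```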
@bcressey oh that's awesome, let me know if I can test anything.
hi @bcressey, I was wondering if the change to systemd was targeted for 1.25? It doesn't seem like there's a change to the file you linked to.
> hi @bcressey, I was wondering if the change to systemd was targeted for 1.25? It doesn't seem like there's a change to the file you linked to.
@gbucknel It won't make the release train for 1.25; I need to finish writing a test for the SELinux policy changes, and also verify they are actually required.
Previously the host did not mount binfmt_misc at all, so writing to it required CAP_SYS_ADMIN in order to mount it. With that barrier gone, there needs to be some other check (beyond UID 0) or else the security posture will change.
For your binfmt_misc installer - can you confirm you're running it inside a privileged pod? The SELinux rule I've tested would effectively require that.
No problem, just thought I'd check.
At the moment, I run:
podman run --authfile /run/containers/0/auth-ecr.json --rm --privileged 307943323221.dkr.ecr.us-east-1.amazonaws.com/multiarch/qemu-user-static --reset -p yes
So yes, I am in a privileged pod. Thanks!
@bcressey hurrah! Thanks so much. Would this be in 1.27 ? If so will keep an eye out for it and report back here.
> @bcressey hurrah! Thanks so much. Would this be in 1.27 ? If so will keep an eye out for it and report back here.
Yup, it should go out with 1.27 which is planned for next week.
(There's a 1.26.2 release due out soon but the fix missed the train for that one.)
Thank you again @bcressey - we've tested this extensively and it's great. Was cool to be able to remove that "hack" from our build processes.
You're welcome, glad to hear it's working for you!