multiarch/qemu-user-static not working on 1.23
This is a pretty strange one, but I wanted to raise it in case someone else was hitting it.
In a CI/CD context, we build multiarch images on x86-64 by running the multiarch/qemu-user-static image before building the arm64 and amd64 docker images.
We use this command to do this:
podman run --authfile /run/containers/0/auth-ecr.json --rm --privileged multiarch/qemu-user-static --reset -p yes
As far as I know this registers handlers for alternative executable formats in the kernel, so QEMU knows when to step in and emulate the commands. This had been working for quite a while, but when 1.23 was pushed out, this process broke.
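For anyone unfamiliar, here's a rough sketch of what that registration boils down to - the script writes entries to the kernel's binfmt_misc interface. The magic/mask values below are the standard aarch64 ELF ones; treat the paths and the single-architecture entry as illustrative, since the real script registers handlers for many architectures:

```shell
# Sketch of what "register --reset -p yes" does under the hood (illustrative;
# the real multiarch script loops over all supported architectures).
binfmt_register=/proc/sys/fs/binfmt_misc/register

# aarch64 ELF header magic and mask (e_machine = 0xb7 for AArch64).
magic='\x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\xb7\x00'
mask='\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff\xff'

# Entry format is :name:type:offset:magic:mask:interpreter:flags.
# The F (fix-binary) flag makes the kernel open the interpreter at
# registration time, so emulation keeps working inside containers that
# don't ship the QEMU binary themselves (that's the "-p yes" part).
entry=":qemu-aarch64:M::${magic}:${mask}:/usr/bin/qemu-aarch64-static:F"
printf '%s\n' "${entry}"

# On a real host (as root): printf '%s\n' "${entry}" > "${binfmt_register}"
```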
Expected results: image builds correctly
Actual results: image build fails with an exec format error
[2024-09-26T06:40:10.212Z] + buildah build --authfile /run/containers/0/auth-ecr.json --layers --format oci --pull --network=host --ulimit nofile=24000:24000 --squash --jobs 2 --platform=linux/amd64,linux/arm64 --manifest jobs-job-service:0.0.550-test-builds-gbucknel-b4-g2b0d41d .
[2024-09-26T06:40:23.636Z] process failed to start with error: fork/exec /bin/sh: exec format error
process exited with error: exec: not started
subprocess exited with status 1
[2024-09-26T06:40:30.904Z] Error: [linux/arm64]: building at STEP "RUN mkdir -p /usr/apps/service-config": exit status 1
script returned exit code 1
We were able to get everything working again by going back to 1.22.
Also, running the register script by hand in a sheltie session on the host seems to make arm64 builds work on 1.23 again, so I'd like to try doing the same thing in a bootstrap container.
Looking at the changelog, I'm a bit puzzled as to where the change could have been. I'll continue trying to figure it out; if anyone else has seen this, I'd be interested to hear.
I tried using a bootstrap container - it wasn't any different from executing it in a normal container. It looks like this is the issue:
https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.1.107
commit 53477032977930f459293b7c244c348d5667c574
Author: Christian Brauner <[email protected]>
Date: Thu Oct 28 12:31:13 2021 +0200
binfmt_misc: cleanup on filesystem umount
I guess it's because this is typically executed in a container - with this new kernel, once the container exits, the binfmt entries are unmounted, and so the arm64 commands I'm trying to run aren't recognized.
I guess if it was executed in some sort of non containerized bootstrap script, the entries would persist. Need to figure out if that would work.
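A quick way to confirm the symptom from a root shell on the host (e.g. via sheltie in the admin container) is to check whether the handler is still registered after the registration container has exited - this is just the obvious file check, nothing clever:

```shell
# Check whether the aarch64 binfmt handler survived the container exit.
handler=/proc/sys/fs/binfmt_misc/qemu-aarch64

if [ -f "${handler}" ]; then
  # When registration survived, this shows "enabled", the interpreter
  # path, and the magic/mask the kernel matches against.
  cat "${handler}"
else
  echo "qemu-aarch64 handler is not registered"
fi
```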
Nice work tracking this down!
> with this new kernel once the container exits, the binfmt entries are unmounted and thus the arm64 commands I'm trying to run aren't recognized.
This seems like a pretty clear regression, and one that breaks a couple of the popular "fire and forget" solutions for deploying binfmt support:
- https://github.com/tonistiigi/binfmt
- https://github.com/multiarch/qemu-user-static
I'll see about reporting this to LKML.
> I guess if it was executed in some sort of non containerized bootstrap script, the entries would persist. Need to figure out if that would work.
I'll also try patching host-ctr to pass in the host's /proc/sys/fs/binfmt_misc so that bootstrap containers can populate that.
> I'll also try patching host-ctr to pass in the host's /proc/sys/fs/binfmt_misc so that bootstrap containers can populate that.
This doesn't work, since it runs afoul of the checkProcMount safety check:
[ 74.666444] host-containers@admin[1381]: time="2024-09-27T17:33:30Z" level=fatal msg="failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting \"/proc/sys/fs/binfmt_misc\" to rootfs at \"/proc/sys/fs/binfmt_misc\": create mount destination for /proc/sys/fs/binfmt_misc mount: check proc-safety of /proc/sys/fs/binfmt_misc mount: \"/run/host-containerd/io.containerd.runtime.v2.task/default/admin/rootfs/proc/sys/fs/binfmt_misc\" cannot be mounted because it is inside /proc: unknown"
hi @bcressey, thanks for looking at this. I don't suppose there's a way to add a "non containerized" init script to a Bottlerocket node without building your own image? I had a look but couldn't find anything. It makes sense that there isn't another way to do it.
@gbucknel that's correct - containers are the only way to run custom code on Bottlerocket.
While poking at this, I noticed that the binfmt_misc filesystem isn't mounted on the host by default, because the binfmt feature is disabled. That's something that'd be good to fix, though I'm not sure if it would work around the new kernel's behavior on its own.
@gbucknel I was able to get the following bootstrap container working. The two challenges involved were that the host didn't have its own binfmt_misc mount already, and the SELinux labels for the qemu-*-static binaries were overly restrictive because they originated in a bootstrap container.
Dockerfile:
FROM multiarch/qemu-user-static
ADD binfmt-install ./
RUN chmod +x ./binfmt-install
ENTRYPOINT ["sh", "binfmt-install"]
binfmt-install:
#!/bin/bash
set -euxo pipefail
exec 1>&2
# Create the mount point on the host.
mkdir -p /.bottlerocket/rootfs/mnt/binfmt_misc
# Mount the binfmt_misc filesystem. It will propagate back to the host
# because this location is set up as an "rshared" mount.
mount binfmt_misc -t binfmt_misc /.bottlerocket/rootfs/mnt/binfmt_misc
# Bind mount the binfmt_misc filesystem to the expected location under
# /proc/sys/fs/binfmt_misc. Otherwise the QEMU script will mount a second
# copy.
mount --bind /.bottlerocket/rootfs/mnt/binfmt_misc /proc/sys/fs/binfmt_misc
# Because we're running as a bootstrap container, the QEMU binaries all
# have the "secret_t" SELinux label, which prevents unprivileged containers
# from mapping them into memory. Copy the binaries to a host path where they
# will have the "local_t" label instead, after removing any previous copies.
export QEMU_BIN_DIR=/.bottlerocket/rootfs/local/qemu/bin
mkdir -p "${QEMU_BIN_DIR}"
rm -f "${QEMU_BIN_DIR}"/qemu-*-static
cp /usr/bin/qemu-*-static "${QEMU_BIN_DIR}"
# Now run the registration script!
./register --reset -p yes
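If it helps, wiring this image up as a bootstrap container through user-data would look something like this (the container name and image URI are placeholders - push the image built from the Dockerfile above to your own registry):

```toml
[settings.bootstrap-containers.binfmt]
# Placeholder image URI.
source = "123456789012.dkr.ecr.us-east-1.amazonaws.com/binfmt-install:latest"
# Run on every boot so the handlers come back after reboots.
mode = "always"
essential = false
```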
@bcressey !!! That's so cool, I'll try it out, thank you!
Happy to help!
If it's easier to integrate, I expect it'd be possible to make this work in a k8s pod also, with a spec like this:
apiVersion: v1
kind: Pod
metadata:
  name: qemu-static
spec:
  volumes:
    - name: mnt-dir
      hostPath:
        path: /mnt
    - name: local-dir
      hostPath:
        path: /local
  containers:
    - name: qemu-static
      image: multiarch/qemu-user-static:latest
      command: [ "..." ]
      volumeMounts:
        # this provides an "rshared" mount to send mounts back to the host
        - mountPath: /.bottlerocket/rootfs/mnt
          name: mnt-dir
          mountPropagation: Bidirectional
        # this provides a mount with the "local_t" label for correct labeling
        - mountPath: /.bottlerocket/rootfs/local
          name: local-dir
      securityContext:
        privileged: true
@bcressey , thanks again, I've tested your container and it works well. Appreciate the pod yaml as well since I'm not sure if a bootstrap container is the right approach - I had quite a few nodes silently fail while playing with this today.
Was just wondering: if the binfmt feature were enabled in systemd for Bottlerocket, do you think this container wouldn't be required? Would that be a better solution? Perhaps not, because it isn't as secure?
I was testing with AL2023 yesterday just to try to reproduce the issue, and was surprised to find that it worked fine with the newer kernel and the multiarch container. Maybe that's because systemd is handling the state of the proc filesystem?
I'll close this issue if you think using the multiarch container is the right way forward here.
> Was just wondering, if the binfmt feature was enabled in systemd for bottlerocket, do you think this container wouldn't be required? Would that be a better solution? Perhaps not because it isn't as secure?
@gbucknel I'm planning to enable that on the Bottlerocket side. Not sure whether this would make it work in your environment or not, but it's worth a try. It sounds like it might help if AL2023 is working.
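For context, systemd's binfmt feature applies binfmt.d(5) fragments at boot via systemd-binfmt.service, using the same :name:type:offset:magic:mask:interpreter:flags format as the register interface. A fragment for arm64 would look roughly like this (values shown are the standard aarch64 ones; treat the exact entry and interpreter path as illustrative):

```
# /etc/binfmt.d/qemu-aarch64.conf (illustrative)
:qemu-aarch64:M::\x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\xb7\x00:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff\xff:/usr/bin/qemu-aarch64-static:F
```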
@bcressey oh that's awesome, let me know if I can test anything.
hi @bcressey, I was wondering if the change to systemd was targeted for 1.25? It doesn't seem like there's a change to the file you linked to.
> hi @bcressey, I was wondering if the change to systemd was targeted for 1.25? It doesn't seem like there's a change to the file you linked to.
@gbucknel It won't make the release train for 1.25; I need to finish writing a test for the SELinux policy changes, and also verify they are actually required.
Previously the host did not mount binfmt_misc at all, so writing to it required CAP_SYS_ADMIN in order to mount it. With that barrier gone, there needs to be some other check (beyond UID 0) or else the security posture will change.
For your binfmt_misc installer - can you confirm you're running it inside a privileged pod? The SELinux rule I've tested would effectively require that.
No problem, just thought I'd check.
At the moment, I run:
podman run --authfile /run/containers/0/auth-ecr.json --rm --privileged 307943323221.dkr.ecr.us-east-1.amazonaws.com/multiarch/qemu-user-static --reset -p yes
So yes, I am in a privileged pod. Thanks!
@bcressey hurrah! Thanks so much. Would this be in 1.27 ? If so will keep an eye out for it and report back here.
> @bcressey hurrah! Thanks so much. Would this be in 1.27 ? If so will keep an eye out for it and report back here.
Yup, it should go out with 1.27 which is planned for next week.
(There's a 1.26.2 release due out soon but the fix missed the train for that one.)
Thank you again @bcressey - we've tested this extensively and it's great. Was cool to be able to remove that "hack" from our build processes.
You're welcome, glad to hear it's working for you!