qemu-multiarch support
Multi-arch buildx builds currently do not work on the sysbox runtime due to lack of support for this feature.
As per this Slack message, this is actively being worked on, and this ticket is to track progress.
Cc @rodnymolina in case he has further info on where we are with running multi-arch builds with Sysbox.
To move forward with this, we first need to allow containerized processes to write into the /proc/sys/fs/binfmt_misc node in order to enable the different qemu arch-specific binaries to be invoked whenever required. And we need to do this securely.
Unfortunately, this isn't a trivial task, since this binfmt_misc node is a non-namespaced system-wide resource, so if we were to allow changes to it within a sysbox container, this could potentially open an attack vector on the overall system. That's just to say that we need to be careful here.
Notice that the solutions that currently exist to address this issue at the host level are not applicable to our case since they require the execution of --privileged containers to grant full access to the binfmt_misc node (something that we're trying to avoid as explained above).
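For context, those host-level solutions typically register the qemu interpreters by writing entries into the host's /proc/sys/fs/binfmt_misc/register from a privileged helper container; a typical invocation (shown purely for illustration) looks like:
$ docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
That --privileged requirement is exactly what Sysbox aims to avoid.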
I've found I can do multi-arch buildx builds on a dockerd running inside an unprivileged sysbox-runc container as long as the underlying OS where sysbox-runc is installed already has the qemu binaries registered under /proc/sys/fs/binfmt_misc/. Is this the expected behaviour?
If so, is the scope of this issue limited to just modifying or registering new binfmt_misc entries? In that case, I'd suggest we update the limitations page to clarify this.
Interesting, could you elaborate on how you installed the qemu binaries there?
Sure. This is a bit long, so I'll collapse it!
Details
(I found this post useful, I was kind of freestyling with ideas from it: https://medium.com/@artur.klauser/building-multi-architecture-docker-images-with-buildx-27d80f7e2408 .)
The top-level host OS is Ubuntu 20.04. I installed the qemu-user-static package there (along with binfmt-support, which is a recommended package for it).
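The exact install command would have been something like:
$ sudo apt-get install qemu-user-static binfmt-support
This gives me various architectures registered: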
$ ls /proc/sys/fs/binfmt_misc/
python3.8 qemu-alpha qemu-armeb qemu-hppa qemu-microblaze qemu-mips64 qemu-mipsel qemu-mipsn32el qemu-ppc64 qemu-ppc64le qemu-riscv64 qemu-sh4 qemu-sparc qemu-sparc64 qemu-xtensaeb status
qemu-aarch64 qemu-arm qemu-cris qemu-m68k qemu-mips qemu-mips64el qemu-mipsn32 qemu-ppc qemu-ppc64abi32 qemu-riscv32 qemu-s390x qemu-sh4eb qemu-sparc32plus qemu-xtensa register
$ cat /proc/sys/fs/binfmt_misc/qemu-aarch64
enabled
interpreter /usr/bin/qemu-aarch64-static
flags: OCF
offset 0
magic 7f454c460201010000000000000000000200b700
mask ffffffffffffff00fffffffffffffffffeffffff
I've got sysbox-runc 0.5.2 installed too, and dockerd 20.10.14.
I've run buildkitd under sysbox-runc successfully in two ways. I'm setting up buildkitd for use with a GitLab CI runner. I'd initially tried and failed to do multi-arch builds when running dockerd inside the GitLab CI runner's container (which is how I found this issue). After that I planned to set up a standalone buildkitd and share it with the runners as a remote builder; setting that up is how I noticed that it actually works under sysbox-runc (once the qemu binaries are installed on the top-level host OS):
$ docker network create buildkit-test
$ docker container run --rm --runtime sysbox-runc --name=remote-buildkitd --network buildkit-test moby/buildkit:latest --addr tcp://0.0.0.0:1234
...
So this container seems to inherit the same binfmt registrations as the host:
$ docker container exec -it remote-buildkitd sh
/ # ls /proc/sys/fs/binfmt_misc/
python3.8 qemu-arm qemu-hppa qemu-mips qemu-mipsel qemu-ppc qemu-ppc64le qemu-s390x qemu-sparc qemu-xtensa status
qemu-aarch64 qemu-armeb qemu-m68k qemu-mips64 qemu-mipsn32 qemu-ppc64 qemu-riscv32 qemu-sh4 qemu-sparc32plus qemu-xtensaeb
qemu-alpha qemu-cris qemu-microblaze qemu-mips64el qemu-mipsn32el qemu-ppc64abi32 qemu-riscv64 qemu-sh4eb qemu-sparc64 register
/ # cat /proc/sys/fs/binfmt_misc/qemu-aarch64
enabled
interpreter /usr/bin/qemu-aarch64-static
flags: OCF
offset 0
magic 7f454c460201010000000000000000000200b700
mask ffffffffffffff00fffffffffffffffffeffffff
The actual binaries don't exist in the container though:
/ # stat /usr/bin/qemu-aarch64-static
stat: can't stat '/usr/bin/qemu-aarch64-static': No such file or directory
Then I can run another container with docker CLI in that network and connect it to this builder as a remote builder:
$ docker container run --rm -it --network buildkit-test docker
/ # docker buildx create --name remote-buildkitd --driver remote tcp://remote-buildkitd:1234
remote-buildkitd
/ # docker buildx ls
NAME/NODE DRIVER/ENDPOINT STATUS BUILDKIT PLATFORMS
remote-buildkitd remote
remote-buildkitd0 tcp://remote-buildkitd:1234 running linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/arm64, linux/riscv64, linux/ppc64le, linux/s390x, linux/386, linux/mips64le, linux/mips64, linux/arm/v7, linux/arm/v6
default * error
Cannot load builder default *: error during connect: Get "http://docker:2375/_ping": dial tcp: lookup docker on 172.25.3.1:53: server misbehaving
And I just made a simple Dockerfile to do a test build, I think it was just:
FROM alpine
RUN echo "Hi from $(uname -a)"
And this builds successfully for both arm64 and amd64:
/tmp/docker # docker buildx use remote-buildkitd
/tmp/docker # ls
Dockerfile
/tmp/docker # docker buildx build --platform linux/arm64,linux/amd64 --progress=plain .
WARNING: No output specified with remote driver. Build result will only remain in the build cache. To push result image into registry use --push or to load image into docker use --load
#1 [internal] load .dockerignore
#1 transferring context: 2B 0.0s done
#1 DONE 0.1s
#2 [internal] load build definition from Dockerfile
#2 transferring dockerfile: 88B 0.0s done
#2 DONE 0.1s
#3 [linux/arm64 internal] load metadata for docker.io/library/alpine:latest
#3 DONE 2.5s
#4 [linux/amd64 internal] load metadata for docker.io/library/alpine:latest
#4 DONE 2.5s
#5 [linux/arm64 1/2] FROM docker.io/library/alpine:latest@sha256:8914eb54f968791faf6a8638949e480fef81e697984fba772b3976835194c6d4
#5 resolve docker.io/library/alpine:latest@sha256:8914eb54f968791faf6a8638949e480fef81e697984fba772b3976835194c6d4 0.0s done
#5 sha256:261da4162673b93e5c0e7700a3718d40bcc086dbf24b1ec9b54bca0b82300626 3.26MB / 3.26MB 0.2s done
#5 extracting sha256:261da4162673b93e5c0e7700a3718d40bcc086dbf24b1ec9b54bca0b82300626
#5 extracting sha256:261da4162673b93e5c0e7700a3718d40bcc086dbf24b1ec9b54bca0b82300626 0.2s done
#5 DONE 0.4s
#6 [linux/amd64 1/2] FROM docker.io/library/alpine:latest@sha256:8914eb54f968791faf6a8638949e480fef81e697984fba772b3976835194c6d4
#6 resolve docker.io/library/alpine:latest@sha256:8914eb54f968791faf6a8638949e480fef81e697984fba772b3976835194c6d4 0.0s done
#6 sha256:c158987b05517b6f2c5913f3acef1f2182a32345a304fe357e3ace5fadcad715 3.37MB / 3.37MB 0.3s done
#6 extracting sha256:c158987b05517b6f2c5913f3acef1f2182a32345a304fe357e3ace5fadcad715
#6 extracting sha256:c158987b05517b6f2c5913f3acef1f2182a32345a304fe357e3ace5fadcad715 0.1s done
#6 DONE 0.4s
#7 [linux/amd64 2/2] RUN echo "Hi from $(uname -a)"
#0 0.070 Hi from Linux buildkitsandbox 5.4.0-135-generic #152-Ubuntu SMP Wed Nov 23 20:19:22 UTC 2022 x86_64 Linux
#7 DONE 0.2s
#8 [linux/arm64 2/2] RUN echo "Hi from $(uname -a)"
#0 0.135 Hi from Linux buildkitsandbox 5.4.0-135-generic #152-Ubuntu SMP Wed Nov 23 20:19:22 UTC 2022 aarch64 Linux
#8 DONE 0.2s
So that's fine. But with the qemu binaries installed on the host, I'm able to just start a buildkit builder on a dockerd running inside a CI build container. The CI runner uses the Docker executor. My config for the Docker executor is this:
# ...
[runners.docker]
tls_verify = true
image = "ubuntu:20.04"
privileged = false
disable_entrypoint_overwrite = false
oom_kill_disable = false
disable_cache = false
volumes = ["/certs/client", "/cache", "/var/lib/docker"]
shm_size = 0
runtime = "sysbox-runc"
extra_hosts = ["gitlab-ci-minio:host-gateway"]
(I guess I probably shouldn't share /var/lib/docker with the runners...). My CI job that does the buildx build uses an image I created that's basically https://github.com/nestybox/dockerfiles/tree/master/alpine-supervisord-docker (it contains its own dockerd). In the CI job, all I do is:
$ docker buildx create --use
$ docker buildx bake --push
So buildkitd is running under the dockerd running in the CI job's container. And this builds a real image with plenty of stuff in it, not just the toy Dockerfile above.
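(For readers unfamiliar with bake: that bake target boils down to an explicit multi-platform build; the image name and platform list below are illustrative.)
$ docker buildx build --platform linux/amd64,linux/arm64 --push -t registry.example.com/my-image:latest .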
Thank you @h4l!
With some minor modifications we got it working in our Buildkite environment as well!
All our modifications were tweaks around using an existing per-job network, as well as loading up the buildx configuration in every build's environment hooks :)
@mhornbacher Nice job, that's great, glad you got it working!
@rodnymolina @ctalledo I think this is a fairly workable solution for most customers. Big shoutout to @h4l for his discoveries!
For anyone stumbling upon this via a search:
- Install qemu-user-static and binfmt-support, which fixes running moby/buildkit under sysbox.
- Expose moby/buildkit to your container via the docker network (either via a custom bridge network or the default one) and use it as a remote builder.
Implementation details are in the above two comments 👍
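Condensed into commands (the network and builder names are illustrative; see the comments above for the full detail):
# 1) On the host OS where sysbox-runc is installed: install qemu-user-static and binfmt-support (see above).
# 2) Run buildkitd under sysbox-runc and expose it on a docker network:
$ docker network create buildkit-net
$ docker run -d --runtime=sysbox-runc --name remote-buildkitd --network buildkit-net moby/buildkit:latest --addr tcp://0.0.0.0:1234
# 3) From any container on that network, register it as a remote builder:
$ docker buildx create --name remote-buildkitd --driver remote --use tcp://remote-buildkitd:1234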
Feel free to @ me with a comment here if you need more explanation
Thanks @mhornbacher for your contribution to resolving this issue, much appreciated!
Just a heads-up that the workaround mentioned above to allow multi-arch builds within a sysbox container is not working as of v0.6.2 release (thanks to @DekusDenial for letting us know).
This is a consequence of a recent change to ensure that Sysbox exposes all the procfs and sysfs nodes known within the container's namespaces. As a side-effect, we stopped exposing a few host nodes within sysbox containers, which on one hand offers greater security, but on the other breaks functionality like that required by the above workaround.
As a fix, we could make an exception for these nodes by creating a new 'handler' in sysbox-fs to expose the host's /proc/sys/fs/binfmt_misc nodes. We will take this into account for the next release (ping us if you can't wait).
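In the meantime, a quick way to check whether a given Sysbox version still exposes the host's entries (per the reports here, on v0.6.2 and later the listing comes back without the host's qemu entries):
$ docker run --rm --runtime=sysbox-runc alpine ls /proc/sys/fs/binfmt_misc/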
Thanks. Pinging @jmandel1027 :)
I have the same problem running CI/CD; is there any workaround? Thanks.
You can use version 0.6.1 with the workarounds detailed above if you need multi-arch builds for now. This is no longer my day-to-day, so I am not on top of any 0.6.2 workarounds yet.
I am using Gitea with a self-hosted act_runner; it does not seem to use moby/buildkit at all.
- name: Checkout
  uses: https://github.com/actions/checkout@v3
- name: Set up QEMU
  uses: https://github.com/docker/setup-qemu-action@v2
- name: Set up Docker Buildx
  uses: https://github.com/docker/setup-buildx-action@v2
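(For context, the Set up QEMU step effectively runs the binfmt installer in a privileged helper container, roughly as below, which is the part an unprivileged Sysbox container cannot do; shown for illustration only:)
$ docker run --rm --privileged tonistiigi/binfmt --install all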
Hi @DekusDenial,
Will the upcoming 0.6.3 release cover the fix for https://github.com/nestybox/sysbox/issues/592#issuecomment-1611783474?
I am not sure; I see a few related commits, but nothing that specifically addresses that issue. How can I reproduce the problem?
I enabled qemu on a host and ran a sysbox container, but /proc/sys/fs/binfmt_misc was empty and didn't inherit from the host.
I see; no, unfortunately the ability to expose (i.e., namespace) binfmt_misc inside the Sysbox container is not present in this release.
And exposing the host's binfmt_misc inside the container (as Sysbox used to do by mistake) is not a good solution because it's a global resource in the system (i.e., if a container modifies the binary associated with a file-type, all other containers would be affected by the change). Ideally binfmt_misc needs to be per-Sysbox container. It's not a simple task because it requires Sysbox to emulate binfmt_misc inside the container. It can be done but it's a difficult task, and we've not yet had the chance to work on it unfortunately.
That's the reason we still pin to 0.6.1, given that we have no other means to enable qemu for multi-arch workloads.
Does 0.6.1 support it out of the box?
Got it; would the work-around in the comment above help?
Unfortunately not, because our setup and use case are not the same as above; based on the reporter's comment above, this issue also prevented them from moving to 0.6.2, so the workaround may no longer work.
I see; not sure how to help then: we can't go back to the v0.6.1 behavior because it breaks container isolation (i.e., it allows a container to modify a globally shared system resource, the binfmt_misc subsystem). But on the other hand the proper fix is a heavy lift.
The only thing I can think of is adding a config option in Sysbox that allows the container to access the host's binfmt_misc; the config would be set per-container, via an env variable (e.g., docker run --runtime=sysbox-runc -e SYSBOX_EXPOSE_HOST_BINFMT_MISC ...). This way the user can choose whether to do this, knowing that container isolation is reduced.
I think most people would prefer this config as a workaround for qemu. FYI, I used to have workloads scheduled on kata-containers, where users can register qemu on demand via privileged docker, but this won't work inside a sysbox container; that's why I have been relying on the host to pre-provide qemu.
Meanwhile, if you can point to the sources where this config would be implemented, or otherwise to where this binfmt_misc resource would be excluded from isolation, people can patch it on their side.
It's a bit more complicated; the work would be in sysbox-fs (the component that emulates portions of /proc and /sys inside the container).
For the emulated files or directories within /proc and /sys, sysbox-fs has a concept of a "handler" that performs the emulation for a file or a directory. We would need to create a new handler for /proc/sys/fs/binfmt_misc, and that handler would expose the host's binfmt_misc into the container. The code for the other handlers is here.
In addition, we would need to add the code that enables the feature on a per-container basis. That requires changes in sysbox-runc and the transport that sends the config to sysbox-fs.
It's not super difficult, but it's not a simple change either. For someone who knows the code, it's a few days of work (we also have to write the tests). If you wish to contribute (we appreciate that!), let me know and I can provide more orientation.
Otherwise it will have to wait until we have the cycles, as we balance Sysbox development & maintenance with other work at Docker.
I really look forward to this feature being available soon.