qemu-multiarch support
Multi-arch buildx builds currently do not work on the sysbox runtime due to lack of support for this feature.
As per this Slack message, this is actively being worked on, and this ticket is to track progress.
Cc @rodnymolina in case he has further info on where we are with running multi-arch builds with Sysbox.
To move forward with this, we first need to allow containerized processes to write into the /proc/sys/fs/binfmt_misc node in order to enable the different qemu arch-specific binaries to be invoked whenever required. And we need to do this securely.
Unfortunately, this isn't a trivial task, since this binfmt_misc node is a non-namespaced system-wide resource, so if we were to allow changes to it within a sysbox container, this could potentially open an attack vector on the overall system. That's just to say that we need to be careful here.
Notice that the solutions that currently exist to address this issue at the host level are not applicable to our case since they require the execution of --privileged containers to grant full access to the binfmt_misc node (something that we're trying to avoid as explained above).
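For context, those host-level solutions typically register the qemu interpreters by writing entries into the host's /proc/sys/fs/binfmt_misc/register from a privileged helper container; a typical invocation (shown purely for illustration) looks like:
$ docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
That --privileged requirement is exactly what Sysbox aims to avoid.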
I've found I can do multi-arch buildx builds on a dockerd running inside an unprivileged sysbox-runc container as long as the underlying OS where sysbox-runc is installed already has the qemu binaries registered under /proc/sys/fs/binfmt_misc/. Is this the expected behaviour?
If so, is the scope of this issue limited to just modifying or registering new binfmt_misc entries? In that case, I'd suggest we update the limitations page to clarify this.
Interesting, could you elaborate on how you installed the qemu binaries there?
Sure. This is a bit long, so I'll collapse it!
Details
(I found this post useful, I was kind of freestyling with ideas from it: https://medium.com/@artur.klauser/building-multi-architecture-docker-images-with-buildx-27d80f7e2408 .)
The top-level host OS is Ubuntu 20.04. I installed the qemu-user-static package there (along with binfmt-support, which is a recommended package for it).
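The exact install command would have been something like:
$ sudo apt-get install qemu-user-static binfmt-support
This gives me various architectures registered: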
$ ls /proc/sys/fs/binfmt_misc/
python3.8 qemu-alpha qemu-armeb qemu-hppa qemu-microblaze qemu-mips64 qemu-mipsel qemu-mipsn32el qemu-ppc64 qemu-ppc64le qemu-riscv64 qemu-sh4 qemu-sparc qemu-sparc64 qemu-xtensaeb status
qemu-aarch64 qemu-arm qemu-cris qemu-m68k qemu-mips qemu-mips64el qemu-mipsn32 qemu-ppc qemu-ppc64abi32 qemu-riscv32 qemu-s390x qemu-sh4eb qemu-sparc32plus qemu-xtensa register
$ cat /proc/sys/fs/binfmt_misc/qemu-aarch64
enabled
interpreter /usr/bin/qemu-aarch64-static
flags: OCF
offset 0
magic 7f454c460201010000000000000000000200b700
mask ffffffffffffff00fffffffffffffffffeffffff
I've got sysbox-runc 0.5.2 installed too, and dockerd 20.10.14.
I've run buildkitd under sysbox-runc successfully in two ways. I'm setting up buildkitd for use with a GitLab CI runner. I'd initially tried and failed to do multi-arch builds when running dockerd inside the GitLab CI runner's container (which is how I found this issue). After that I planned to set up a standalone buildkitd and share it with the runners as a remote builder; setting that up is how I noticed that it actually works under sysbox-runc (once the qemu binaries are installed on the top-level host OS):
$ docker network create buildkit-test
$ docker container run --rm --runtime sysbox-runc --name=remote-buildkitd --network buildkit-test moby/buildkit:latest --addr tcp://0.0.0.0:1234
...
So this container seems to inherit the same binfmt registrations as the host:
$ docker container exec -it remote-buildkitd sh
/ # ls /proc/sys/fs/binfmt_misc/
python3.8 qemu-arm qemu-hppa qemu-mips qemu-mipsel qemu-ppc qemu-ppc64le qemu-s390x qemu-sparc qemu-xtensa status
qemu-aarch64 qemu-armeb qemu-m68k qemu-mips64 qemu-mipsn32 qemu-ppc64 qemu-riscv32 qemu-sh4 qemu-sparc32plus qemu-xtensaeb
qemu-alpha qemu-cris qemu-microblaze qemu-mips64el qemu-mipsn32el qemu-ppc64abi32 qemu-riscv64 qemu-sh4eb qemu-sparc64 register
/ # cat /proc/sys/fs/binfmt_misc/qemu-aarch64
enabled
interpreter /usr/bin/qemu-aarch64-static
flags: OCF
offset 0
magic 7f454c460201010000000000000000000200b700
mask ffffffffffffff00fffffffffffffffffeffffff
The actual binaries don't exist in the container though:
/ # stat /usr/bin/qemu-aarch64-static
stat: can't stat '/usr/bin/qemu-aarch64-static': No such file or directory
Then I can run another container with docker CLI in that network and connect it to this builder as a remote builder:
$ docker container run --rm -it --network buildkit-test docker
/ # docker buildx create --name remote-buildkitd --driver remote tcp://remote-buildkitd:1234
remote-buildkitd
/ # docker buildx ls
NAME/NODE DRIVER/ENDPOINT STATUS BUILDKIT PLATFORMS
remote-buildkitd remote
remote-buildkitd0 tcp://remote-buildkitd:1234 running linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/arm64, linux/riscv64, linux/ppc64le, linux/s390x, linux/386, linux/mips64le, linux/mips64, linux/arm/v7, linux/arm/v6
default * error
Cannot load builder default *: error during connect: Get "http://docker:2375/_ping": dial tcp: lookup docker on 172.25.3.1:53: server misbehaving
And I just made a simple Dockerfile to do a test build, I think it was just:
FROM alpine
RUN echo "Hi from $(uname -a)"
And this builds successfully for both arm64 and amd64:
/tmp/docker # docker buildx use remote-buildkitd
/tmp/docker # ls
Dockerfile
/tmp/docker # docker buildx build --platform linux/arm64,linux/amd64 --progress=plain .
WARNING: No output specified with remote driver. Build result will only remain in the build cache. To push result image into registry use --push or to load image into docker use --load
#1 [internal] load .dockerignore
#1 transferring context: 2B 0.0s done
#1 DONE 0.1s
#2 [internal] load build definition from Dockerfile
#2 transferring dockerfile: 88B 0.0s done
#2 DONE 0.1s
#3 [linux/arm64 internal] load metadata for docker.io/library/alpine:latest
#3 DONE 2.5s
#4 [linux/amd64 internal] load metadata for docker.io/library/alpine:latest
#4 DONE 2.5s
#5 [linux/arm64 1/2] FROM docker.io/library/alpine:latest@sha256:8914eb54f968791faf6a8638949e480fef81e697984fba772b3976835194c6d4
#5 resolve docker.io/library/alpine:latest@sha256:8914eb54f968791faf6a8638949e480fef81e697984fba772b3976835194c6d4 0.0s done
#5 sha256:261da4162673b93e5c0e7700a3718d40bcc086dbf24b1ec9b54bca0b82300626 3.26MB / 3.26MB 0.2s done
#5 extracting sha256:261da4162673b93e5c0e7700a3718d40bcc086dbf24b1ec9b54bca0b82300626
#5 extracting sha256:261da4162673b93e5c0e7700a3718d40bcc086dbf24b1ec9b54bca0b82300626 0.2s done
#5 DONE 0.4s
#6 [linux/amd64 1/2] FROM docker.io/library/alpine:latest@sha256:8914eb54f968791faf6a8638949e480fef81e697984fba772b3976835194c6d4
#6 resolve docker.io/library/alpine:latest@sha256:8914eb54f968791faf6a8638949e480fef81e697984fba772b3976835194c6d4 0.0s done
#6 sha256:c158987b05517b6f2c5913f3acef1f2182a32345a304fe357e3ace5fadcad715 3.37MB / 3.37MB 0.3s done
#6 extracting sha256:c158987b05517b6f2c5913f3acef1f2182a32345a304fe357e3ace5fadcad715
#6 extracting sha256:c158987b05517b6f2c5913f3acef1f2182a32345a304fe357e3ace5fadcad715 0.1s done
#6 DONE 0.4s
#7 [linux/amd64 2/2] RUN echo "Hi from $(uname -a)"
#0 0.070 Hi from Linux buildkitsandbox 5.4.0-135-generic #152-Ubuntu SMP Wed Nov 23 20:19:22 UTC 2022 x86_64 Linux
#7 DONE 0.2s
#8 [linux/arm64 2/2] RUN echo "Hi from $(uname -a)"
#0 0.135 Hi from Linux buildkitsandbox 5.4.0-135-generic #152-Ubuntu SMP Wed Nov 23 20:19:22 UTC 2022 aarch64 Linux
#8 DONE 0.2s
So that's fine. But with the qemu binaries installed on the host, I'm able to just start a buildkit builder on a dockerd running inside a CI build container. The CI runner uses the Docker executor. My config for the Docker executor is this:
# ...
[runners.docker]
tls_verify = true
image = "ubuntu:20.04"
privileged = false
disable_entrypoint_overwrite = false
oom_kill_disable = false
disable_cache = false
volumes = ["/certs/client", "/cache", "/var/lib/docker"]
shm_size = 0
runtime = "sysbox-runc"
extra_hosts = ["gitlab-ci-minio:host-gateway"]
(I guess I probably shouldn't share /var/lib/docker with the runners...). My CI job that does the buildx build uses an image I created that's basically https://github.com/nestybox/dockerfiles/tree/master/alpine-supervisord-docker (it contains its own dockerd). In the CI job, all I do is:
$ docker buildx create --use
$ docker buildx bake --push
So buildkitd is running under the dockerd running in the CI job's container. And this builds a real image with plenty of stuff in it, not just the toy Dockerfile above.
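(For readers unfamiliar with bake: that bake target boils down to an explicit multi-platform build; the image name and platform list below are illustrative.)
$ docker buildx build --platform linux/amd64,linux/arm64 --push -t registry.example.com/my-image:latest .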
Thank you @h4l!
With some minor modifications we got it working in our Buildkite environment as well!
All our modifications were tweaks around using an existing per-job network, as well as loading up the buildx configuration in every build's environment hooks :)
@mhornbacher Nice job, that's great, glad you got it working!
@rodnymolina @ctalledo I think this is a fairly workable solution for most customers. Big shoutout to @h4l for his discoveries!
For anyone stumbling upon this via a search:
- Install qemu-user-static and binfmt-support, which fixes running moby/buildkit under sysbox.
- Expose moby/buildkit to your container via the docker network (either via a custom bridge network or the default one) and use it as a remote builder.
Implementation details are in the above two comments 👍
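Condensed into commands (the network and builder names are illustrative; see the comments above for the full detail):
# 1) On the host OS where sysbox-runc is installed: install qemu-user-static and binfmt-support (see above).
# 2) Run buildkitd under sysbox-runc and expose it on a docker network:
$ docker network create buildkit-net
$ docker run -d --runtime=sysbox-runc --name remote-buildkitd --network buildkit-net moby/buildkit:latest --addr tcp://0.0.0.0:1234
# 3) From any container on that network, register it as a remote builder:
$ docker buildx create --name remote-buildkitd --driver remote --use tcp://remote-buildkitd:1234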
Feel free to @ me with a comment here if you need more explanation
Thanks @mhornbacher for your contribution to resolving this issue, much appreciated!
Just a heads-up that the workaround mentioned above to allow multi-arch builds within a sysbox container is not working as of v0.6.2 release (thanks to @DekusDenial for letting us know).
This is a consequence of a recent change to ensure that Sysbox exposes all the procfs and sysfs nodes known within the container's namespaces. As a side-effect, we stopped exposing a few host nodes within sysbox containers, which on one hand offers greater security, but on the other breaks functionality like that required by the above workaround.
As a fix, we could make an exception for these nodes by creating a new 'handler' in sysbox-fs to expose the host's /proc/sys/fs/binfmt_misc nodes. We will take this into account for the next release (ping us if you can't wait).
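In the meantime, a quick way to check whether a given Sysbox version still exposes the host's entries (per the reports here, on v0.6.2 and later the listing comes back without the host's qemu entries):
$ docker run --rm --runtime=sysbox-runc alpine ls /proc/sys/fs/binfmt_misc/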
Thanks. Pinging @jmandel1027 :)
I have the same problem running CI/CD; is there any workaround? Thanks.
You can use version 0.6.1 with the workarounds detailed above if you need multi-arch builds for now. This is no longer my day-to-day, so I am not on top of any 0.6.2 workarounds yet.
I am using Gitea with a self-hosted act_runner; it does not seem to use moby/buildkit at all.
- name: Checkout
  uses: https://github.com/actions/checkout@v3
- name: Set up QEMU
  uses: https://github.com/docker/setup-qemu-action@v2
- name: Set up Docker Buildx
  uses: https://github.com/docker/setup-buildx-action@v2
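(For context, the Set up QEMU step effectively runs the binfmt installer in a privileged helper container, roughly as below, which is the part an unprivileged Sysbox container cannot do; shown for illustration only:)
$ docker run --rm --privileged tonistiigi/binfmt --install all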
Hi @DekusDenial,
Will the upcoming 0.6.3 release cover the fix for https://github.com/nestybox/sysbox/issues/592#issuecomment-1611783474?
I am not sure; I see a few related commits, but nothing that specifically addresses that issue. How can I reproduce the problem?
I enabled qemu on a host and ran a sysbox container, but /proc/sys/fs/binfmt_misc was empty and didn't inherit from the host.
I see; no, unfortunately the ability to expose (i.e., namespace) binfmt_misc inside the Sysbox container is not present in this release.
And exposing the host's binfmt_misc inside the container (as Sysbox used to do by mistake) is not a good solution because it's a global resource in the system (i.e., if a container modifies the binary associated with a file-type, all other containers would be affected by the change). Ideally binfmt_misc needs to be per-Sysbox container. It's not a simple task because it requires Sysbox to emulate binfmt_misc inside the container. It can be done but it's a difficult task, and we've not yet had the chance to work on it unfortunately.
That's the reason we still pin to 0.6.1, given that we have no other means to enable qemu for multi-arch workloads.
Does 0.6.1 support it out of the box?
Got it; would the work-around in the comment above help?
Unfortunately not, because our setup and use case are not the same as above; based on the reporter's comment above, this issue also prevented them from moving to 0.6.2, so the workaround may no longer work.
I see; not sure how to help then: we can't go back to the v0.6.1 behavior because it breaks container isolation (i.e., it allows a container to modify a globally shared system resource, the binfmt_misc subsystem). But on the other hand the proper fix is a heavy lift.
The only thing I can think of is adding a config option in Sysbox that allows the container to access the host's binfmt_misc; the config would be set per-container, via an env variable (e.g., docker run --runtime=sysbox-runc -e SYSBOX_EXPOSE_HOST_BINFMT_MISC ...). This way the user can choose whether to do this, knowing that container isolation is reduced.
I think most people would prefer this config as a workaround for qemu. FYI, I used to have workloads scheduled on kata-containers, where users can register qemu on demand via privileged docker, but this won't work inside a sysbox container; that's why I have been relying on the host to pre-provide qemu.
Meanwhile, if you can point to the sources where this config would be implemented, or otherwise to where this binfmt_misc resource would be excluded from isolation, people can patch it on their side.
It's a bit more complicated; the work would be in sysbox-fs (the component that emulates portions of /proc and /sys inside the container).
For the emulated files or directories within /proc and /sys, sysbox-fs has a concept of a "handler" that performs the emulation for a file or a directory. We would need to create a new handler for /proc/sys/fs/binfmt_misc, and that handler would expose the host's binfmt_misc into the container. The code for the other handlers is here.
In addition, we would need to add the code that enables the feature on a per-container basis. That requires changes in sysbox-runc and the transport that sends the config to sysbox-fs.
It's not super difficult, but it's not a simple change either. For someone who knows the code, it's a few days of work (we also have to write the tests). If you wish to contribute (we appreciate that!), let me know and I can provide more orientation.
Otherwise it will have to wait until we have the cycles, as we balance Sysbox development & maintenance with other work at Docker.
I really look forward to this feature being available soon.