Compatibility issue between QEMU and the Linux kernel causes buildx builds to fail
Contributing guidelines
- [x] I've read the contributing guidelines and wholeheartedly agree
I've found a bug and checked that ...
- [x] ... the documentation does not mention anything about my problem
- [x] ... there are no open or closed issues that are related to my problem
Description
Our team has been using buildx to build multi-arch fluent-bit images for a long time. However, the ARM64 build on debian:bullseye started failing two months ago with the following error:
#39 249.8 ===============================================================================
#39 250.9 [ 27%] Performing build step for 'jemalloc'
#39 251.5 gcc: internal compiler error: Segmentation fault signal terminated program cc1
#39 251.5 Please submit a full bug report,
#39 251.5 with preprocessed source if appropriate.
After a lot of troubleshooting, we noticed there is a compatibility issue between QEMU and the Debian kernels 5.10.0-33-cloud-amd64/5.10.0-33-debian-amd64. We used the following approach to set up QEMU for buildx:
sudo docker run --privileged --rm tonistiigi/binfmt:qemu --install all
sudo docker buildx create --name builder --use
sudo docker buildx inspect --bootstrap
We tried different versions of QEMU, including 6.2, 7.0, 8.2, and 9.2.1 (latest), but none of them works with this kernel. As soon as I downgrade the kernel to 5.10.0-32-cloud-amd64, the build starts working again.
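For reference, a sketch of how the emulator can be swapped between versions using the per-version tags of tonistiigi/binfmt; the exact tag name below is an assumption, so check the published tags before using it:
# Remove the previously registered aarch64 handler, then install a pinned emulator.
# NOTE: the qemu-v7.0.0 tag is an example/assumption, not taken from this report.
sudo docker run --privileged --rm tonistiigi/binfmt --uninstall qemu-aarch64
sudo docker run --privileged --rm tonistiigi/binfmt:qemu-v7.0.0 --install arm64
# Recreate the builder so it picks up the new handler.
sudo docker buildx rm builder || true
sudo docker buildx create --name builder --use
sudo docker buildx inspect --bootstrap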
Since changing the kernel version, doing a native build, and cross-compiling are not options for our CI/CD pipeline, we are wondering how we can move forward to address this problem.
Expected behaviour
The build succeeds.
Actual behaviour
The build fails with an internal compiler error.
Buildx version
github.com/docker/buildx v0.17.1 257815a
Docker info
Client: Docker Engine - Community
Version: 27.3.1
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.17.1
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.29.7
Path: /usr/libexec/docker/cli-plugins/docker-compose
Server:
Containers: 1
Running: 1
Paused: 0
Stopped: 0
Images: 5
Server Version: 27.3.1
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 88bf19b2105c8b17560993bee28a01ddc2f97182
runc version: v1.2.2-0-g7cb3632
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
cgroupns
Kernel Version: 5.10.0-33-cloud-amd64
Operating System: Debian GNU/Linux 11 (bullseye)
OSType: linux
Architecture: x86_64
CPUs: 32
Total Memory: 31.35GiB
Name: instance-20241130-035110
ID: e00475b0-25b9-473b-a839-38acdcf7cb77
Docker Root Dir: /var/lib/docker
Debug Mode: false
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Builders list
NAME/NODE DRIVER/ENDPOINT STATUS BUILDKIT PLATFORMS
builder* docker-container
\_ builder0 \_ unix:///var/run/docker.sock running v0.17.2 linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/386
default docker
\_ default \_ default running v0.16.0 linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/386
Configuration
https://github.com/fluent/fluent-bit/blob/master/dockerfiles/Dockerfile
docker buildx build --platform=linux/arm64 -f ./dockerfiles/Dockerfile .
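For anyone who wants to reproduce this without the full fluent-bit Dockerfile, here is a hypothetical minimal sketch that only builds jemalloc under emulation. The package list and build steps are assumptions based on jemalloc's standard autotools build, not taken from the fluent-bit Dockerfile:
# Hypothetical minimal repro -- assumes the segfault is not specific to the
# fluent-bit build and can be hit by any large compile under qemu-aarch64.
cat > Dockerfile.repro <<'EOF'
FROM debian:bullseye
RUN apt-get update && apt-get install -y build-essential git autoconf
# The original failure occurs while compiling jemalloc.
RUN git clone --depth 1 https://github.com/jemalloc/jemalloc.git /src \
 && cd /src && ./autogen.sh && ./configure && make -j"$(nproc)"
EOF
docker buildx build --platform=linux/arm64 -f Dockerfile.repro .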
Build logs
Additional info
No response
Defining ENV QEMU_STRACE=1 will show you a trace of the syscalls proxied by the emulator and may point to a potential issue. If that works, you can try to submit your findings to the QEMU upstream tracker.
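A minimal sketch of how that tracing could be captured, assuming QEMU_STRACE is set via an ENV line in the Dockerfile stage that runs the failing compile (the log file name is just an example):
# Assumes the stage with the failing compile contains the line
#   ENV QEMU_STRACE=1
# so qemu-aarch64 prints every proxied guest syscall to stderr.
docker buildx build --platform=linux/arm64 -f ./dockerfiles/Dockerfile \
  --progress=plain . 2>&1 | tee qemu-strace.log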
We ran into a very similar issue when trying to build jemalloc with Alpine 3.20. The build pipeline still worked three weeks ago; now it fails with:
#9 201.3 gcc -std=gnu11 -Wall -Wextra -Wsign-compare -Wundef -Wno-format-zero-length -Wpointer-arith -Wno-missing-braces -Wno-missing-field-initializers -Wno-missing-attributes -pipe -g3 -fvisibility=hidden -Wimplicit-fallthrough -O3 -funroll-loops -fPIC -DPIC -c -D_GNU_SOURCE -D_REENTRANT -Iinclude -Iinclude -DJEMALLOC_NO_PRIVATE_NAMESPACE -o src/extent.sym.o src/extent.c
#9 204.3 gcc -std=gnu11 -Wall -Wextra -Wsign-compare -Wundef -Wno-format-zero-length -Wpointer-arith -Wno-missing-braces -Wno-missing-field-initializers -Wno-missing-attributes -pipe -g3 -fvisibility=hidden -Wimplicit-fallthrough -O3 -funroll-loops -fPIC -DPIC -c -D_GNU_SOURCE -D_REENTRANT -Iinclude -Iinclude -DJEMALLOC_NO_PRIVATE_NAMESPACE -o src/extent_dss.sym.o src/extent_dss.c
#9 206.1 make: *** [Makefile:480: src/emap.sym.o] Segmentation fault (core dumped)
#9 206.1 make: *** Deleting file 'src/emap.sym.o'
#9 206.1 make: *** Waiting for unfinished jobs....
I suspect the runner image is the culprit: the build that worked three weeks ago was using 20250105.1.0, while today the runner image version is 20250209.1.0. From the runner image readme, it seems the kernel version went from 6.8.0-1017-azure to 6.8.0-1021-azure in that time frame.
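If it helps with comparing environments, a quick way to record the host side of a runner before the build; the binfmt_misc entry name assumes the standard qemu-aarch64 handler registered by tonistiigi/binfmt:
# Record the host kernel and the registered aarch64 emulator so a working and
# a failing runner image can be diffed.
uname -r
cat /proc/sys/fs/binfmt_misc/qemu-aarch64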
> We tried different versions of QEMU, including 6.2, 7.0, 8.2, and 9.2.1 (latest), but none of them works with this kernel. As soon as I downgrade the kernel to 5.10.0-32-cloud-amd64, the build starts working again.
From the Debian sources, it seems that 5.10.0-32-cloud-amd64 was based on 5.10.222, whereas 5.10.0-33-cloud-amd64 was based on 5.10.224. These kernel releases are from quite a while ago, so I wonder whether my case just happens to be similar or the same underlying kernel change is the culprit 🤔
https://packages.debian.org/search?keywords=linux-image-5.10.0-32-cloud-amd64
https://packages.debian.org/search?keywords=linux-image-5.10.0-33-cloud-amd64
https://salsa.debian.org/kernel-team/linux/-/blob/debian/5.10/bullseye-security/debian/changelog?ref_type=heads#L537
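For anyone on bullseye who needs the same stop-gap as the original reporter, the downgrade amounts to installing the previous cloud kernel (package name taken from the links above) and booting into it. This is only a sketch, and note that 5.10.0-32 lacks the later security fixes:
# Workaround sketch: install the previous cloud kernel on Debian bullseye.
sudo apt-get update
sudo apt-get install -y linux-image-5.10.0-32-cloud-amd64
# Depending on the GRUB configuration you may also need to make this kernel
# the default (or remove the -33 image) before rebooting.
sudo reboot
# After reboot, confirm the running kernel:
uname -r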
FWIW, at least in our case this seemed to be related to the segfaults described in https://github.com/tonistiigi/binfmt/issues/215#issuecomment-2613004741, which were caused by a kernel hardening patch that was backported to the stable kernels. It seems that this patch triggers a QEMU bug.