Draft: multi arch build for debezium (DBZ-5213)

Open nilshartmann opened this issue 2 years ago • 8 comments

This is my draft of how to build multi-arch (amd64 and arm64) images for Debezium (https://issues.redhat.com/browse/DBZ-5213).

Note that I do not have much experience with docker builds (esp. not with multi arch builds), so probably there are better or other ways to implement the build. But maybe my changes can be a starting point.

Changes on this branch:

  • all images are built for amd64 and arm64
  • all images hopefully tagged with the same semantics as before 🙏
  • all images are pushed to docker hub and quay.io.
    • That behaviour differs from the current version. I think with buildx you HAVE to push the image somewhere as long as you build for more than one architecture.
    • The script could be changed, so that it would be possible to not push the image if only one architecture is specified (for local tests for example)
  • I made the name of the "debezium" docker hub user configurable
    • it defaults to debezium
    • I don't know if that is really necessary, but I didn't find another way to actually test the push and pull of the images. So changing the name is for testing only. If there is a better way for testing, please let me know.
  • I made the name of the "second" registry (quay.io) configurable
    • it defaults to quay.io
    • this change is also related to testing: while testing you can specify another second registry (for example a local docker registry)
  • the GH workflow actions install buildx
    • for some reason the build on github still fails, maybe someone with buildx and/or GH actions know-how can help.
    • all changes work on my local machine 🙄

nilshartmann avatar Aug 23 '22 14:08 nilshartmann

@nilshartmann I will ask for a bit of a patience here, I will review this within next week or so (I'm currently on PTO).

@jpechane does it make sense to retrospectively change the docker files for previous releases? Although this is just parametrisation I wouldn't do that.

jcechace avatar Aug 26 '22 15:08 jcechace

does it make sense to retrospectively change the docker files for previous releases?

Doing those changes for 1.9 and 2.0 definitely is enough. We don't "support" older versions anymore at this point.

gunnarmorling avatar Aug 26 '22 15:08 gunnarmorling

Doing those changes for 1.9 and 2.0 definitely is enough. We don't "support" older versions anymore at this point.

What about duplicating the current scripts (build-multi-arch-debezium.sh, for example) and letting them run with parameterized Dockerfiles only for 1.9 and upcoming releases? The current scripts can then be limited to pre-1.9 releases with the current, non-parameterized Dockerfiles. So changes would affect only Dockerfiles 1.9+ (and new build scripts).

nilshartmann avatar Aug 29 '22 06:08 nilshartmann

@nilshartmann The build-all script should likely be extended to allow specifying which components to build. Ideally it should be able to determine that on its own and skip those where an ARM build is not possible (however this might not be feasible).

Currently I can see 2 reasons why the build in GH is failing

  1. The GH action may not setup a buildx builder (not sure about this one)
  2. It will for sure fail with example-mongo as mongo:3.2 doesn't have ARM manifest. The build ends with
 => ERROR [linux/arm64 internal] load metadata for docker.io/library/mongo:3.2
...
ERROR: failed to solve: mongo:3.2: no match for platform in manifest sha256:0463a91d8eff189747348c154507afc7aba045baa40e8d58d8a4c798e71001f3: not found

A further attempt to explicitly pull mongo on ARM leads to a cleaner error message:

$ docker pull mongo:3.2
3.2: Pulling from library/mongo
no matching manifest for linux/arm64/v8 in the manifest list entries
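
A check like the one above could also be scripted before starting a build. The following is a rough sketch only: `manifest_has_arch` and `image_supports_arch` are hypothetical helper names, and the check assumes the JSON layout printed by `docker manifest inspect` for a manifest list (an `"architecture"` field per platform). Images that publish only a single manifest would need different handling.

```shell
# Succeeds if the manifest-list JSON on stdin lists the given architecture.
# The JSON is assumed to look like the output of `docker manifest inspect`.
manifest_has_arch() {
  grep -Eq "\"architecture\"[[:space:]]*:[[:space:]]*\"$1\""
}

# Hypothetical wrapper; needs a docker CLI and access to the registry.
# Arguments: image reference, architecture (for example arm64).
image_supports_arch() {
  docker manifest inspect "$1" | manifest_has_arch "$2"
}
```

With this, `image_supports_arch mongo:3.2 arm64` would be expected to fail for the image discussed here, so the ARM build of example-mongo could be skipped up front.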

jcechace avatar Sep 19 '22 12:09 jcechace

@nilshartmann I might be missing a few steps, so if you could provide steps for running the build (ideally using a registry:2 container as a registry), it would be greatly appreciated.

jcechace avatar Sep 19 '22 13:09 jcechace

Hi @jcechace!

Thanks a lot for your feedback and comments!

The build-all script should likely be extended to allow specifying which components to build. Ideally it should be able to determine that on its own and skip those where an ARM build is not possible (however this might not be feasible).

I will change the script. Probably first by manually setting a list of platforms for each image (or list of images that should not be built for ARM), as it seems to be much easier to implement. If anything else works, we can try to automatically determine for what images ARM should be skipped imho.
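
The manual approach described above could look roughly like this. It is only a sketch under assumptions: the function name `platforms_for` is hypothetical, and example-mongo is listed as amd64-only because of the missing ARM manifest for mongo:3.2 mentioned earlier in this thread; any further entries would have to come from actual test builds.

```shell
# Return the platforms an image should be built for.
# Images whose base image has no ARM manifest stay amd64-only.
platforms_for() {
  case "$1" in
    example-mongo) echo "linux/amd64" ;;              # mongo:3.2 has no ARM manifest
    *)             echo "linux/amd64,linux/arm64" ;;  # default: both platforms
  esac
}
```

A build loop could then call `platforms_for "$image"` for each image and pass the result to the multiplatform script.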

Currently I can see 2 reasons why the build in GH is failing

1. The GH action may not setup a buildx builder (not sure about this one)

I think buildx build is set up, as I tried with a very simple image and that worked. But I'll re-check.

2. It will for sure fail with example-mongo as mongo:3.2 doesn't have ARM manifest. The build ends with

Thanks! Good point!

Regarding your other questions. I introduced the env variables to be able to set different image names/registries during my local test builds:

  • DEBEZIUM_DOCKER_NAME: I set this to my dockerhub username to test build and deploy to docker hub
  • DEBEZIUM_QUAY_IO: While testing I set this to a local docker registry
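
To illustrate how the two variables might be consumed, here is a minimal sketch. The defaults (debezium and quay.io) are the ones stated in this thread; the function name `image_refs` and the reference layout `<prefix>/<image>:<tag>` are assumptions for illustration, and the real scripts may compose names differently.

```shell
# Print the two image references a build would tag and push.
# Assumption: a reference is "<prefix>/<image>:<tag>".
# Arguments: image name, tag.
image_refs() {
  primary="${DEBEZIUM_DOCKER_NAME:-debezium}"  # default named in this thread
  secondary="${DEBEZIUM_QUAY_IO:-quay.io}"     # default named in this thread
  echo "${primary}/$1:$2"
  echo "${secondary}/$1:$2"
}
```

Exporting `DEBEZIUM_DOCKER_NAME=localhost:5500/debezium` before calling `image_refs postgres 14-alpine` would redirect the first reference to a local registry without touching the script.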

If there is a better way to run, test and push the build/images (locally), please let me know. I'm in no way a docker expert 🥺

nilshartmann avatar Sep 20 '22 19:09 nilshartmann

hello @jcechace, a long time no see! :) I just stumbled upon this as I'm using the debezium/connect-base container image in my tests and I realized it's too slow because it's running emulated on M1 mac. It would be so cool to have multi-arch images.

many thanks to @nilshartmann for this work!

jerrinot avatar Sep 22 '22 08:09 jerrinot

Hi @jcechace,

I pushed some new commits. Inside the repository there is a new README-build.md file that describes the steps for building images locally for test.

The build-all script now runs:

  • build-postgres-multiplatform.sh and build-postgres-multiplatform.sh with either linux/amd64 or linux/amd64,linux/arm64. I tested all versions on all platforms. If the arm build for one version failed, I decided to stay with linux/amd64 only.
  • Debezium builds in build-all run the "old" build-debezium.sh for all Debezium versions that are not explicitly listed in build-all.sh (see there). Newer versions are built using build-debezium-multiplatform.sh for linux/amd64,linux/arm64. (BTW, I had trouble building some of the older versions of Debezium, even though I didn't change the script. Might be a problem on Mac M1.)

You can use both env variables DEBEZIUM_DOCKER_NAME and DEBEZIUM_QUAY_IO to change the registries the images are pushed into (please see the README_build.md file for more information including setup of a local registry:2).

(Maybe DEBEZIUM_DOCKER_NAME and DEBEZIUM_QUAY_IO should be renamed to something like DEBEZIUM_DOCKER_REGISTRY_1_NAME and DEBEZIUM_DOCKER_REGISTRY_2_NAME)

BTW you can build the images and push them to a local registry inside a GitPod workspace if you like. Just open the branch in GitPod, then run on terminal:

docker-compose -f local-registry/docker-compose.yml up -d

./setup-local-builder.sh 

export DEBEZIUM_DOCKER_NAME=localhost:5500/debezium
export DEBEZIUM_QUAY_IO=localhost:5500/debeziumquay

./build-postgres-multiplatform.sh 14-alpine "linux/amd64,linux/arm64"
./build-debezium-multiplatform.sh 1.9

Probably there is still some work to do, please let me know if I can help.

Thanks, Nils

nilshartmann avatar Sep 23 '22 17:09 nilshartmann

Doing those changes for 1.9 and 2.0 definitely is enough. We don't "support" older versions anymore at this point.

What about duplicating the current scripts (build-multi-arch-debezium.sh, for example) and letting them run with parameterized Dockerfiles only for 1.9 and upcoming releases? The current scripts can then be limited to pre-1.9 releases with the current, non-parameterized Dockerfiles. So changes would affect only Dockerfiles 1.9+ (and new build scripts).

Sounds like a good solution to me -- we keep the ability to build the old (but unsupported) and also add the new options. It also seems to be in order with how we duplicate dockerfiles for new versions of DBZ.

jcechace avatar Sep 26 '22 08:09 jcechace

@nilshartmann Regarding the naming such as DEBEZIUM_DOCKER_REGISTRY_1_NAME... I think this would be better. Maybe even DEBEZIUM_DOCKER_REGISTRY_PRIMARY_NAME.

jcechace avatar Sep 26 '22 09:09 jcechace

I hit an error

ERROR: failed to solve: process "/dev/.buildkit_qemu_emulator /bin/sh -c echo \"$SHA256HASH /tmp/zookeeper.tar.gz\" | sha512sum -c - &&    tar -xzf /tmp/zookeeper.tar.gz -C $ZK_HOME --strip-components 1 &&    rm -f /tmp/zookeeper.tar.gz" did not complete successfully: exit code: 2

This seems to be related to a required qemu interpreter.

Sources: https://www.linuxfixes.com/2021/12/solved-can-install-bash-in-multiarch.html https://github.com/docker/buildx/issues/464

multiarch/qemu-user-static didn't seem to work for me (ubuntu arm64 instance in EC2) while the tonistiigi/binfmt one worked. At the moment this is all black magic for me, so I will poke around a bit before merging. So far the build is running (although it looks like it is quite slow -- however, the main bottleneck seems to be downloading maven dependencies, so probably unrelated to this PR).
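
For reference, registering the emulator with tonistiigi/binfmt can be sketched as follows. This is a hedged illustration rather than the exact steps used here: `foreign_arch` and `binfmt_install_command` are hypothetical helper names, the command is printed instead of executed, and only the two architectures relevant to this PR are handled.

```shell
# Map the host architecture (as printed by `uname -m`) to the
# architecture that has to be emulated for an amd64 + arm64 build.
foreign_arch() {
  case "$1" in
    x86_64|amd64)  echo "arm64" ;;
    aarch64|arm64) echo "amd64" ;;
    *) echo "unsupported host architecture: $1" >&2; return 1 ;;
  esac
}

# Print (do not run) the command that registers the QEMU handler.
binfmt_install_command() {
  arch="$(foreign_arch "$1")" || return 1
  echo "docker run --privileged --rm tonistiigi/binfmt --install ${arch}"
}
```

Calling `binfmt_install_command "$(uname -m)"` on an arm64 host prints the command that installs the amd64 emulator; running the printed command requires a privileged container.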

jcechace avatar Sep 26 '22 11:09 jcechace

Hi @jcechace,

Regarding the naming such as DEBEZIUM_DOCKER_REGISTRY_1_NAME... I think this would be better. Maybe even DEBEZIUM_DOCKER_REGISTRY_PRIMARY_NAME.

I changed the names to DEBEZIUM_DOCKER_REGISTRY_PRIMARY_NAME and DEBEZIUM_DOCKER_REGISTRY_SECONDARY_NAME

nilshartmann avatar Sep 26 '22 17:09 nilshartmann

although it looks like it is quite slow

I noticed that too, not sure if this is somehow related to the emulator.

I wonder if it really makes sense to rebuild all images every time (build-all.sh).

At the moment this is all black magic for me, so I will poke around a bit before merging.

Please let me know if I can help.

nilshartmann avatar Sep 26 '22 17:09 nilshartmann

@jerrinot

although it looks like it is quite slow

I noticed that too, not sure if this is somehow related to the emulator.

I wonder if it really makes sense to rebuild all images every time (build-all.sh).

Technically yes, since there is no need to run the script unless there were some changes. The majority of the slowness (apart from UI) is the PG builds.

At the moment this is all black magic for me, so I will poke around a bit before merging.

Please let me know if I can help.

I think I got a better grasp. It seems an alternative would be a cross build with arch-specific builder stages; however, that would require a modification of all dockerfiles, so our current approach seems more feasible for now.

jcechace avatar Sep 27 '22 11:09 jcechace

@nilshartmann It looks like PG 14 does not build via buildx on amd64 (it builds fine using standard docker build). Another thing which seems to be failing consistently is the UI -- however I think that might be a known issue (the log output for buildx is not as informative as a regular build so I will look more into this).

jcechace avatar Sep 27 '22 11:09 jcechace

@jcechace I tried building PG 14 both on GitPod and Mac M1 using ./build-postgres-multiplatform.sh 14 linux/amd64 and it worked successfully. Only building for linux/arm64 failed.

nilshartmann avatar Sep 27 '22 11:09 nilshartmann

@nilshartmann can you make sure that the buildx cache is pruned? Strangely enough, when I built PG 14 using the old build-postgres.sh script with regular docker build, a subsequent run of the multiplatform script passed. Otherwise I'm getting

#0 49.03 make: *** [/usr/lib/postgresql/14/lib/pgxs/src/makefiles/../../src/Makefile.shlib:293: decoderbufs.so] Segmentation fault

This happens on EC2 arm64 VM running Ubuntu.

jcechace avatar Sep 27 '22 12:09 jcechace

@jcechace I did docker buildx prune --all (Mac M1) and the build still works. I also set up a test build on GitHub (ubuntu) and it works: https://github.com/nilshartmann/pg-buildx/actions/runs/3135577830/jobs/5091422539

nilshartmann avatar Sep 27 '22 12:09 nilshartmann

One remark, can we have the x86 build only by default and multiarch executed only upon an option/env var setting?
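
One way to read this request is an opt-in switch with an x86-only default. The sketch below is purely illustrative: the variable name `MULTIPLATFORM` and the function `resolve_platforms` are hypothetical and are not taken from the scripts in this PR.

```shell
# Platforms to build. Defaults to x86 only; multi-arch is opt-in.
# MULTIPLATFORM is a hypothetical variable name used for illustration.
resolve_platforms() {
  if [ "${MULTIPLATFORM:-false}" = "true" ]; then
    echo "linux/amd64,linux/arm64"
  else
    echo "linux/amd64"
  fi
}
```

Running a build with `MULTIPLATFORM=true` exported would then select both platforms, while a plain invocation keeps today's x86-only behaviour.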

jpechane avatar Sep 27 '22 12:09 jpechane

@jpechane So when running build-all.sh, the "old" build-debezium, build-postgres and build-mongo scripts should be run? Or would it be sufficient to run the new scripts by default but set the platform to linux/amd64 only? In any case I think it would be easier to provide two scripts, build-all.sh and build-all-multiplatform.sh.

nilshartmann avatar Sep 27 '22 12:09 nilshartmann

@nilshartmann That's an implementation detail :-). Yes, setting the platform to linux/amd64 is enough. And I agree with having two scripts: build-all.sh would produce the same output as now and build-all-multiplatform.sh will do everything.

jpechane avatar Sep 27 '22 13:09 jpechane

Great idea, I think this will provide the largest amount of backward compatibility.

Regarding the build of PG 14... it is strange, at this point I have about 60% success rate on building that particular image with buildx on arm64 EC2 machine. I wonder if something is different on Apple silicon. To make things weirder, I can see two errors consistently...

The one mentioned above and

#0 267.1 Setting up software-properties-common (0.96.20.2-2.1) ...
#0 270.4 Traceback (most recent call last):
#0 270.4   File "/usr/bin/py3compile", line 319, in <module>
#0 270.4     main()
#0 270.4   File "/usr/bin/py3compile", line 298, in main
#0 270.4     compile(files, versions,
#0 270.4   File "/usr/bin/py3compile", line 185, in compile
#0 270.4     cfn = interpreter.cache_file(fn, version)
#0 270.4   File "/usr/share/python3/debpython/interpreter.py", line 212, in cache_file
#0 270.4     (fname[:-3], self.magic_tag(version), last_char))
#0 270.4   File "/usr/share/python3/debpython/interpreter.py", line 246, in magic_tag
#0 270.5     return self._execute('import imp; print(imp.get_tag())', version)
#0 270.5   File "/usr/share/python3/debpython/interpreter.py", line 359, in _execute
#0 270.5     raise Exception('{} failed with status code {}'.format(command, output['returncode']))
#0 270.5 Exception: python3.9 -c 'import imp; print(imp.get_tag())' failed with status code 139
#0 270.6 dpkg: error processing package software-properties-common (--configure):
#0 270.6  installed software-properties-common package post-installation script subprocess returned error exit status 1

Nevertheless, I am also going to try it the other way around and build the packages on amd64. The issue might be with qemu interpreters on arm64 ubuntu.

jcechace avatar Sep 27 '22 13:09 jcechace

@jpechane I will try to implement this later today

nilshartmann avatar Sep 27 '22 13:09 nilshartmann

PG14 (non alpine) builds just fine for both amd64 and arm64 on an amd64 ubuntu machine in EC2. I will do one final pass of build-all for both platforms on an amd64 host, and once @nilshartmann implements the changes @jpechane suggested we should be good to merge this. Another observation... arm64 emulation on x86 is orders of magnitude faster than the other way around (at least in my environment, however considering the instruction complexity of each architecture it makes sense).

@nilshartmann please also correct the failing commit message.

jcechace avatar Sep 27 '22 13:09 jcechace

With latest (force) push, I changed the scripts as discussed:

  • build-*.sh are the "old" scripts, mostly unchanged (see below)
  • build-*-multiplatform.sh contain the new scripts

There is one exception: build-debezium.sh also sets the DEBEZIUM_DOCKER_REGISTRY_PRIMARY_NAME and DEBEZIUM_DOCKER_REGISTRY_SECONDARY_NAME env because those are referenced in the 1.9 and 2.0 Dockerfiles (for kafka, connect etc.).

If you run build-debezium.sh for a version older than 1.9, nothing should change compared to the current scripts. Running build-debezium.sh for 1.9+ should not change anything either, but (base) image names in Dockerfiles are taken from environment variables that are either set on the command line or set to default values inside build-debezium.sh.

I also changed build-debezium-multiplatform.sh to align it with the other build-*-multiplatform scripts. It now takes the platforms to build as its second argument. So you could limit the build, for example, to only linux/amd64.

Then there are build-tools.sh and build-tools-multiplatform.sh. I'm not sure if multiplatform support is needed here; if not, we could remove build-tools-multiplatform.sh, otherwise add it to the GH actions.

nilshartmann avatar Sep 27 '22 18:09 nilshartmann

@jcechace

ERROR: failed to solve: process "/dev/.buildkit_qemu_emulator /bin/sh -c echo \"$SHA256HASH /tmp/zookeeper.tar.gz\" | sha512sum -c - &&    tar -xzf /tmp/zookeeper.tar.gz -C $ZK_HOME --strip-components 1 &&    rm -f /tmp/zookeeper.tar.gz" did not complete successfully: exit code: 2

multiarch/qemu-user-static didn't seem to work for me (ubuntu arm64 instance in EC2) while the tonistiigi/binfmt one worked. At the moment this is all black magic for me, so I will poke around a bit before merging. So far the build is running (although it looks like it is quite slow -- however, the main bottleneck seems to be downloading maven dependencies, so probably unrelated to this PR).

There is another GH action (docker/setup-qemu-action@v2) that sets up qemu and I added that to my test build, and it seems to work. See here an example (zookeeper only, but I'm optimistic, that kafka also builds): https://github.com/nilshartmann/docker-arch-build-test/actions/runs/3138713626/jobs/5098350092

Another option would be to replace tar -xzf with gunzip tarfile.tar.gz && tar -xf tarfile.tar. I tried that for zookeeper as well, and it works (https://github.com/nilshartmann/docker-arch-build-test/actions/runs/3138581786/jobs/5098069252). We would have to change the Dockerfiles for apache and zookeeper (1.9+) and connect (snapshot).
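
The two-step extraction could be wrapped like this. It is a sketch only: the function name `extract_two_step` is hypothetical, and the `--strip-components 1` option simply mirrors the tar call visible in the error message above rather than the actual Dockerfile change.

```shell
# Unpack a .tar.gz in two steps (gunzip, then tar) instead of `tar -xzf`.
# gunzip replaces "<name>.tar.gz" with "<name>.tar", which is then
# extracted into the target directory and removed.
# Arguments: archive path ending in .gz, target directory.
extract_two_step() {
  archive="$1"
  target="$2"
  mkdir -p "$target" &&
    gunzip -f "$archive" &&
    tar -xf "${archive%.gz}" -C "$target" --strip-components 1 &&
    rm -f "${archive%.gz}"
}
```

In a Dockerfile this would correspond to something like `extract_two_step /tmp/zookeeper.tar.gz "$ZK_HOME"` in place of the single tar invocation.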

Advantage of changing the docker file over the qemu is imho, that anyone can simply build the images (locally) without having to install qemu.

Update: Without QEMU, docker-maven-download in connect-base does not build: https://github.com/nilshartmann/docker-arch-build-test/actions/runs/3138809173/jobs/5098552113#step:7:1587, but with QEMU it does: https://github.com/nilshartmann/docker-arch-build-test/actions/runs/3138917357/jobs/5098778184#step:6:1379. So I added docker/setup-qemu-action@v2 to the GH action files.

nilshartmann avatar Sep 27 '22 21:09 nilshartmann

Finally I was able to build all 1.9 and 2.0 images (with the exception of debezium-ui 2.0) 😊

Downside: it took more than four hours 😱 https://github.com/nilshartmann/docker-arch-build-test/actions/runs/3138917357/jobs/5098778184

I haven't had a deeper look into it yet, but building the tools images took more than three hours. If anyone has an idea how to speed that up, please let me know. Otherwise I would suggest that we stay with the old build here and do not provide multiplatform images.

nilshartmann avatar Sep 28 '22 06:09 nilshartmann

@nilshartmann nice work, everything looks in order -- I was able to successfully build all the images as well (doing so on x86 is actually much faster (about 3 times) as emulating arm on x86 is faster). I will do some testing with produced images and merge the PR afterwards.

jcechace avatar Sep 29 '22 10:09 jcechace

I fixed the shellcheck error (https://github.com/debezium/container-images/actions/runs/3141452521/jobs/5128785789#step:3:7) with my latest push

nilshartmann avatar Sep 29 '22 16:09 nilshartmann

@jpechane LGTM, one thing is that the GH Action for the anonymous build will likely take quite a lot of time. Do we perhaps want to make it optional or something?

Otherwise, once the action is finished this can be merged.

jcechace avatar Sep 30 '22 09:09 jcechace