tika-docker

Add GitHub CI workflows for multi-arch Docker images

Open fpiesche opened this issue 1 year ago • 17 comments

With the increasing popularity of homelab servers and entire clusters running on ARM SBCs such as the Raspberry Pi (hello), it might be useful to have multi-architecture images that run on x86, ARM64 and possibly even 32-bit ARM machines such as the Raspberry Pi 2. I've added a GitHub workflow that takes the TIKA_VERSION build arg for the Docker images as an input parameter, builds Tika images for all of these architectures (others are easy to add as long as the base images support them) and uploads them as a single multi-architecture manifest via Docker's BuildKit functionality.

As an example: https://hub.docker.com/r/florianpiesche/tika-minimal/tags https://hub.docker.com/r/florianpiesche/tika-full/tags

This significantly simplifies the build process for new images: Docker images for arbitrary versions of Tika can be built and pushed to multiple registries from GitHub's web UI literally at the click of a button.
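For readers who haven't used BuildKit's multi-platform builds, here is a minimal sketch of the kind of `docker buildx build` invocation such a workflow boils down to. The helper name `buildx_command`, the image name in the usage note and the exact platform list are assumptions for illustration; the workflow in the PR drives this through GitHub Actions rather than a script like this one.

```python
from typing import List

# Docker platform strings for the hardware mentioned above.
# The selection is illustrative; others can be added if the base image supports them.
PLATFORMS = {
    "x86-64": "linux/amd64",
    "ARM64": "linux/arm64",
    "32-bit ARM (e.g. Raspberry Pi 2)": "linux/arm/v7",
}


def buildx_command(image: str, tika_version: str, push: bool = False) -> List[str]:
    """Compose a multi-platform `docker buildx build` command as an argv list."""
    cmd = [
        "docker", "buildx", "build",
        "--platform", ",".join(PLATFORMS.values()),
        "--build-arg", f"TIKA_VERSION={tika_version}",
        "--tag", f"{image}:{tika_version}",
    ]
    if push:
        # With several platforms, the pushed result is one multi-architecture manifest.
        cmd.append("--push")
    cmd.append(".")  # build context: the current directory
    return cmd
```

Calling `buildx_command("example/tika-minimal", "2.8.0", push=True)` yields an argv list that could be handed to `subprocess.run`; `example/tika-minimal` is a placeholder, not a real repository.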

fpiesche avatar Aug 12 '23 13:08 fpiesche

👏🏻

Had this on my todo list for next week and wanted to come up with a suggestion as well. Kudos!

mpdude avatar Aug 12 '23 19:08 mpdude

Added some more comments documenting what various bits of the workflow and dependabot configuration file do while I was at it!

fpiesche avatar Aug 16 '23 16:08 fpiesche

Awesome, this would let us archive https://github.com/paperless-ngx/tika, which existed solely for providing multi-arch images

stumpylog avatar Aug 17 '23 02:08 stumpylog

@stumpylog my paperless-ngx installation is in fact what I created this PR for :D

fpiesche avatar Aug 17 '23 08:08 fpiesche

@lewismc any chance you could get this merged please?

It would be great to have arm64 images

nextgens avatar Oct 11 '23 13:10 nextgens

This would be a fantastic modernization of our release process with the added benefit of multi-arch. Thank you!

I have two small concerns/questions:

  1. I'm not sure we want dependabot nagging us for base image updates up-to-daily. Or, if it does, is the expectation that we'll make releases up-to-daily? This would be an up-to-daily chore that I'm not sure we have time for.
  2. How can we run the tests before pushing to docker hub?

tballison avatar Oct 11 '23 13:10 tballison

More questions...sorry.

  3. We moved to versioning of docker images of {tika-version}.{docker.version}. For example, 2.9.0.1 would be the second docker release for Tika 2.9.0. Does this handle that?
  4. To confirm, this is building and deploying both full-* and minimal-* and this is tagging the release with the docker release number and latest? I think the answer to both is yes based on looking at the PR more carefully, but wanted to confirm.
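To make the versioning scheme concrete, here is a small sketch of how a tag like 2.9.0.1 splits into its two parts. The names `ImageVersion` and `parse_image_version` are hypothetical, and the sketch assumes the Tika version always has exactly three dot-separated parts.

```python
from typing import NamedTuple


class ImageVersion(NamedTuple):
    tika_version: str    # e.g. "2.9.0"
    docker_release: int  # e.g. 1; counting starts at 0, so 1 is the second release


def parse_image_version(tag: str) -> ImageVersion:
    """Split a '{tika-version}.{docker.version}' tag on its last dot."""
    tika_version, sep, docker_release = tag.rpartition(".")
    if not sep or not docker_release.isdigit() or tika_version.count(".") != 2:
        raise ValueError(f"expected something like 2.9.0.1, got {tag!r}")
    return ImageVersion(tika_version, int(docker_release))
```

With this, `parse_image_version("2.9.0.1")` returns `ImageVersion(tika_version='2.9.0', docker_release=1)`, while a plain `2.9.0` is rejected.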

tballison avatar Oct 11 '23 13:10 tballison

I'll put some work into some of your concerns. To answer your questions for now:

  1. Dependabot will check for base image updates on a daily basis, but will only open a PR if the latest base image tag has changed. This won't actually happen with the current setup as the specified tag is not a versionable tag. What this would do is if e.g. the Dockerfiles switched to using ubuntu:23.10 instead of ubuntu:jammy, dependabot would open a PR to update this to 24.04 once that image is released, and again 24.10 once that drops, etc. Dependabot does not check for whether an already-existing tag has had a new image deployed (as this wouldn't need changes to the Dockerfiles, just a rebuild). Similarly, the github-actions section will ensure that if a new version of any of the Actions used (e.g. docker/build-push-action) is released, Dependabot will open a PR to update the workflows to use this. CI builds can then run to verify that the builds still work, after which the PR could just be merged and the workflows thus kept up to date. This could be an up to daily task if Docker or GitHub decided to release new versions of their actions multiple days in a row, but updates to GitHub actions are generally fairly infrequent as well.

  2. This is a change I can make - what are the tests you'd want to run? Are they unit tests, or do they involve bringing up the built image and running some tests on the build to ensure it's working properly? Either should be fairly easy to implement.

  3. Currently the images are tagged with just the Tika version - however I've checked and since you already create tags in this GitHub repo with the proper version numbers, this is just a matter of changing the tag parameter in line 81 of the workflow. I'll change this in a bit...

  4. This is correct. Images will be built as tika-full and tika-minimal respectively (lines 77/78 of the workflow) and pushed under the latest tag and the version tag to the GitHub container registry and Docker Hub (ie. they'll be accessible as ghcr.io/apache/tika-minimal|tika-latest and docker.io/${{secrets.DOCKERHUB_USERNAME}}/tika-minimal|tika-latest). Tagging the most recent stable release of an image as latest is the de facto standard with most Docker images; development images usually use edge as a generic "most recent development build" option. The Docker metadata action can specify further possible tags, e.g. git commit SHA sums, arbitrary text, build numbers etc.
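As a rough illustration of point 4, the set of image references produced by one build is just the product of registries, image names and tags. This is a sketch under assumptions: `image_refs` is a made-up helper, and `example-user` stands in for whatever the DOCKERHUB_USERNAME secret holds.

```python
from itertools import product
from typing import Iterable, List


def image_refs(
    registries: Iterable[str], images: Iterable[str], tags: Iterable[str]
) -> List[str]:
    """Return every registry/image:tag combination, registries varying slowest."""
    return [
        f"{registry}/{image}:{tag}"
        for registry, image, tag in product(registries, images, tags)
    ]


refs = image_refs(
    registries=["ghcr.io/apache", "docker.io/example-user"],  # second entry is a placeholder
    images=["tika-minimal", "tika-full"],
    tags=["2.8.0", "latest"],
)
```

Here `refs` holds eight references, starting with `ghcr.io/apache/tika-minimal:2.8.0`.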

fpiesche avatar Oct 17 '23 18:10 fpiesche

Wow, thank you @fpiesche! All looks good.

For tests, this is our current release process. Our rudimentary tests are here: https://github.com/apache/tika-docker/blob/master/docker-tool.sh#L39 We'd always welcome more tests!

tballison avatar Oct 18 '23 10:10 tballison

With this most recent set of changes:

  • The workflow will automatically trigger if a new GitHub release is created and/or a new tag (of format *.*.*.*) is pushed
  • The build will run and locally store the image, then
    • run the newly-built image
    • as per the docker-tool.sh, check that the service is responding to http requests and running as the expected user
  • If tests have passed and either the "push image" flag is set on a manual build or the build is for a tag/release, the image will be pushed to the remote repositories with the following tags:
    • the name of the latest tag
    • the Tika version (so e.g. the image tagged 2.8.0 will always be the latest build out of the 2.8.0.x tags in the git repository)
    • latest
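The trigger and push rules above can be sketched as a few small functions. This is an illustration, not the workflow's actual code: the function names are made up, "the name of the latest tag" is read here as the git tag that triggered the build, and Python's `fnmatch` is only an approximation of GitHub's tag filter patterns (for example, GitHub's `*` does not match `/`).

```python
from fnmatch import fnmatchcase
from typing import List

TAG_PATTERN = "*.*.*.*"  # tag format that triggers the workflow


def is_release_tag(git_tag: str) -> bool:
    """True if the tag has the four-part shape that triggers a build."""
    return fnmatchcase(git_tag, TAG_PATTERN)


def should_push(tests_passed: bool, push_flag: bool, is_tag_or_release: bool) -> bool:
    """Push only after passing tests, and only for manual push builds or tags/releases."""
    return tests_passed and (push_flag or is_tag_or_release)


def image_tags(git_tag: str) -> List[str]:
    """Tags applied to a pushed image: the git tag, the Tika version, and latest."""
    tika_version = git_tag.rsplit(".", 1)[0]  # drop the trailing docker release number
    return [git_tag, tika_version, "latest"]
```

For a git tag of 2.8.0.1, `image_tags` gives `['2.8.0.1', '2.8.0', 'latest']`, so the 2.8.0 image tag always follows the newest 2.8.0.x build.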

And finally, here's an example build triggered by creating a new GitHub release

fpiesche avatar Nov 03 '23 20:11 fpiesche

A good opportunity to show how Dependabot PRs work! ;) After enabling Dependabot on my fork: [screenshot: Screenshot_20231103_230717]

https://github.com/fpiesche/tika-docker/pulls?q=is%3Apr+is%3Aclosed

None of these needed any changes before merging. Manually-run example build after closing the lot, to check the update hasn't broken anything: https://github.com/fpiesche/tika-docker/actions/runs/6751065869

fpiesche avatar Nov 04 '23 00:11 fpiesche

@fpiesche and fellow devs...IIUC, to release this from the Apache account, I'd have to add my personal dockerhub username and password to the secrets in Apache account, where anyone with collaborator status could use it.

Is there a better solution? Should I make the release from my personal fork?

tballison avatar Nov 06 '23 13:11 tballison

And, sorry for the noise, but we recently changed our main branch to main -- https://issues.apache.org/jira/browse/TIKA-4163

tballison avatar Nov 06 '23 15:11 tballison

Hello! 👋

This PR has been kicking around for some time and I'm interested in official arm images (right now we are building our own internally).

Is anything blocking this PR from being merged? I see @lewismc requested changes but there have been many updates since.

Does this need shepherding through? It's not clear whether @fpiesche is still active on this work.

Thank you! Happy to help as necessary.

bartek avatar Mar 09 '24 00:03 bartek

We could definitely use some help. This is not an area of strength for me and has fallen off my plate.

tballison avatar Mar 09 '24 00:03 tballison

Sorry for dropping off the radar - I've had a lot of Life coming at me over the past few months so until fairly recently a lot of my personal github stuff fell by the wayside. It's getting late here but I'll make some time to sort out that regex tomorrow.

As for the Docker username/token, adding your personal account/token to the github repo as secrets would indeed be the approach with this workflow. Does the ASF maybe have the ability or a process for setting up org-level accounts for things like accessing project Docker repos for builds (so eg. to have an apache-tika Docker account that's controlled by the ASF and just has a Docker Hub access token for external CI processes etc)?

I don't have a Docker subscription myself, so I honestly have no idea how the Docker Hub CI works or how it could be configured to build multi-arch images - this workflow wouldn't apply to that process at all 🤔 I set the GH workflow up on my personal fork because running builds on GitHub's CI and then pushing them to Docker Hub and GHCR from there is how I handle most of my hobby Docker projects, but as those have just me working on them I hadn't run into the token sharing problem...

fpiesche avatar Mar 09 '24 01:03 fpiesche

Thank you @fpiesche!

Yes, ideally there is an existing process within the ASF for handling the credentials a CI upload to Docker Hub would need. I would guess this has already been handled, given that there is a process for publishing the existing amd64 images?

Aside, I added one tiny comment on the PR :-)

bartek avatar Mar 12 '24 02:03 bartek

Any chance we could have multi-arch for 2.9.2?

nextgens avatar May 20 '24 09:05 nextgens

@nextgens, I don't have enough knowledge to move forward on this PR alone. If there's a simpler way to achieve multi-arch with fewer mods to our current process, I'd be more than happy to review.

tballison avatar May 20 '24 13:05 tballison

I'd be happy to help with reviewing it, but I don't see a simpler way. Using actions is the best way to achieve this end goal.

stumpylog avatar May 20 '24 13:05 stumpylog

Let me ping infra at asf to see what we need to do to get this working as an action. I think that's the blocker for me.

tballison avatar May 20 '24 13:05 tballison

Pinged asf infra on credentials and how to do this for an asf project.

tballison avatar May 20 '24 13:05 tballison

I opened: https://issues.apache.org/jira/browse/TIKA-4258 to track this on our JIRA. I also opened an issue on infra.

tballison avatar May 20 '24 13:05 tballison

It looks like Airflow at least has moved away from github actions and moved towards a release manager building locally and pushing to dockerhub -- https://cwiki.apache.org/confluence/display/INFRA/Github+Actions+to+DockerHub

Can I just use buildx along these lines to use our current workflow without github actions? https://developers.redhat.com/articles/2023/11/03/how-build-multi-architecture-container-images#docker_buildx

tballison avatar May 20 '24 13:05 tballison

If securing the credentials required for dockerhub is the only concern, I think using github container registry instead may be a great solution. https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry

If you still want the images to be on dockerhub you could sync them (locally or otherwise) using a tool such as https://github.com/regclient/regclient/. We use it in Mailu, see https://github.com/Mailu/Mailu/blob/master/.github/workflows/mirror.yml#L35
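A rough sketch of what such a sync could look like, assuming regclient's `regctl image copy` subcommand (worth checking against the regclient docs before relying on it). The helper name `mirror_commands` is made up, and the repositories in the usage note are examples only.

```python
from typing import Iterable, List


def mirror_commands(source_repo: str, target_repo: str, tags: Iterable[str]) -> List[List[str]]:
    """Build one `regctl image copy` argv per tag, copying source to target."""
    return [
        ["regctl", "image", "copy", f"{source_repo}:{tag}", f"{target_repo}:{tag}"]
        for tag in tags
    ]
```

For example, `mirror_commands("ghcr.io/apache/tika", "docker.io/apache/tika", ["2.9.2", "latest"])` produces two commands, one per tag, which could be run with `subprocess.run(cmd, check=True)` once both registries are logged in.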

nextgens avatar May 20 '24 14:05 nextgens

How's this for a proposed way forward?

We basically keep our current workflow on the release manager's laptop/hardware. We modify our build scripts to build a single-arch image, run our usual tests and then do a second call to docker buildx where we build multiarch images and then deploy to dockerhub?
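Spelled out as a sketch, that plan is three commands run in order, stopping at the first failure so a failed test never reaches the push. Everything here is an assumption for illustration: the helper names, the arguments passed to docker-tool.sh, and the use of a single tag. The real build scripts may look quite different.

```python
import subprocess
from typing import Callable, List, Sequence


def release_steps(image: str, version: str, platforms: Sequence[str]) -> List[List[str]]:
    """Return the commands for: single-arch build, tests, multi-arch build and push."""
    ref = f"{image}:{version}"
    build_arg = f"TIKA_VERSION={version}"
    return [
        # 1. Single-arch build into the local Docker image store.
        ["docker", "build", "--build-arg", build_arg, "-t", ref, "."],
        # 2. Usual tests against the local image (arguments are hypothetical).
        ["./docker-tool.sh", "test", version],
        # 3. Second call: multi-arch build, pushed straight to the registry.
        ["docker", "buildx", "build", "--platform", ",".join(platforms),
         "--build-arg", build_arg, "-t", ref, "--push", "."],
    ]


def run_all(steps: List[List[str]], runner: Callable[..., object] = subprocess.run) -> None:
    """Run the steps in order; check=True makes subprocess.run raise on a failing step."""
    for step in steps:
        runner(step, check=True)
```

A release would then be `run_all(release_steps("apache/tika", "2.9.2", ["linux/amd64", "linux/arm64"]))`. The `runner` parameter exists only so the sequence can be dry-run or tested without Docker installed.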

tballison avatar May 20 '24 14:05 tballison

> If securing the credentials required for dockerhub is the only concern, I think using github container registry instead may be a great solution. https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry
>
> If you still want the images to be on dockerhub you could sync them (locally or otherwise) using a tool such as https://github.com/regclient/regclient/. We use it in Mailu, see https://github.com/Mailu/Mailu/blob/master/.github/workflows/mirror.yml#L35

Awesome. Thank you. ASF infra has a way to do the auth. My current thinking is not to rework our workflow into github actions, but rather see if we can tweak our current workflow to get multi-arch images.

tballison avatar May 20 '24 14:05 tballison

I think building multiarch with buildx requires QEMU, but as long as that's available on the host doing the builds just running buildx should be perfectly fine - that's all the github workflow does after all!

As a side note, I'd also recommend uploading the images to both GHCR and Docker Hub nowadays if that's an option - Docker have started putting restrictions in place to push users into subscriptions so alternatives are good to have. I think you can just specify multiple tags with the docker buildx call to upload to multiple registries (as long as you're authenticated with them all), so eg. -t ghcr.io/apache/tika:latest -t docker.io/apache/tika:latest should upload the image to both of those repos (and even more could be added similarly, eg. RedHat's Quay or an Apache-owned registry if such a thing happens at some point).
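A tiny sketch of that idea: each full image reference becomes its own `-t` flag on the same buildx call. `tag_flags` is a made-up helper, and the push only works if `docker login` has already been done for every registry named.

```python
from typing import Iterable, List


def tag_flags(refs: Iterable[str]) -> List[str]:
    """Turn image references into repeated `-t <ref>` arguments."""
    flags: List[str] = []
    for ref in refs:
        flags += ["-t", ref]
    return flags


cmd = (
    ["docker", "buildx", "build", "--platform", "linux/amd64,linux/arm64"]
    + tag_flags(["ghcr.io/apache/tika:latest", "docker.io/apache/tika:latest"])
    + ["--push", "."]
)
```

Adding another registry is then a matter of appending one more reference to the list.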

fpiesche avatar May 20 '24 15:05 fpiesche

Let's add other registries on a later ticket?

How's this look? https://github.com/apache/tika-docker/pull/21

I haven't tested it -- famous last words...

tballison avatar May 20 '24 16:05 tballison

Wow...it looks like it actually worked?!

Can you all give this a shot? https://hub.docker.com/layers/apache/tika/2.9.2-alpha-multi-arch/images/sha256-b8b6e02e3e9f98ddae33b74881f4ead7846ee12352d53149098857378bb3393d?context=repo

tballison avatar May 20 '24 17:05 tballison