tika-docker
Add GitHub CI workflows for multi-arch Docker images
With the increasing popularity of people running homelab servers and entire clusters on ARM SBCs such as the Raspberry Pi (hello), it might be useful to have multi-architecture images that run on x86, ARM64 and possibly even 32-bit ARM machines such as the Raspberry Pi 2. I've added a GitHub workflow that takes the TIKA_VERSION build arg for the Docker images as an input parameter and builds Tika images for all of these architectures (others are easy to add as long as the base images support them). It then uploads them as a single multi-architecture manifest via Docker's BuildKit functionality.
As an example: https://hub.docker.com/r/florianpiesche/tika-minimal/tags https://hub.docker.com/r/florianpiesche/tika-full/tags
This significantly simplifies the build process for new images: Docker images for arbitrary versions of Tika can be built and pushed out to multiple registries from GitHub's web UI at the click of a button.
👏🏻
Had this on my todo list for next week and wanted to come up with a suggestion as well. Kudos!
Added some more comments documenting what various bits of the workflow and dependabot configuration file do while I was at it!
Awesome, this would let us archive https://github.com/paperless-ngx/tika, which existed solely for providing multi-arch images
@stumpylog my paperless-ngx installation is in fact what I created this PR for :D
@lewismc any chance you could get this merged please?
It would be great to have arm64 images
This would be a fantastic modernization of our release process with the added benefit of multi-arch. Thank you!
I have two small concerns/questions:
- I'm not sure we want dependabot nagging us for base image updates up-to-daily. Or, if it does, is the expectation that we'll make releases up-to-daily? This would be an up-to-daily chore that I'm not sure we have time for.
- How can we run the tests before pushing to docker hub?
more questions...sorry.
3. We moved to versioning of docker images of {tika-version}.{docker.version}. For example, 2.9.0.1 would be the second docker release for Tika 2.9.0. Does this handle that?
4. To confirm, this is building and deploying both full-* and minimal-* and this is tagging the release with the docker release number and latest? I think the answer to both is yes based on looking at the PR more carefully, but wanted to confirm.
I'll put some work into some of your concerns. To answer your questions for now:
- Dependabot will check for base image updates on a daily basis, but will only open a PR if the latest base image tag has changed. This won't actually happen with the current setup as the specified tag is not a versionable tag. What this would do is if e.g. the Dockerfiles switched to using ubuntu:23.10 instead of ubuntu:jammy, dependabot would open a PR to update this to 24.04 once that image is released, and again 24.10 once that drops, etc. Dependabot does not check for whether an already-existing tag has had a new image deployed (as this wouldn't need changes to the Dockerfiles, just a rebuild). Similarly, the github-actions section will ensure that if a new version of any of the Actions used (e.g. docker/build-push-action) is released, Dependabot will open a PR to update the workflows to use this. CI builds can then run to verify that the builds still work, after which the PR could just be merged and the workflows thus kept up to date. This could be an up to daily task if Docker or GitHub decided to release new versions of their actions multiple days in a row, but updates to GitHub actions are generally fairly infrequent as well.
- This is a change I can make - what are the tests you'd want to run? Are they unit tests, or do they involve bringing up the built image and running some tests on the build to ensure it's working properly? Either should be fairly easy to implement.
- Currently the images are tagged with just the Tika version - however I've checked and since you already create tags in this GitHub repo with the proper version numbers, this is just a matter of changing the tag parameter in line 81 of the workflow. I'll change this in a bit...
- This is correct. Images will be built as tika-full and tika-minimal respectively (lines 77/78 of the workflow) and pushed under the latest tag and the version tag to the GitHub container registry and Docker Hub (ie. they'll be accessible as ghcr.io/apache/tika-minimal|tika-latest and docker.io/${{secrets.DOCKERHUB_USERNAME}}/tika-minimal|tika-latest). Tagging the most recent stable release of an image as latest is the de facto standard with most Docker images; development images usually use edge as a generic "most recent development build" option. The Docker metadata action can specify further possible tags, e.g. git commit SHA sums, arbitrary text, build numbers etc.
Wow, thank you @fpiesche! All looks good.
For tests, this is our current release process. Our rudimentary tests are here: https://github.com/apache/tika-docker/blob/master/docker-tool.sh#L39 We'd always welcome more tests!
With this most recent set of changes:
- The workflow will automatically trigger if a new GitHub release is created and/or a new tag (of format *.*.*.*) is pushed
- The build will run and locally store the image, then
  - run the newly-built image
  - as per the docker-tool.sh, check that the service is responding to http requests and running as the expected user
- If tests have passed and either the "push image" flag is set on a manual build or the build is for a tag/release, the image will be pushed to the remote repositories with the following tags:
  - the name of the latest tag
  - the Tika version (so e.g. the image tagged 2.8.0 will always be the latest build out of the 2.8.0.x tags in the git repository)
  - latest
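To make the tagging scheme above concrete, here is a minimal sketch in POSIX shell. It assumes git tags of the form {tika-version}.{docker-version} (e.g. 2.8.0.1) as discussed earlier in the thread. The function name image_tags is hypothetical and is not taken from the workflow, which relies on the Docker metadata action instead.

```shell
# Hypothetical helper, not part of the PR's workflow.
# Prints the three image tags derived from a git tag such as 2.8.0.1:
# the full tag, the Tika version, and "latest".
image_tags() {
  it_git_tag="$1"
  # Accept only four dot-separated numeric components.
  case "$it_git_tag" in
    *[!0-9.]* | .* | *. | *..* | *.*.*.*.*)
      echo "unexpected tag format: $it_git_tag" >&2
      return 1
      ;;
    *.*.*.*)
      ;;
    *)
      echo "unexpected tag format: $it_git_tag" >&2
      return 1
      ;;
  esac
  # Dropping the last component leaves the Tika version (2.8.0.1 -> 2.8.0).
  it_tika_version="${it_git_tag%.*}"
  printf '%s\n' "$it_git_tag" "$it_tika_version" "latest"
}
```

Calling image_tags 2.8.0.1 prints 2.8.0.1, 2.8.0 and latest on separate lines, while a three-component tag such as 2.8.0 is rejected with a non-zero exit status.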
And finally, here's an example build triggered by creating a new GitHub release
A good opportunity to show how Dependabot PRs work! ;) after enabling Dependabot on my fork:
https://github.com/fpiesche/tika-docker/pulls?q=is%3Apr+is%3Aclosed
None of these needed any changes before merging. Manually-run example build after closing the lot, to check the update hasn't broken anything: https://github.com/fpiesche/tika-docker/actions/runs/6751065869
@fpiesche and fellow devs...IIUC, to release this from the Apache account, I'd have to add my personal dockerhub username and password to the secrets in Apache account, where anyone with collaborator status could use it.
Is there a better solution? Should I make the release from my personal fork?
And, sorry for the noise, but we recently changed our main branch to main -- https://issues.apache.org/jira/browse/TIKA-4163
Hello! 👋
This PR has been kicking around for some time and I'm interested in official arm images (right now we are building our own internally).
Is anything blocking this PR from being merged? I see @lewismc requested changes but there have been many updates since.
Does this need shepherding through? It's not clear whether @fpiesche is still active on this work.
Thank you! Happy to help as necessary.
We could definitely use some help. This is not an area of strength for me and has fallen off my plate.
Sorry for dropping off the radar - I've had a lot of Life coming at me over the past few months so until fairly recently a lot of my personal github stuff fell by the wayside. It's getting late here but I'll make some time to sort out that regex tomorrow.
As for the Docker username/token, adding your personal account/token to the GitHub repo as secrets would indeed be the approach with this workflow. Does the ASF maybe have the ability or a process for setting up org-level accounts for things like accessing project Docker repos for builds (so e.g. to have an apache-tika Docker account that's controlled by the ASF and just has a Docker Hub access token for external CI processes etc)?
I don't have a Docker subscription myself so I honestly have no idea how the Docker Hub CI works or how it could be configured to build multi-arch images - this workflow wouldn't apply for that process at all 🤔 I had set the GH workflow up on my personal fork because running builds on GitHub's CI and then pushing them to Docker Hub and GHCR from there is how I handle most of my hobby Docker projects. As those have just me working on them, I hadn't run into the token sharing problem...
Thank you @fpiesche!
Yes, it sounds like there ideally is an existing process within the ASF to handle credentials which would be part of a CI/upload to Docker Hub. I would guess that this has been already handled given there is a process to publish the existing amd64 images?
Aside, I added one tiny comment on the PR :-)
Any chance we could have multi-arch for 2.9.2 ?
@nextgens , I don't have enough knowledge to move forward on this PR alone. If there's a simpler way to achieve multi-arch with fewer mods to our current process, I'd be more than happy to review.
I'd be happy to help with reviewing it, but I don't see a simpler way. Using actions is the best way to achieve this end goal.
Let me ping infra at asf to see what we need to do to get this working as an action. I think that's the blocker for me.
Pinged asf infra on credentials and how to do this for an asf project.
I opened: https://issues.apache.org/jira/browse/TIKA-4258 to track this on our JIRA. I also opened an issue on infra.
It looks like Airflow at least has moved away from github actions and moved towards a release manager building locally and pushing to dockerhub -- https://cwiki.apache.org/confluence/display/INFRA/Github+Actions+to+DockerHub
Can I just use buildx along these lines to use our current workflow without github actions? https://developers.redhat.com/articles/2023/11/03/how-build-multi-architecture-container-images#docker_buildx
If securing the credentials required for dockerhub is the only concern, I think using github container registry instead may be a great solution. https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry
If you still want the images to be on dockerhub you could sync them (locally or otherwise) using a tool such as https://github.com/regclient/regclient/. We use it in Mailu, see https://github.com/Mailu/Mailu/blob/master/.github/workflows/mirror.yml#L35
How's this for a proposed way forward?
We basically keep our current workflow on the release manager's laptop/hardware. We modify our build scripts to build a single-arch image, run our usual tests and then do a second call to docker buildx where we build multiarch images and then deploy to dockerhub?
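A rough sketch of that proposal in POSIX shell, in case it helps the discussion. Everything specific here is an assumption rather than something taken from the existing scripts: the release and run function names, the tika-test container name, the "." build context, the linux/amd64,linux/arm64 platform list, and the check against port 9998 and /version. The existing tests live in docker-tool.sh and should be used instead of the single curl call shown. With DRY_RUN=1 (the default here) the commands are only printed.

```shell
# Hypothetical release sketch; names, port and platforms are assumptions.
# DRY_RUN=1 prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

release() {
  rel_image="$1"
  rel_version="$2"

  # 1. Build a single-arch image for the local platform.
  run docker build --build-arg "TIKA_VERSION=$rel_version" \
    -t "$rel_image:$rel_version-test" . || return 1

  # 2. Start it, check that the server answers over HTTP, then stop it
  #    again whether or not the check passed.
  run docker run -d --rm --name tika-test -p 9998:9998 \
    "$rel_image:$rel_version-test" || return 1
  rel_ok=0
  run curl --fail --silent --retry 5 --retry-connrefused \
    http://localhost:9998/version || rel_ok=1
  run docker stop tika-test
  [ "$rel_ok" -eq 0 ] || return 1

  # 3. Only after the test passed: build for all platforms and push.
  run docker buildx build --platform linux/amd64,linux/arm64 \
    --build-arg "TIKA_VERSION=$rel_version" \
    -t "$rel_image:$rel_version" --push .
}
```

Running release apache/tika 2.9.2 in dry-run mode prints the five commands in order, which makes it easy to review the sequence before letting it touch Docker Hub.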
Awesome. Thank you. ASF infra has a way to do the auth. My current thinking is not to rework our workflow into github actions, but rather see if we can tweak our current workflow to get multi-arch images.
I think building multiarch with buildx requires QEMU, but as long as that's available on the host doing the builds just running buildx should be perfectly fine - that's all the github workflow does after all!
As a side note, I'd also recommend uploading the images to both GHCR and Docker Hub nowadays if that's an option - Docker have started putting restrictions in place to push users into subscriptions so alternatives are good to have. I think you can just specify multiple tags with the docker buildx call to upload to multiple registries (as long as you're authenticated with them all), so e.g. -t ghcr.io/apache/tika:latest -t docker.io/apache/tika:latest should upload the image to both of those repos (and even more could be added similarly, e.g. RedHat's Quay or an Apache-owned registry if such a thing happens at some point).
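The multi-registry idea above could be sketched like this in POSIX shell. The helper only assembles and prints the docker buildx command, so it can be inspected before anything is pushed. The function name buildx_command, the platform list, and the choice to use the Tika version directly as the image tag are assumptions for illustration.

```shell
# Hypothetical helper: prints a buildx command that tags the image for
# every repository given, rather than running it.
# Usage: buildx_command <tika-version> <repository>...
buildx_command() {
  bc_version="$1"
  shift
  bc_cmd="docker buildx build --platform linux/amd64,linux/arm64"
  bc_cmd="$bc_cmd --build-arg TIKA_VERSION=$bc_version"
  # One versioned tag and one "latest" tag per target repository.
  for bc_repo in "$@"; do
    bc_cmd="$bc_cmd -t $bc_repo:$bc_version -t $bc_repo:latest"
  done
  printf '%s\n' "$bc_cmd --push ."
}
```

For example, buildx_command 2.9.2 ghcr.io/apache/tika docker.io/apache/tika prints a single command carrying four -t flags. Adding another registry later is then just one more argument, provided the host is logged in to it.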
Let's add other registries on a later ticket?
How's this look? https://github.com/apache/tika-docker/pull/21
I haven't tested it -- famous last words...
Wow...it looks like it actually worked?!
Can you all give this a shot? https://hub.docker.com/layers/apache/tika/2.9.2-alpha-multi-arch/images/sha256-b8b6e02e3e9f98ddae33b74881f4ead7846ee12352d53149098857378bb3393d?context=repo