markitdown Add GitHub Actions workflow to build and push multi-arch Docker image

This will build image for both archs.

Apr 13 '25 10:04 seuros

@microsoft-github-policy-service agree

Apr 13 '25 10:04 seuros

Hey, I just made a similar PR yesterday : https://github.com/microsoft/markitdown/pull/1184

I have a question about some choices you made on your workflow ; normally i work with Gitlab and not Github so it was the first workflow i made, i henceforth don't have much knowledge about them.

Could you please tell my why you use docker/setup-qemu-action@v3 ? As the workflow is for building Docker images, i don't quite understand why you put a Qemu action...

As for docker/setup-buildx-action@v3, i just followed a tutorial that was not using it so it's my bad :)

Thanks !

Apr 13 '25 12:04 goulvenb

This looks very promising. Thank you. (Also with #1184)

Perhaps the two of you can discuss the differences, and we arrive at the best.

But, before that: I need to check some procedural stuff on my end before I can merge this (re: hosting images rather than just Dockerfiles). Let me spend a day or two on this, and I'll get back to you.

Apr 13 '25 16:04 afourney

I just answered in #1184 I believe it would be better to communicate either here of in #1184, so i commented first on my PR, but in case you want to continue here here's the gist of it :

Hey @afourney, As i just said in #1186 :

it was the first workflow i made, i henceforth don't have much knowledge about them

However, i would love to talk about it with @seuros.

From what i see, here's a table difference :

#1184	#1186
We create a container on tag creation	We create a container when pushing on branch `main` ; from a PR or not
We do not set the permissions of the PAT ; it is implicitly set	We set the permissions of the PAT to the minimum
Use the 6th version of `docker/build-push-action`	Use the 5th version of `docker/build-push-action`

Additionally, @seuros added a docker/setup-qemu-action@v3 action that i do not know the use of, and a docker/setup-buildx-action@v3 action that allow to create, if i understand well, an image for the architecture "linux/amd64" and "linux/arm64"

From what i see, i believe we should implement the following :

#1184 : We create a container on tag creation
- This way, we'll have access to ghcr.io/microsoft/markitdown:v1.0.0, ghcr.io/microsoft/markitdown:v2.0.0, ... Instead of just ghcr.io/microsoft/markitdown:main
#1186 : We set the permissions of the PAT to the minimum
- Better security practice to set the permissions in hard instead of implicitly letting Github to set them
#1184 : Use the 6th version of docker/build-push-action
- I don't see why v5 should be used instead of v6 ?
#1186 : Use docker/setup-buildx-action@v3
- Did not know about this, but it's better to create an image for different CPU architecture

For the Qemu action, i don't really know until i've talked with @seuros.

Apr 13 '25 17:04 goulvenb

Hey @goulvenb

Thanks for your message - and nice work on your PR (#1184)!

To answer your question about docker/setup-qemu-action@v3: we include it because the workflow builds Docker images for multiple architectures, namely amd64 and arm64. QEMU enables emulation for these platforms in the same environment.

As for docker/setup-buildx-action@v3, it's required to use Docker Buildx, which supports multi-platform builds and caching strategies. It use build-push-action and qemu under the hood.

Also, just to clarify: I had this workflow running for a while and only saw your PR after a notification from the bot requesting me to agree for the terms.

To collaborate better: – I added you to my repo, so you can push updates like version upgrades and tagging directly if you'd like — this way the PR will be co-authored. – Alternatively, feel free to continue working from your repo and cherry-pick this commit, but your PR will need to have just 2 commits. Then i will close this one.

Sorry for the delay in responding, I was on the road. Looking forward to syncing up on this!

Thanks again!

Apr 13 '25 23:04 seuros

No worries @seuros ! And also, thanks for your answers to my questions !

As you added me to your repo, and as i'm still relatively new to collaboration using git, i believe it would be wiser to work from your fork.

I just made a branch dev/goulvenb with my future edits on it. For now, it contain the recommendations i made above plus editing the README.md file to use the newly provided docker images. I made a PR to merge it into our main branch.

If you see any additional modification we can make, feel free to make them ! Personally, apart from this PR, i believe we are ready to go.

Thanks !

Apr 14 '25 06:04 goulvenb

I forgot to ask :

but your PR will need to have just 2 commits.

Why 2 commits ?

I'm asking more in the way of "why would afourney care about the number of commits" ? Like i would personally prefer to have 10/20 commits to know "why" a particular line was made, but maybe there's another reason ?

Apr 14 '25 06:04 goulvenb

When reviewing code in open source projects, maintaining clean, purposeful commits is important. Having just 2 commits makes perfect sense since we are pairing in this one.

Why clean commits matter: Clean commits make the codebase history more navigable and useful. In open source especially, where many contributors interact with the code, commit history becomes a critical documentation tool. There are tools that search for clues commit by commit to know why something was decided. You don't want it to end up in commits where you were just tinkering.

When a maintainer sees 10 commits on a Monday morning, they expect significant feature additions - not just 2 features where a developer spent 8 commits figuring things out. This wastes review time and makes the project harder to understand.

While you might see 10 commits, the size impact on the repo is significant if there are too many iterations. If you add a YouTube video, a log file or a binary in one commit by mistake, then delete it in another, the maintainer will see a clean final change, but if merged and not squashed, everyone will have to download that file as part of the git history forever.

A good practice when starting out is to have 1 commit PER PR you open.

One trick to make it clean is to open a PR against your own repo and squash it. Later you'll get familiar with amending and rebasing from the console. ( i will squash your PR in my fork )

You'll also get less rejection from the OSS community to merge your work. (Speaking from experience)

Apr 14 '25 07:04 seuros

Thank you for your insight ! As i said, i'm pretty new to collaboration using git so any intake is welcome 😊

I'm gonna search a few terms, like squashing vs merging, but now at least i know there's an alternative to merging.

Back to the topic, i'll let you ping afourney when you think it's ready to go. Thanks again !

Apr 14 '25 11:04 goulvenb

Well, I think @seuros did not see my message...

However, as everything is up to date and as no uncommitted changes were made in a week (in our repo), i believe this is the best of our 2 PR, @afourney.

Apr 20 '25 11:04 goulvenb

@afourney any updates? The race is on: this PR merge vs WW3 vs GTA6 release. My money's still on this PR!

Jun 14 '25 22:06 seuros

@afourney could this be merged, I really need it in docker...

Oct 31 '25 18:10 theurzil