build and publish ARM images for kubeflow pipelines

Open thesuperzapper opened this issue 2 years ago • 25 comments

Description

Currently, Kubeflow Pipelines only publishes amd64 container images, whereas most other Kubeflow components now publish for both amd64 and arm64.

Here is the list of images that need to be updated: (this was the list for 2.0.0-alpha.7, more may have been added for 2.0.0+)

  • gcr.io/ml-pipeline/cache-server
  • gcr.io/ml-pipeline/metadata-envoy
  • gcr.io/ml-pipeline/metadata-writer
  • gcr.io/ml-pipeline/api-server
  • gcr.io/ml-pipeline/persistenceagent
  • gcr.io/ml-pipeline/scheduledworkflow
  • gcr.io/ml-pipeline/frontend
  • gcr.io/ml-pipeline/viewer-crd-controller
  • gcr.io/ml-pipeline/visualization-server
  • gcr.io/tfx-oss-public/ml_metadata_store_server
  • gcr.io/google-containers/busybox

While most of these can run under Rosetta (on Apple Silicon Macs only), they run much slower and so are really only useful for testing.

Furthermore, the gcr.io/tfx-oss-public/ml_metadata_store_server image straight up does not work (even under emulation). I have made a separate issue to track this one, as it is not controlled by KFP and is part of google/ml-metadata:

  • https://github.com/kubeflow/pipelines/issues/10308

Love this idea? Give it a 👍.

thesuperzapper avatar Dec 12 '23 23:12 thesuperzapper

@chensun @zijianjoy I think this is a very important issue, as ARM64 machines (especially MacBooks) are now very common.

thesuperzapper avatar Dec 12 '23 23:12 thesuperzapper

I can see that there was a merged PR to make some builds succeed on ARM64 (from 2019):

  • https://github.com/kubeflow/pipelines/pull/2507

But another one got closed due to inactivity:

  • https://github.com/kubeflow/pipelines/pull/3839

I will tag the author of those PRs so they can comment on this @MrXinWang.

thesuperzapper avatar Dec 12 '23 23:12 thesuperzapper

@thesuperzapper Let me know how I can help with this.

rimolive avatar Dec 13 '23 14:12 rimolive

+1 on this issue. Each quarter, more people are switching to Apple Silicon from older Intel Macs

Talador12 avatar Jan 11 '24 22:01 Talador12

Another image is gcr.io/google-containers/busybox, which is used in place of the real image for cached pipeline steps (to run echo that says the step is cached).

thesuperzapper avatar Feb 26 '24 08:02 thesuperzapper

When I tried building the images for linux/arm64, the only hard blockers were actually Python packages in the following images:

  • gcr.io/ml-pipeline/metadata-writer: (the problem is ML Metadata)
    • https://github.com/kubeflow/pipelines/blob/master/backend/metadata_writer/Dockerfile
    • https://github.com/kubeflow/pipelines/blob/master/backend/metadata_writer/requirements.in
  • gcr.io/ml-pipeline/visualization-server: (the problem is TFX)
    • https://github.com/kubeflow/pipelines/blob/master/backend/Dockerfile.visualization
    • https://github.com/kubeflow/pipelines/blob/master/backend/src/apiserver/visualization/requirements.in

The problematic pip packages are:

  • ML Metadata:
    • ml-metadata (https://github.com/google/ml-metadata)
  • TFX Stuff:
    • tensorflow-model-analysis (https://github.com/tensorflow/model-analysis)
    • tensorflow-data-validation (https://github.com/tensorflow/data-validation)
    • tensorflow-serving-api (https://github.com/tensorflow/serving)
    • tensorflow-transform (https://github.com/tensorflow/transform)
    • tfx-bsl (https://github.com/tensorflow/tfx-bsl)
      • (this one is a transitive dependency of the others)

There are already upstream issues for some of them. They mostly relate to Apple Silicon (slightly different from Linux ARM64), but I imagine that solving one will make it much easier to solve the other:

  • ml-metadata:
    • https://github.com/google/ml-metadata/issues/143
  • tensorflow-model-analysis:
    • (no existing upstream issues)
  • tensorflow-data-validation:
    • https://github.com/tensorflow/data-validation/issues/205 (the main issue about getting tfx working)
    • https://github.com/tensorflow/data-validation/issues/141
  • tensorflow-serving-api:
    • https://github.com/tensorflow/serving/issues/1816
    • https://github.com/tensorflow/serving/issues/1948
  • tensorflow-transform:
    • https://github.com/tensorflow/transform/issues/298
  • tfx-bsl:
    • https://github.com/tensorflow/tfx-bsl/issues/48

We either need to get those packages working so they can be pip installed on Linux ARM64, or remove our dependency on them.

thesuperzapper avatar Apr 13 '24 20:04 thesuperzapper

@thesuperzapper metadata-writer and visualization-server are deprecated kfpv1 components, so they're not required for kfpv2.

rimolive avatar Apr 15 '24 20:04 rimolive

We run a small ARM-based cluster which we want to run Kubeflow on, so I have started to build the components for ARM. I've been successful at building the cache-server, persistence agent, scheduled workflow agent, viewer-crd-controller, and frontend. I only had to set --platform=$BUILDPLATFORM as an argument in the first Dockerfile stage and, for all the Go based components, add GOOS=$TARGETOS GOARCH=$TARGETARCH in the go build step. However, building the API server seems to need a little more work.

The main reason for this is that https://github.com/mattn/go-sqlite3/ now needs to be compiled with a cross-compiler, so I have to run apt-get install -y gcc-aarch64-linux-gnu g++-aarch64-linux-gnu and set the CC=aarch64-linux-gnu-gcc CXX=aarch64-linux-gnu-g++ CGO_ENABLED=1 environment variables during go build, which works!
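A minimal sketch of the environment described above, written as a small Go helper that only assembles the variables and prints the resulting command line; it does not run the build. The names crossBuildEnv and crossCompilers are hypothetical, and the arm64 entry assumes the Debian gcc-aarch64-linux-gnu / g++-aarch64-linux-gnu packages are installed on an amd64 build host.

```go
package main

import (
	"fmt"
	"strings"
)

// crossCompilers maps a GOARCH value to a GNU cross-toolchain prefix.
// Only the arm64 entry comes from the comment above; anything else
// would be an assumption.
var crossCompilers = map[string]string{
	"arm64": "aarch64-linux-gnu",
}

// crossBuildEnv returns the environment variables to set for `go build`
// when targeting targetOS/targetArch. With cgo enabled it also selects
// the C and C++ cross-compilers, if one is known for the target.
func crossBuildEnv(targetOS, targetArch string, cgo bool) []string {
	env := []string{"GOOS=" + targetOS, "GOARCH=" + targetArch}
	if !cgo {
		// Pure-Go components only need GOOS/GOARCH.
		return append(env, "CGO_ENABLED=0")
	}
	env = append(env, "CGO_ENABLED=1")
	if prefix, ok := crossCompilers[targetArch]; ok {
		env = append(env, "CC="+prefix+"-gcc", "CXX="+prefix+"-g++")
	}
	return env
}

func main() {
	// Print a command line that a Dockerfile RUN step could use.
	fmt.Println(strings.Join(crossBuildEnv("linux", "arm64", true), " "), "go build ./...")
}
```

In a multi-stage Dockerfile, the two arguments would typically come from the TARGETOS and TARGETARCH build args rather than being hard-coded.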

However, this seems very fragile to changes in build server, new CPU architectures, etc., so I looked into why we even include SQLite - and the answer seems to be that we only use SQLite for integration testing? So perhaps it would make sense to exclude it in the production image?

One way to do this is to move SQLite references to a separate db_sqlite.go file and use a // +build integration tag, and change test runs to use go test --tags=integration for integration tests. That would make it possible to build the API server without additional C/C++ cross-compilers.
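Roughly, the split could look like the sketch below. This is not the actual API server code: the dialects registry, registerDialect, and the "mysql" entry are hypothetical names used only to show how a tag-guarded file can plug in SQLite without the default build ever touching cgo.

```go
// main.go: compiled in every build, needs no C toolchain.
package main

import (
	"fmt"
	"sort"
)

// dialects maps a driver name to a constructor. Both the registry and the
// "mysql" entry are hypothetical stand-ins for the real database setup.
var dialects = map[string]func() string{
	"mysql": func() string { return "mysql dialect" },
}

// registerDialect lets optional, tag-guarded files add drivers from init().
func registerDialect(name string, newDialect func() string) {
	dialects[name] = newDialect
}

// availableDialects lists the drivers compiled into this binary, sorted.
func availableDialects() []string {
	names := make([]string, 0, len(dialects))
	for name := range dialects {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	fmt.Println(availableDialects())
}

// db_sqlite.go would be a separate file, only compiled with -tags=integration:
//
//	//go:build integration
//
//	package main
//
//	import _ "github.com/mattn/go-sqlite3"
//
//	func init() {
//		registerDialect("sqlite3", func() string { return "sqlite3 dialect" })
//	}
```

Since Go 1.17 the constraint is spelled //go:build integration; // +build integration is the older form of the same thing. A plain go build then never compiles the cgo-backed driver, while go test -tags=integration does.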

In fact, I have done this on our custom build and now I can build the binary and Docker container without SQLite with the same configuration change as with the other components mentioned above.

AndersBennedsgaard avatar May 15 '24 09:05 AndersBennedsgaard

I am considering looking at contributing some of my changes here, but I can't really figure out how the images are built. I expect that it has something to do with https://github.com/kubeflow/pipelines/blob/master/.cloudbuild.yaml? Perhaps @rimolive can give some pointers?

Also, what do you think of my proposal to remove SQLite from the final Go binary and only enable it for integration tests using build flags?

AndersBennedsgaard avatar Jun 12 '24 06:06 AndersBennedsgaard

@AndersBennedsgaard if you want a quick way to build all the images for testing, you can use the same approach as the deployKF fork of Kubeflow Pipelines (deployKF/kubeflow-pipelines), which uses GitHub Actions (GHA) to build the images.

You can just take the same GHA configs that we added in this commit: https://github.com/deployKF/kubeflow-pipelines/commit/d800253041febdf3ac2d5124d836e01a6a878e92. Even if you don't use the GHA configs directly, you can use them to figure out the full list of images that make up Kubeflow Pipelines and where their Dockerfiles are.

NOTE: these workflows have build_platforms set to linux/amd64, but you could update it to linux/amd64 linux/arm64 (whitespace-separated) once you fix the ARM build issues, and they will then be built for both architectures.

NOTE 2: this excludes the gcr.io/tfx-oss-public/ml_metadata_store_server image, which is managed upstream (google/ml-metadata). I made a PR to allow building it on ARM (https://github.com/google/ml-metadata/pull/188), but even if they merged that, Google doesn't know how to build ARM images (or something like that), so we have a fork for that too (deployKF/ml-metadata). You can just use the following image, which is cross-compiled for ARM/X86: ghcr.io/deploykf/ml_metadata_store_server:1.14.0-deploykf.0

thesuperzapper avatar Jun 13 '24 02:06 thesuperzapper

@thesuperzapper as I mentioned in https://github.com/kubeflow/pipelines/issues/10309#issuecomment-2111979084, we already have KFP fully running on an ARM-only cluster, so I have already cross-compiled the images using BuildX+Qemu in our own fork. I was talking about contributing the changes back upstream, but if you say that "Google doesn't know how to build ARM images", it might be hard for me to do. Alternatively, we could consider switching the CI pipeline to GH actions, since most (all?) other Kubeflow components already use this.

AndersBennedsgaard avatar Jun 13 '24 06:06 AndersBennedsgaard

> Alternatively, we could consider switching the CI pipeline to GH actions, since most (all?) other Kubeflow components already use this

We are already working on migrating the CI pipelines to GitHub Actions. See https://github.com/kubeflow/pipelines/issues/10744

rimolive avatar Jun 13 '24 11:06 rimolive

@rimolive #10744 does not mention changing the release workflow logic to GH actions. Should we include these in that issue?

@thesuperzapper would you mind adding all the relevant -license-compliance images built for KFP? Such as gcr.io/ml-pipeline/workflow-controller

AndersBennedsgaard avatar Jun 17 '24 15:06 AndersBennedsgaard

> @rimolive https://github.com/kubeflow/pipelines/issues/10744 does not mention changing the release workflow logic to GH actions. Should we include these in that issue?

Our priority is fixing the tests. We can figure out moving the release workflow to GHA too, but at a later time.

rimolive avatar Jun 17 '24 15:06 rimolive

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Aug 17 '24 07:08 github-actions[bot]

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

github-actions[bot] avatar Sep 08 '24 07:09 github-actions[bot]

/reopen

thesuperzapper avatar Sep 08 '24 16:09 thesuperzapper

@thesuperzapper: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow[bot] avatar Sep 08 '24 16:09 google-oss-prow[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Nov 09 '24 07:11 github-actions[bot]

(still relevant, bumping comment to avoid stale status)

tarilabs avatar Nov 09 '24 07:11 tarilabs

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jan 09 '25 07:01 github-actions[bot]

/lifecycle frozen

hbelmiro avatar Jan 09 '25 11:01 hbelmiro

Hi. Any updates on this issue? Is there a current list of what's working and what's missing? Thanks in advance!

Matt.

mattsee avatar Jan 27 '25 20:01 mattsee