Umbrella Issue: Porting Kubeflow to IBM Power (ppc64le)

Open lehrig opened this issue 3 years ago • 21 comments

/kind feature

Enable builds & releases for IBM Power (ppc64le architecture). This proposal was presented with these slides at the 2022-10-25 Kubeflow community call and received positive community feedback. We also created this design document: https://docs.google.com/document/d/1nGUvLonahoLogfWCHsoUOZl-s77YtPEiCjWBVlZjJHo/edit?usp=sharing

Why you need this feature:

  • Widen scope of possible on-premises deployments (vanilla Kubernetes & OpenShift on Power)
  • More general independence regarding processor architecture (x86, ppc64le, arm, …)
  • Unified container builds

Describe the solution you'd like:

  • Upstreaming changes that allow building Dockerfiles for multiple architectures (starting with x86 & ppc64le)
  • Upstreaming CI integration for multi-arch builds (starting with x86 & ppc64le)

We currently plan to divide our efforts into multiple phases:

  1. low-hanging "easy" integrations where no or minor code changes are needed; excluding KFP; Kubeflow 1.7 release scope (✅ done),
  2. same as 1. but now including additional KServe components for model serving; Kubeflow 1.8 release scope,
  3. same as 1. but now including KFP; Kubeflow 1.9 release scope,
  4. more complex integrations where external dependencies to python wheels exist.

Below is a detailed overview of each required integration, including links to associated PRs if those already exist.

Phase 1 Integrations (Kubeflow 1.7 scope)

  • [x] Poddefaults (Admission) Webhook: https://github.com/kubeflow/kubeflow/pull/6650, https://github.com/kubeflow/kubeflow/pull/6803 🚀 https://hub.docker.com/r/kubeflownotebookswg/poddefaults-webhook/tags
  • [x] Central Dashboard: https://github.com/kubeflow/kubeflow/pull/6861, https://github.com/kubeflow/kubeflow/pull/6923 🚀 https://hub.docker.com/r/kubeflownotebookswg/centraldashboard/tags
  • [x] Jupyter Web App: https://github.com/kubeflow/kubeflow/pull/6650, https://github.com/kubeflow/kubeflow/pull/6800 🚀 https://hub.docker.com/r/kubeflownotebookswg/jupyter-web-app/tags
  • [x] KServe: Agent: https://github.com/kserve/kserve/pull/2476, https://github.com/kserve/kserve/pull/2549 🚀 https://hub.docker.com/r/kserve/agent/tags
  • [x] KServe: Controller: https://github.com/kserve/kserve/pull/2476, https://github.com/kserve/kserve/pull/2550 🚀 https://hub.docker.com/r/kserve/kserve-controller/tags
  • [x] KServe: Models Web App: https://github.com/kserve/models-web-app/pull/45, https://github.com/kserve/models-web-app/pull/55 🚀 https://hub.docker.com/r/kserve/models-web-app/tags
  • [x] KServe: QPExt: https://github.com/kserve/kserve/pull/2604 🚀 https://hub.docker.com/r/kserve/qpext/tags
  • [x] KServe: Router: https://github.com/kserve/kserve/pull/2605 🚀 https://hub.docker.com/r/kserve/router/tags
  • [x] MPI Operator: https://github.com/kubeflow/mpi-operator/pull/489 🚀 https://hub.docker.com/r/mpioperator/mpi-operator/tags
  • [x] Notebook Controller: https://github.com/kubeflow/kubeflow/pull/6650, https://github.com/kubeflow/kubeflow/pull/6771 🚀 https://hub.docker.com/r/kubeflownotebookswg/notebook-controller/tags
  • [x] Profiles + KFAM: https://github.com/kubeflow/kubeflow/pull/6650, https://github.com/kubeflow/kubeflow/pull/6785, https://github.com/kubeflow/kubeflow/pull/6809 🚀 https://hub.docker.com/r/kubeflownotebookswg/profile-controller/tags 🚀 https://hub.docker.com/r/kubeflownotebookswg/kfam/tags
  • [x] Tensorboard Controller: https://github.com/kubeflow/kubeflow/pull/6650, https://github.com/kubeflow/kubeflow/pull/6805 🚀 https://hub.docker.com/r/kubeflownotebookswg/notebook-controller/tags
  • [x] Tensorboard Web App: https://github.com/kubeflow/kubeflow/pull/6650, https://github.com/kubeflow/kubeflow/pull/6810 🚀 https://hub.docker.com/r/kubeflownotebookswg/tensorboards-web-app/tags
  • [x] Training Operator: https://github.com/kubeflow/training-operator/pull/1674, https://github.com/kubeflow/training-operator/pull/1692 🚀 https://hub.docker.com/r/kubeflow/training-operator/tags
  • [x] Volumes Web App: https://github.com/kubeflow/kubeflow/pull/6650, https://github.com/kubeflow/kubeflow/pull/6811 🚀 https://hub.docker.com/r/kubeflownotebookswg/volumes-web-app/tags

Phase 2 Integrations (Kubeflow 1.9 scope)

  • [ ] KServe: PMML Server
  • [ ] KServe: AIX
  • [ ] KServe: Alibi
  • [ ] KServe: Art
  • [ ] Triton Inference Server (external)
  • [ ] Seldon: ML Server (external)
  • [ ] PyTorch: TorchServe (external)

Phase 3 Integrations (Kubeflow 1.10 scope)

Note: KFP is currently blocked by https://github.com/kubeflow/pipelines/issues/8660 / https://github.com/GoogleCloudPlatform/oss-test-infra/issues/1972

  • [ ] KFP: Application-CRD-Controller
  • [ ] KFP: Argoexec
  • [ ] KFP: Cache-Server
  • [ ] KFP: Frontend
  • [ ] KFP: Metadata Envoy
  • [ ] KFP: Persistence Agent
  • [ ] KFP: Scheduled Workflow
  • [ ] KFP: Workflow Controller
  • [ ] KFP: Viewer-CRD-Controller
  • [ ] KServe: LGB Server: blocked by https://github.com/pyca/cryptography/issues/7723
  • [ ] KServe: Paddle Server: blocked by https://github.com/pyca/cryptography/issues/7723
  • [ ] KServe: SKLearn Server: blocked by https://github.com/pyca/cryptography/issues/7723
  • [ ] KServe: XGB Server: blocked by https://github.com/pyca/cryptography/issues/7723
  • [ ] Katib: controller, db-manager, ui
  • [ ] Katib: file-metrics-collector
  • [ ] Katib: tfevent-metrics-collector
  • [ ] Katib: suggestion-hyperopt: https://github.com/kubeflow/katib/pull/2262, https://github.com/kubeflow/katib/pull/2290
  • [ ] Katib: suggestion-chocolate
  • [ ] Katib: suggestion-hyperband: https://github.com/kubeflow/katib/pull/2262, https://github.com/kubeflow/katib/pull/2290
  • [ ] Katib: suggestion-skopt: https://github.com/kubeflow/katib/pull/2262, https://github.com/kubeflow/katib/pull/2290
  • [ ] Katib: suggestion-goptuna
  • [ ] Katib: suggestion-optuna: https://github.com/kubeflow/katib/pull/2262, https://github.com/kubeflow/katib/pull/2290
  • [ ] Katib: suggestion-enas
  • [ ] Katib: suggestion-darts: https://github.com/kubeflow/katib/pull/2262, https://github.com/kubeflow/katib/pull/2290
  • [ ] Katib: suggestion-pbt: https://github.com/kubeflow/katib/pull/2262, https://github.com/kubeflow/katib/pull/2290
  • [ ] Katib: earlystopping-medianstop: https://github.com/kubeflow/katib/pull/2290

Phase 4 Integrations (Post Kubeflow 1.11 scope)

  • [ ] KFP: Api Server
  • [ ] KFP: Metadata Writer
  • [ ] KFP: Visualization Server
  • [ ] ml-metadata (KFP wheel dep.): https://github.com/google/ml-metadata/pull/171
  • [ ] KServe: Storage Initializer: blocked by https://github.com/pyca/cryptography/issues/7723
  • [ ] ~~OIDC Auth (external): https://github.com/arrikto/oidc-authservice/issues/104; on-hold as potentially irrelevant as of Kubeflow v1.8 (https://github.com/kubeflow/manifests/issues/2469)~~

lehrig avatar Oct 25 '22 18:10 lehrig

Thanks for creating this tracking issue @lehrig!

I'm on board with adding support for ppc64le, since this will greatly help KF adoption. The proposed plan makes sense.

My initial question at this time is whether we need to build different executables for this platform, which would mean we need a new set of images. I see in the PRs that the only change needed is to not set a specific platform, but I might be missing something.

Could you provide some more context on this one?

kimwnasptd avatar Oct 31 '22 12:10 kimwnasptd

@kimwnasptd, thanks for your support!

There are essentially 2 options for publishing images:

  1. Multi-arch images, where we publish only one "virtual" image (a manifest list) with support for multiple architectures. A pull command then fetches only the concrete container image for the required platform. To do so, I'd recommend using buildx (e.g., see https://www.docker.com/blog/multi-arch-build-and-images-the-simple-way/) because it is easier to use and more automated than manually creating a container manifest for multiple architectures.
  2. Separate images per architecture.

IMO 1. should be the preferred solution; a minimal example of the buildx flow is sketched below. A challenge here will be that builds across Kubeflow components are quite inconsistent; for example, some projects already use buildx while others don't. I'd opt for introducing more consistency in the scope of this endeavor, e.g., by migrating builds towards buildx where feasible. My team would be willing to drive this, if that sounds good.
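
For illustration, here is a minimal sketch of such a buildx flow (the builder name and image tag are placeholders, and this assumes Docker with the buildx plugin on an amd64 build machine):

```bash
# One-time setup on the build machine: register QEMU emulators so that
# non-native platforms such as ppc64le can be built on amd64.
docker run --privileged --rm tonistiigi/binfmt --install all

# Create and select a builder instance that supports multi-platform builds.
docker buildx create --name kf-multiarch --use

# Build one multi-arch image for amd64 and ppc64le and push it under a single
# tag; the registry then serves the right variant for each pulling platform.
docker buildx build \
  --platform linux/amd64,linux/ppc64le \
  --tag example.registry/kubeflow/some-component:latest \
  --push .
```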

lehrig avatar Nov 03 '22 08:11 lehrig

@lehrig I agree with the first approach as well, if it's viable to avoid having multiple manifests.

Docker's buildx seems promising. I hadn't used it in the past, but it seems quite straightforward. I don't have a hard preference on using buildx, as long as we don't lock ourselves in and end up with Dockerfiles that need specific Docker features and can only be built with Docker.

kimwnasptd avatar Nov 10 '22 16:11 kimwnasptd

As wished by @kimwnasptd, quoting myself from https://github.com/kubeflow/kubeflow/pull/6650#discussion_r1029586906 to clarify how we envision multi-arch builds:

Yes, it's good to let Go determine the arch, so we don't have to maintain an explicit arch list here or wrap the code in arch-specific if/else boilerplate.

Instead, we shift control over which arch is actually built to the build system. If we do nothing special, the arch of the build machine is simply used (and as Kubeflow is currently built on amd64, we stay backwards-compatible with the current behavior).

In further PRs, we will additionally modify Docker-based builds to use buildx, where you can, for instance, do something like this: docker buildx build --platform linux/amd64,linux/ppc64le ...

Here, Docker will actually run 2 builds: one for amd64 and one for ppc64le. When it reaches the above Go code, Go picks up the external platform configuration and builds correctly for it. If no native hardware is available for a given platform, Docker emulates the architecture using QEMU, so you can also build for different archs on amd64.

The final outcome is a single multi-arch image supporting all archs listed in Docker's --platform argument.
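
To make the cross-compilation point concrete, here is a rough sketch (package path and binary names are placeholders, not the actual Kubeflow layout): the target arch is selected purely by the build environment, so the source needs no arch list at all.

```bash
# Native build: GOOS/GOARCH default to the build machine (amd64 today).
go build -o bin/manager ./cmd/manager

# Cross-build for Power: only the environment changes, not the source code.
# buildx achieves the same effect per --platform entry, either via its
# TARGETOS/TARGETARCH build args or by running the build under QEMU.
GOOS=linux GOARCH=ppc64le CGO_ENABLED=0 go build -o bin/manager-ppc64le ./cmd/manager
```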

lehrig avatar Nov 23 '22 13:11 lehrig

@lehrig @adilhusain-s! the first image in this repo with support for ppc64le is up! 🎉

https://hub.docker.com/layers/kubeflownotebookswg/notebook-controller/80f695e/images/sha256-2870219816f6be1153ca97eb604b4f20393c34afdb4eade83f0966ccf90f8018?context=explore
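
For anyone who wants to double-check locally, the manifest list behind that tag can be inspected with a reasonably recent Docker CLI:

```bash
# Lists the per-architecture entries (amd64, ppc64le, ...) behind the single tag.
docker buildx imagetools inspect kubeflownotebookswg/notebook-controller:80f695e

# Equivalent view via the classic manifest command.
docker manifest inspect kubeflownotebookswg/notebook-controller:80f695e
```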

kimwnasptd avatar Dec 02 '22 14:12 kimwnasptd

@lehrig @pranavpandit1 @adilhusain-s I realized that right now we've implemented the docker buildx logic only in the Actions that run when a PR is merged, and not when a PR is opened.

I realized that the Centraldashboard image was not getting built, even though the PR checks were green: 28a24ffb170769a228d46a19892f7420b22a0816 74f020e0d9c3f58712a3b466f9d1bb86c4607beb 65e41bf28b8e79be4e1f822afe56e218c69db8a1

We fixed the issue for this in https://github.com/kubeflow/kubeflow/pull/6960, but we should be able to catch errors for the multi-arch build when a PR is opened as well.

The fix should be straightforward: we just need to use the same build command in both types of actions. Referencing the relevant parts of each: https://github.com/kubeflow/kubeflow/blob/master/.github/workflows/centraldb_intergration_test.yaml#L24 https://github.com/kubeflow/kubeflow/blob/master/.github/workflows/centraldb_docker_publish.yaml#L37-L41
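
In other words, both workflow types would run the same buildx invocation and only differ in whether they push (IMG and TAG below stand in for whatever variables the workflows already define):

```bash
# PR check: build both platforms so multi-arch breakage surfaces before merge,
# but do not push anything.
docker buildx build \
  --platform linux/amd64,linux/ppc64le \
  --tag "${IMG}:${TAG}" .

# Publish on merge: identical command, plus --push to publish the manifest list.
docker buildx build \
  --platform linux/amd64,linux/ppc64le \
  --tag "${IMG}:${TAG}" \
  --push .
```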

kimwnasptd avatar Feb 15 '23 13:02 kimwnasptd

Do you have cycles to help with this effort?

kimwnasptd avatar Feb 15 '23 13:02 kimwnasptd

@kimwnasptd let me confirm with the team but I think we can handle it

lehrig avatar Feb 15 '23 13:02 lehrig

@kimwnasptd let me confirm with the team but I think we can handle it

@kimwnasptd: Thanks for all the inputs, we have started looking into the required changes and will keep everyone updated once we start raising PRs for the same.

pranavpandit1 avatar Feb 17 '23 11:02 pranavpandit1

Note: I updated the main description by adding a phase for the Kubeflow 1.8 scope and linking to this new design document: https://docs.google.com/document/d/1nGUvLonahoLogfWCHsoUOZl-s77YtPEiCjWBVlZjJHo/edit?usp=sharing

We use this document to discuss Phase 2 with the KFP community (related to https://github.com/kubeflow/pipelines/issues/8660).

lehrig avatar Feb 17 '23 14:02 lehrig

Thanks @pranavpandit1! I also took a look at how to get these to work when fixing the workflows for the CentralDashboard. You can take a look at this PR and some comments: https://github.com/kubeflow/kubeflow/pull/6961

kimwnasptd avatar Feb 21 '23 11:02 kimwnasptd

@lehrig I think we bumped into a side effect that I hadn't thought about initially. Building the images in GH Actions (which uses emulation via QEMU) is actually slow.

Looking at an open PR https://github.com/kubeflow/kubeflow/pull/7060 that touches some web apps I see the following:

  1. The workflow that builds for both platforms (VWA) takes 61 minutes (!)
  2. The workflow that builds the old way (TWA) takes 9 minutes

The difference is huge, so I want to re-evaluate the approach of building for ppc64le when PRs are opened.

From your experience, in which cases is it most likely for the x86 build to succeed but the ppc64le build to fail?

kimwnasptd avatar Mar 24 '23 15:03 kimwnasptd

@kimwnasptd yeah, I agree that this is suboptimal. The answer is obviously "it depends"; however, I think we have some hard evidence here that we should not proceed as originally planned. I see the following options for builds when a PR is opened.

  1. Exclude non-x86 archs as long as native hardware is unavailable (example: https://github.com/DSpace/dspace-angular/pull/1667).
  2. Exclude non-x86 archs only if building them takes too long.
  3. Wait for native ppc64le out-of-the-box support in GHA, which hopefully comes this year (this would not slow down builds since no emulation is used).
  4. Integrate an SSH-based connection to native hardware we can provide into the workflow (see this example: https://github.com/adilhusain-s/multi-arch-docker-example/blob/main/.github/workflows/native_docker_builder.yaml#L31).
  5. Integrate a GitHub app that connects GHA builds to native hardware when needed (experimental).
  6. (not sure this is technically possible) Start the ppc64le QEMU build asynchronously & don't let the PR wait for the ppc64le build to complete, so it doesn't block.

Note: Options 1, 2 & 6 are based on my observation that ppc64le typically builds without errors whenever x86 builds without errors. Hence, we can usually accept PRs when only x86 is built; rare corner cases are then discovered on PR merge.

If exclusion (options 1 or 2) is OK, I'd go for that & later migrate to option 3 once native ppc64le GHA support becomes available later this year (a rough sketch of option 2 follows below). Option 4 is possible but would require additional effort and organization on our side, so I see it only as a backup option; same for option 5. Option 6 has not been tested so far, so I would not go for it at the moment.
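
For option 2, a rough sketch of what the workflows could do (the variable handling is illustrative, not the actual workflow content):

```bash
# Full platform matrix only on merge; PR checks stay amd64-only so the
# emulated ppc64le build does not blow the time budget.
if [ "${GITHUB_EVENT_NAME}" = "push" ]; then
  PLATFORMS="linux/amd64,linux/ppc64le"
else
  PLATFORMS="linux/amd64"
fi

docker buildx build --platform "${PLATFORMS}" --tag "${IMG}:${TAG}" .
```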

lehrig avatar Mar 27 '23 09:03 lehrig

Here are some stats that help to get a feeling for those options:

Notebook-controller build

  • Native ppc64le: 2.2 min
  • QEMU ppc64le: 12 min and 47 sec
  • Native x86: 2 min and 6 sec

Volume-web-app build

  • Native ppc64le: 8.21 min
  • QEMU ppc64le: 36 min and 35 sec
  • Native x86: 6 min and 56 sec

Central-dashboard build

  • Native ppc64le: 3.22 min
  • QEMU ppc64le: 10 min and 26 sec
  • Native x86: 1 min and 55 sec

lehrig avatar Mar 28 '23 11:03 lehrig

I discussed the options with the team. Here is our proposal:

  • On PR opened, we recommend option 2: build ppc64le only where it doesn't take too long, and otherwise disable ppc64le builds as showcased in https://github.com/DSpace/dspace-angular/pull/1667
  • Looking at the above stats, we believe "not too long" holds for all builds taking <= 30 min.
  • As soon as option 3 becomes available, migrate all workflows to it: run everything natively and enable it for all opened PRs.
  • On PR merged, we recommend always building all supported architectures.
  • We also recommend generally improving build performance by enabling caching during builds, which should lower build times by 30-40%: https://github.com/kubeflow/community/issues/779 (a rough sketch follows below). With caching enabled, more components will come in under the 30 min threshold.
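
A rough sketch of what buildx caching could look like (registry-backed cache shown here; the buildcache ref and the IMG/TAG variables are placeholders, and GitHub's built-in cache backend would be an alternative):

```bash
# Reuse layers from a previously exported cache and publish an updated cache
# next to the image, so repeat builds skip unchanged layers.
docker buildx build \
  --platform linux/amd64,linux/ppc64le \
  --tag "${IMG}:${TAG}" \
  --cache-from "type=registry,ref=${IMG}:buildcache" \
  --cache-to "type=registry,ref=${IMG}:buildcache,mode=max" \
  --push .
```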

@kimwnasptd, does that sound good? What do you think about the 30 min. threshold?

lehrig avatar Mar 28 '23 12:03 lehrig

Still have to answer this:

From your experience, in which cases is it most likely for the x86 build to succeed but the ppc64le build to fail?

This seldom happens once ppc64le support is in place. The only case that is a bit harder is new 3rd-party dependencies, for example additional Python wheels that are unavailable on ppc64le (wheels are thus in Phase 3 of this endeavor). With Go/Java/JS code we typically don't see these kinds of issues, as those ecosystems are more architecture-independent than the Python one.
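
As an aside, a quick way to check whether a dependency already ships prebuilt ppc64le wheels (package name and versions below are only examples):

```bash
# Ask pip for binary-only wheels matching a ppc64le manylinux platform tag;
# if this fails, the package would have to be built from source on Power.
pip download cryptography \
  --only-binary=:all: \
  --platform manylinux2014_ppc64le \
  --python-version 3.11 \
  --dest /tmp/ppc64le-wheels
```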

lehrig avatar Mar 28 '23 12:03 lehrig

@lehrig thanks for the detailed explanation! I agree with your proposal and rationale. So my current understanding is the following, but please tell me if I'm missing something:

  1. We can skip building multi-platform images when testing PRs, since we don't expect any issues
  2. Build for all architectures when a PR is merged, and have GHA build and publish the images
  3. Once we have native ppc64le support for GHA out-of-the-box we can migrate the workflows to use it

At the same time we can also work on caching in parallel: https://github.com/kubeflow/community/issues/779. Also, if in the future we see a lot of issues when building/pushing images across architectures, we can come back to evaluating multi-arch image builds for opened PRs.

kimwnasptd avatar Jun 04 '23 10:06 kimwnasptd

Updated the list of integrations by expanding the phases & adding some smaller images for KServe + Katib. KFP is still moving slowly as it builds in another CI system, so we will focus on KServe and Katib first.

lehrig avatar Aug 10 '23 10:08 lehrig

Let's continue this discussion in the community repo. /transfer community

andreyvelich avatar Oct 17 '24 15:10 andreyvelich

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar May 24 '25 00:05 github-actions[bot]

Our team is quite actively working on this; please keep this issue open.

lehrig avatar May 26 '25 16:05 lehrig

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Aug 25 '25 00:08 github-actions[bot]

Worked on the following dependencies and components:

  • Triton: PR raised to enable support for the Triton server with the Python backend (#8329)
  • ml-metadata: PR raised to build ml-metadata on Power with GCC-11 (#218)
  • KFP: Frontend: PR merged (#12125)

alhad-deshpande avatar Aug 25 '25 04:08 alhad-deshpande