Umbrella Issue: Porting Kubeflow to IBM Power (ppc64le)
/kind feature
Enable builds & releases for IBM Power (ppc64le architecture). This proposal was presented with these slides at the 2022-10-25 Kubeflow community call with positive community feedback. We also created this design documentation: https://docs.google.com/document/d/1nGUvLonahoLogfWCHsoUOZl-s77YtPEiCjWBVlZjJHo/edit?usp=sharing
Why you need this feature:
- Widen the scope of possible on-premises deployments (vanilla Kubernetes & OpenShift on Power)
- Greater independence from the underlying processor architecture (x86, ppc64le, arm, …)
- Unified container builds
Describe the solution you'd like:
- Upstreaming changes that allow Dockerfiles to be built for multiple architectures (starting with x86 & ppc64le)
- Upstreaming CI integration for multi-arch builds (starting with x86 & ppc64le)
We currently plan to divide our efforts into multiple phases:
1. Low-hanging "easy" integrations where no or only minor code changes are needed, excluding KFP; Kubeflow 1.7 release scope (✅ done)
2. Same as 1., but now including additional KServe components for model serving; Kubeflow 1.8 release scope
3. Same as 1., but now including KFP; Kubeflow 1.9 release scope
4. More complex integrations where external dependencies on Python wheels exist
Below is a detailed overview of each required integration, including links to associated PRs if those already exist.
Phase 1 Integrations (Kubeflow 1.7 scope)
- [x] Poddefaults (Admission) Webhook: https://github.com/kubeflow/kubeflow/pull/6650, https://github.com/kubeflow/kubeflow/pull/6803 🚀 https://hub.docker.com/r/kubeflownotebookswg/poddefaults-webhook/tags
- [x] Central Dashboard: https://github.com/kubeflow/kubeflow/pull/6861, https://github.com/kubeflow/kubeflow/pull/6923 🚀 https://hub.docker.com/r/kubeflownotebookswg/centraldashboard/tags
- [x] Jupyter Web App: https://github.com/kubeflow/kubeflow/pull/6650, https://github.com/kubeflow/kubeflow/pull/6800 🚀 https://hub.docker.com/r/kubeflownotebookswg/jupyter-web-app/tags
- [x] KServe: Agent: https://github.com/kserve/kserve/pull/2476, https://github.com/kserve/kserve/pull/2549 🚀 https://hub.docker.com/r/kserve/agent/tags
- [x] KServe: Controller: https://github.com/kserve/kserve/pull/2476, https://github.com/kserve/kserve/pull/2550 🚀 https://hub.docker.com/r/kserve/kserve-controller/tags
- [x] KServe: Models Web App: https://github.com/kserve/models-web-app/pull/45, https://github.com/kserve/models-web-app/pull/55 🚀 https://hub.docker.com/r/kserve/models-web-app/tags
- [x] KServe: QPExt: https://github.com/kserve/kserve/pull/2604 🚀 https://hub.docker.com/r/kserve/qpext/tags
- [x] KServe: Router: https://github.com/kserve/kserve/pull/2605 🚀 https://hub.docker.com/r/kserve/router/tags
- [x] MPI Operator: https://github.com/kubeflow/mpi-operator/pull/489 🚀 https://hub.docker.com/r/mpioperator/mpi-operator/tags
- [x] Notebook Controller: https://github.com/kubeflow/kubeflow/pull/6650, https://github.com/kubeflow/kubeflow/pull/6771 🚀 https://hub.docker.com/r/kubeflownotebookswg/notebook-controller/tags
- [x] Profiles + KFAM: https://github.com/kubeflow/kubeflow/pull/6650, https://github.com/kubeflow/kubeflow/pull/6785, https://github.com/kubeflow/kubeflow/pull/6809 🚀 https://hub.docker.com/r/kubeflownotebookswg/profile-controller/tags 🚀 https://hub.docker.com/r/kubeflownotebookswg/kfam/tags
- [x] Tensorboard Controller: https://github.com/kubeflow/kubeflow/pull/6650, https://github.com/kubeflow/kubeflow/pull/6805 🚀 https://hub.docker.com/r/kubeflownotebookswg/notebook-controller/tags
- [x] Tensorboard Web App: https://github.com/kubeflow/kubeflow/pull/6650, https://github.com/kubeflow/kubeflow/pull/6810 🚀 https://hub.docker.com/r/kubeflownotebookswg/tensorboards-web-app/tags
- [x] Training Operator: https://github.com/kubeflow/training-operator/pull/1674, https://github.com/kubeflow/training-operator/pull/1692 🚀 https://hub.docker.com/r/kubeflow/training-operator/tags
- [x] Volumes Web App: https://github.com/kubeflow/kubeflow/pull/6650, https://github.com/kubeflow/kubeflow/pull/6811 🚀 https://hub.docker.com/r/kubeflownotebookswg/volumes-web-app/tags
Phase 2 Integrations (Kubeflow 1.9 scope)
- [ ] KServe: PMML Server
- [ ] KServe: AIX
- [ ] KServe: Alibi
- [ ] KServe: Art
- [ ] Triton Inference Server (external)
- [ ] Seldon: ML Server (external)
- [ ] PyTorch: TorchServe (external)
Phase 3 Integrations (Kubeflow 1.10 scope)
Note: KFP is currently blocked by https://github.com/kubeflow/pipelines/issues/8660 / https://github.com/GoogleCloudPlatform/oss-test-infra/issues/1972
- [ ] KFP: Application-CRD-Controller
- [ ] KFP: Argoexec
- [ ] KFP: Cache-Server
- [ ] KFP: Frontend
- [ ] KFP: Metadata Envoy
- [ ] KFP: Persistence Agent
- [ ] KFP: Scheduled Workflow
- [ ] KFP: Workflow Controller
- [ ] KFP: Viewer-CRD-Controller
- [ ] KServe: LGB Server: blocked by https://github.com/pyca/cryptography/issues/7723
- [ ] KServe: Paddle Server: blocked by https://github.com/pyca/cryptography/issues/7723
- [ ] KServe: SKLearn Server: blocked by https://github.com/pyca/cryptography/issues/7723
- [ ] KServe: XGB Server: blocked by https://github.com/pyca/cryptography/issues/7723
- [ ] Katib: controller, db-manager, ui
- [ ] Katib: file-metrics-collector
- [ ] Katib: tfevent-metrics-collector
- [ ] Katib: suggestion-hyperopt: https://github.com/kubeflow/katib/pull/2262, https://github.com/kubeflow/katib/pull/2290
- [ ] Katib: suggestion-chocolate
- [ ] Katib: suggestion-hyperband: https://github.com/kubeflow/katib/pull/2262, https://github.com/kubeflow/katib/pull/2290
- [ ] Katib: suggestion-skopt: https://github.com/kubeflow/katib/pull/2262, https://github.com/kubeflow/katib/pull/2290
- [ ] Katib: suggestion-goptuna
- [ ] Katib: suggestion-optuna: https://github.com/kubeflow/katib/pull/2262, https://github.com/kubeflow/katib/pull/2290
- [ ] Katib: suggestion-enas
- [ ] Katib: suggestion-darts: https://github.com/kubeflow/katib/pull/2262, https://github.com/kubeflow/katib/pull/2290
- [ ] Katib: suggestion-pbt: https://github.com/kubeflow/katib/pull/2262, https://github.com/kubeflow/katib/pull/2290
- [ ] Katib: earlystopping-medianstop: https://github.com/kubeflow/katib/pull/2290
Phase 4 Integrations (Post Kubeflow 1.11 scope)
- [ ] KFP: Api Server
- [ ] KFP: Metadata Writer
- [ ] KFP: Visualization Server
- [ ] ml-metadata (KFP wheel dep.): https://github.com/google/ml-metadata/pull/171
- [ ] KServe: Storage Initializer: blocked by https://github.com/pyca/cryptography/issues/7723
- [ ] ~~OIDC Auth (external): https://github.com/arrikto/oidc-authservice/issues/104; on-hold as potentially irrelevant as of Kubeflow v1.8 (https://github.com/kubeflow/manifests/issues/2469)~~
Thanks for creating this tracking issue @lehrig!
I'm onboard with adding support for ppc64le, since this will greatly help KF adoption. The proposed plan makes sense.
My initial question at this time is whether we need to build different executables for this platform, which means we need a new set of images. I see in the PRs that the only needed change is to actually not set a specific platform, but I might be missing something.
Could you provide some more context on this one?
@kimwnasptd, thanks for your support!
There are essentially 2 options for publishing images:
- Multi-arch images, where we publish only one "virtual" image with support for multiple architectures. A pull command will then only fetch the concrete container image for the required platform. To do so, I'd recommend using buildx (e.g., see https://www.docker.com/blog/multi-arch-build-and-images-the-simple-way/) because it is easier to use / more automated compared to manually creating a container manifest file for multiple architectures.
- Separate images per architecture.
IMO 1. should be the preferred solution. A challenge here will be that builds across all Kubeflow components are quite inconsistent. For example, some projects already use buildx while others don't. I'd opt for implementing more consistency in the scope of this endeavor, e.g., by migrating builds towards buildx where feasible. My team would be willing to drive this, if this sounds good.
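For reference, a minimal buildx flow could look like the following sketch (the image name and tag are placeholders; the concrete workflow integration would differ per component):

```bash
# Register QEMU emulators so non-native platforms (e.g., ppc64le) can be built on an amd64 host.
docker run --privileged --rm tonistiigi/binfmt --install all

# Create and select a builder instance that supports multi-platform builds.
docker buildx create --name multiarch-builder --use

# Build for both architectures and push a single multi-arch image (manifest list).
docker buildx build \
  --platform linux/amd64,linux/ppc64le \
  --tag example.org/kubeflow/some-component:latest \
  --push .
```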
@lehrig I agree with the first approach as well, if it's viable to avoid having multiple manifests.
Docker's buildx seems promising. I hadn't used it in the past, but it seems quite straightforward. I don't have a hard preference on using buildx as long as we don't lock ourselves in and end up with Dockerfiles that need specific Docker features and can only be built with docker.
As requested by @kimwnasptd, quoting myself from https://github.com/kubeflow/kubeflow/pull/6650#discussion_r1029586906 to clarify how we envision multi-arch builds:
Yes, it's good to let Go determine the arch, so we don't have to maintain an explicit arch list here or wrap boilerplate code in arch-specific if/else statements.
Instead, we shift control of which arch is actually built to the build system. If nothing special is done, the arch of the build machine is simply used (and as Kubeflow is currently built on amd64, we stay backwards-compatible with the current behavior).
In further PRs, we will additionally modify Docker-based builds to use buildx, where you can, for instance, do something like this:
`docker buildx build --platform linux/amd64,linux/ppc64le ...` Here, Docker will actually run 2 builds: one for amd64 and one for ppc64le. For the Go code above, Go will acknowledge the external platform configuration and build correctly. In case no native hardware is available for the given platform, Docker will emulate the architecture using QEMU, so you can also build for different archs on amd64.
The final outcome is a single multi-arch image with support for all archs listed in Docker's platform statement.
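To illustrate the idea of letting the build system pick the architecture, here is a hedged sketch of the corresponding Go cross-compilation (the package path and output name are placeholders; with buildx these values typically come from the automatic TARGETOS/TARGETARCH build args rather than being hard-coded):

```bash
# Cross-compile a Go binary for ppc64le from an amd64 machine.
# GOOS/GOARCH override the defaults that Go would otherwise take from the build host.
GOOS=linux GOARCH=ppc64le CGO_ENABLED=0 go build -o bin/manager ./cmd/manager

# With no overrides set, Go falls back to the host architecture (amd64 today),
# which keeps the current single-arch behavior intact.
go build -o bin/manager ./cmd/manager
```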
@lehrig @adilhusain-s The first image in this repo with support for ppc64le is up! 🎉
https://hub.docker.com/layers/kubeflownotebookswg/notebook-controller/80f695e/images/sha256-2870219816f6be1153ca97eb604b4f20393c34afdb4eade83f0966ccf90f8018?context=explore
@lehrig @pranavpandit1 @adilhusain-s I realized that right now we've implemented the logic for using docker buildx only for the Actions that run when a PR is merged, and not when a PR is opened.
Realized that the Centraldashboard image was not getting built, even though the PR checks were green: 28a24ffb170769a228d46a19892f7420b22a0816 74f020e0d9c3f58712a3b466f9d1bb86c4607beb 65e41bf28b8e79be4e1f822afe56e218c69db8a1
We fixed the issue for this in https://github.com/kubeflow/kubeflow/pull/6960, but we should be able to catch errors for the multi-arch build when a PR is opened as well.
The fix should be straightforward. We'll just need to use the same build command in both types of actions. Referencing the relevant parts in one: https://github.com/kubeflow/kubeflow/blob/master/.github/workflows/centraldb_intergration_test.yaml#L24 https://github.com/kubeflow/kubeflow/blob/master/.github/workflows/centraldb_docker_publish.yaml#L37-L41
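As a rough illustration only (the image variables are placeholders, not the actual workflow contents), the shared command could look like the sketch below, with the publish workflow merely adding `--push`:

```bash
# PR check: build the multi-arch image without pushing, so build errors surface early.
docker buildx build --platform linux/amd64,linux/ppc64le --tag "${IMG}:${TAG}" .

# Post-merge publish: same command, additionally pushing the multi-arch manifest.
docker buildx build --platform linux/amd64,linux/ppc64le --tag "${IMG}:${TAG}" --push .
```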
Do you have cycles to help with this effort?
@kimwnasptd let me confirm with the team but I think we can handle it
@kimwnasptd: Thanks for all the inputs, we have started looking into the required changes and will keep everyone updated once we start raising PRs for the same.
Note: I updated the main description by adding a phase for the Kubeflow 1.8 scope and linking to this new design document: https://docs.google.com/document/d/1nGUvLonahoLogfWCHsoUOZl-s77YtPEiCjWBVlZjJHo/edit?usp=sharing
We use this document to discuss Phase 2 with the KFP community (related to https://github.com/kubeflow/pipelines/issues/8660).
Thanks @pranavpandit1! I also took a look at how to get these to work when fixing the workflows for the CentralDashboard. You can take a look at this PR and some comments: https://github.com/kubeflow/kubeflow/pull/6961
@lehrig I think we bumped into a side-effect that I hadn't thought about initially. Building the images in the GH Actions (which use QEMU emulation) is actually slow.
Looking at an open PR https://github.com/kubeflow/kubeflow/pull/7060 that touches some web apps I see the following:
- The workflow that builds for both platforms (VWA) takes 61 minutes (!)
- The workflow that builds the old way (TWA) takes 9 minutes

The difference is huge, so I want to re-evaluate the approach of building for ppc64le when PRs are opened.
From your experience, in which cases is it most probable for the x86 build to succeed but the ppc64le build to fail?
@kimwnasptd yeah, I agree that this is suboptimal. The answer is obviously "it depends"; however, I think we have some hard evidence here that we should not proceed as originally planned. I see the following options for builds when a PR is opened:
- Exclude non-x86 archs as long as native hardware is unavailable (example: https://github.com/DSpace/dspace-angular/pull/1667).
- Exclude non-x86 archs only if building them takes too long.
- Wait for native ppc64le out-of-the-box support for GHA, which hopefully comes this year (this will not slow down builds, as no emulation is used).
- Integrate a SSH-based connection to native hardware we can provide into the workflow (see this example: https://github.com/adilhusain-s/multi-arch-docker-example/blob/main/.github/workflows/native_docker_builder.yaml#L31).
- Integrate a GitHub app that connects GHA builds to native hardware when needed (experimental).
- (not sure this is technically possible) Start the ppc64le QEMU build asynchronously & don't let the PR wait for its completion, so it doesn't block.
Note: Options 1, 2 & 6 are based on my observation that ppc64le typically builds without errors whenever x86 builds without errors. Hence, we can typically accept PRs when only x86 has been built; rare corner cases are then discovered on PR merge.
If exclusion (options 1 or 2) is OK, I'd go for that & later migrate to option 3 once native ppc64le GHA support becomes available later this year. Option 4 is possible but would require additional effort and organization on our side, so I see it only as a backup option; the same goes for option 5. Option 6 has not been tested thus far, so I would not go for it at the moment.
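For option 2, a hedged sketch of how the platform list could be switched (variable names are made up; in GitHub Actions this would typically key off `github.event_name`):

```bash
# Pick the platform list based on the triggering event.
if [ "${EVENT_NAME}" = "push" ]; then
  PLATFORMS="linux/amd64,linux/ppc64le"   # on merge: build all supported archs
else
  PLATFORMS="linux/amd64"                 # on PR open: x86 only, to keep checks fast
fi

docker buildx build --platform "${PLATFORMS}" --tag "${IMG}:${TAG}" .
```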
Here are some stats that help to get a feeling for those options:
| Build | Native ppc64le | QEMU ppc64le | Native x86 |
| --- | --- | --- | --- |
| Notebook-controller | 2.2 min | 12 min 47 sec | 2 min 6 sec |
| Volume-web-app | 8.21 min | 36 min 35 sec | 6 min 56 sec |
| Central-dashboard | 3.22 min | 10 min 26 sec | 1 min 55 sec |
I discussed the options with the team. Here is our proposal:
- On PR opened, we recommend option 2: build ppc64le only where builds don't take too long, and otherwise disable ppc64le builds as we showcase in https://github.com/DSpace/dspace-angular/pull/1667
- Looking at the stats above, we believe "not too long" holds for all builds <= 30 min.
- As soon as option 3 becomes available, migrate all workflows to it: run everything natively and enable it for all opened PRs.
- On PR merged, we recommend always building all supported architectures.
- We also recommend generally improving build performance by enabling caching during builds, which should lower build times by 30-40%: https://github.com/kubeflow/community/issues/779. If that is enabled, we will get more components under the 30 min. threshold (see the sketch after this list).
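A possible shape of such a cached build, assuming BuildKit's GitHub Actions cache backend is used (this backend choice is an assumption; the exact setup is tracked in the issue above):

```bash
# Reuse layers from previous runs via BuildKit's GitHub Actions cache backend.
docker buildx build \
  --platform linux/amd64,linux/ppc64le \
  --cache-from type=gha \
  --cache-to type=gha,mode=max \
  --tag "${IMG}:${TAG}" \
  --push .
```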
@kimwnasptd, does that sound good? What do you think about the 30 min. threshold?
Still have to answer this:
> From your experience, in which cases is it most probable for the x86 build to succeed but the ppc64le build to fail?
This seldom happens once ppc64le support is in place. The only case that is a bit harder is new 3rd-party dependencies, for example additional Python wheels that are unavailable on ppc64le (wheels are therefore in Phase 3 of this endeavor). With Go/Java/JS code we typically don't see these kinds of issues, as those ecosystems are more architecture-independent than the Python ecosystem.
@lehrig thanks for the detailed explanation! I agree with your proposal and rationale. My current understanding is the following, but please tell me if I'm missing something:
- We can skip building multi-platform images when testing PRs, since we don't expect any issues
- Build for all architectures when a PR is merged, and have GHA build and publish the images
- Once we have native ppc64le support for GHA out-of-the-box we can migrate the workflows to use it
At the same time we can also work on caching in parallel: https://github.com/kubeflow/community/issues/779. Also, if in the future we see that there are a lot of issues when building/pushing the images across architectures, we can come back and re-evaluate building multi-arch images for opened PRs.
Updated list of integrations by expanding phases & adding some smaller images for KServe + Katib. KFP is still moving slowly as it builds in another CI system, so we will first focus on KServe and Katib more.
Let's continue this discussion in the community repo. /transfer community
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Our team is quite actively working on this; please keep open.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Worked on the following dependencies and components:
- Triton: PR raised to enable support for the Triton server with Python backend (8329)
- ml-metadata: PR raised to build ml-metadata on Power with GCC-11 (218)
- KFP: Frontend: PR merged (12125)