
Umbrella Issue: Porting Kubeflow to IBM Power (ppc64le)

Status: Open • lehrig opened this issue 2 years ago • 19 comments

/kind feature

Enable builds & releases for IBM Power (ppc64le architecture). This proposal was presented, with slides, at the 2022-10-25 Kubeflow community call and received positive community feedback. We also created a design document: https://docs.google.com/document/d/1nGUvLonahoLogfWCHsoUOZl-s77YtPEiCjWBVlZjJHo/edit?usp=sharing

Why you need this feature:

  • Widen scope of possible on-premises deployments (vanilla Kubernetes & OpenShift on Power)
  • Greater independence from any single processor architecture (x86, ppc64le, arm, …)
  • Unified container builds

Describe the solution you'd like:

  • Upstream the changes needed to build the Dockerfiles for multiple architectures (starting with x86 & ppc64le)
  • Upstream CI integration for multi-arch builds (starting with x86 & ppc64le)
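
To make the multi-arch build idea concrete, here is a minimal sketch of how a CI step might assemble a `docker buildx build` invocation for both target platforms. It is not taken from any of the linked PRs: the helper name `buildx_command`, the image name `example.org/demo/component:dev`, and the default Dockerfile path are hypothetical, and actual component builds may be wired differently (for example through Makefiles or GitHub Actions).

```python
import shlex


def buildx_command(image, platforms, context=".", dockerfile="Dockerfile", push=False):
    """Assemble a `docker buildx build` command for several target platforms.

    Hypothetical helper for illustration; it only builds the argument list
    and does not invoke Docker.
    """
    if not platforms:
        raise ValueError("at least one target platform is required")
    cmd = [
        "docker", "buildx", "build",
        # buildx accepts a comma-separated list of os/arch pairs
        "--platform", ",".join(platforms),
        "--file", dockerfile,
        "--tag", image,
    ]
    if push:
        # multi-platform results are usually pushed to a registry
        cmd.append("--push")
    cmd.append(context)
    return cmd


if __name__ == "__main__":
    # Image name is a placeholder, not a real Kubeflow image.
    cmd = buildx_command(
        "example.org/demo/component:dev",
        ["linux/amd64", "linux/ppc64le"],
    )
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where Docker Buildx and, for non-native targets, emulation such as QEMU are set up. Whether a given component builds cleanly for ppc64le still depends on its base images and dependencies, which is what the phases below track.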

We currently plan to divide our efforts into multiple phases:

  1. low-hanging "easy" integrations where no or minor code changes are needed; excluding KFP; Kubeflow 1.7 release scope (✅ done),
  2. same as 1. but now including additional KServe components for model serving; Kubeflow 1.8 release scope,
  3. same as 1. but now including KFP; Kubeflow 1.9 release scope,
  4. more complex integrations where external dependencies on Python wheels exist.

Below is a detailed overview of each required integration, including links to associated PRs if those already exist.

Phase 1 Integrations (Kubeflow 1.7 scope)

  • [x] Poddefaults (Admission) Webhook: https://github.com/kubeflow/kubeflow/pull/6650, https://github.com/kubeflow/kubeflow/pull/6803 🚀 https://hub.docker.com/r/kubeflownotebookswg/poddefaults-webhook/tags
  • [x] Central Dashboard: https://github.com/kubeflow/kubeflow/pull/6861, https://github.com/kubeflow/kubeflow/pull/6923 🚀 https://hub.docker.com/r/kubeflownotebookswg/centraldashboard/tags
  • [x] Jupyter Web App: https://github.com/kubeflow/kubeflow/pull/6650, https://github.com/kubeflow/kubeflow/pull/6800 🚀 https://hub.docker.com/r/kubeflownotebookswg/jupyter-web-app/tags
  • [x] KServe: Agent: https://github.com/kserve/kserve/pull/2476, https://github.com/kserve/kserve/pull/2549 🚀 https://hub.docker.com/r/kserve/agent/tags
  • [x] KServe: Controller: https://github.com/kserve/kserve/pull/2476, https://github.com/kserve/kserve/pull/2550 🚀 https://hub.docker.com/r/kserve/kserve-controller/tags
  • [x] KServe: Models Web App: https://github.com/kserve/models-web-app/pull/45, https://github.com/kserve/models-web-app/pull/55 🚀 https://hub.docker.com/r/kserve/models-web-app/tags
  • [x] KServe: QPExt: https://github.com/kserve/kserve/pull/2604 🚀 https://hub.docker.com/r/kserve/qpext/tags
  • [x] KServe: Router: https://github.com/kserve/kserve/pull/2605 🚀 https://hub.docker.com/r/kserve/router/tags
  • [x] MPI Operator: https://github.com/kubeflow/mpi-operator/pull/489 🚀 https://hub.docker.com/r/mpioperator/mpi-operator/tags
  • [x] Notebook Controller: https://github.com/kubeflow/kubeflow/pull/6650, https://github.com/kubeflow/kubeflow/pull/6771 🚀 https://hub.docker.com/r/kubeflownotebookswg/notebook-controller/tags
  • [x] Profiles + KFAM: https://github.com/kubeflow/kubeflow/pull/6650, https://github.com/kubeflow/kubeflow/pull/6785, https://github.com/kubeflow/kubeflow/pull/6809 🚀 https://hub.docker.com/r/kubeflownotebookswg/profile-controller/tags 🚀 https://hub.docker.com/r/kubeflownotebookswg/kfam/tags
  • [x] Tensorboard Controller: https://github.com/kubeflow/kubeflow/pull/6650, https://github.com/kubeflow/kubeflow/pull/6805 🚀 https://hub.docker.com/r/kubeflownotebookswg/notebook-controller/tags
  • [x] Tensorboard Web App: https://github.com/kubeflow/kubeflow/pull/6650, https://github.com/kubeflow/kubeflow/pull/6810 🚀 https://hub.docker.com/r/kubeflownotebookswg/tensorboards-web-app/tags
  • [x] Training Operator: https://github.com/kubeflow/training-operator/pull/1674, https://github.com/kubeflow/training-operator/pull/1692 🚀 https://hub.docker.com/r/kubeflow/training-operator/tags
  • [x] Volumes Web App: https://github.com/kubeflow/kubeflow/pull/6650, https://github.com/kubeflow/kubeflow/pull/6811 🚀 https://hub.docker.com/r/kubeflownotebookswg/volumes-web-app/tags

Phase 2 Integrations (Kubeflow 1.9 scope)

  • [ ] KServe: PMML Server
  • [ ] KServe: AIX
  • [ ] KServe: Alibi
  • [ ] KServe: Art
  • [ ] Triton Inference Server (external)
  • [ ] Seldon: ML Server (external)
  • [ ] PyTorch: TorchServe (external)

Phase 3 Integrations (Kubeflow 1.10 scope)

Note: KFP is currently blocked by https://github.com/kubeflow/pipelines/issues/8660 / https://github.com/GoogleCloudPlatform/oss-test-infra/issues/1972

  • [ ] KFP: Application-CRD-Controller
  • [ ] KFP: Argoexec
  • [ ] KFP: Cache-Server
  • [ ] KFP: Frontend
  • [ ] KFP: Metadata Envoy
  • [ ] KFP: Persistence Agent
  • [ ] KFP: Scheduled Workflow
  • [ ] KFP: Workflow Controller
  • [ ] KFP: Viewer-CRD-Controller
  • [ ] KServe: LGB Server: blocked by https://github.com/pyca/cryptography/issues/7723
  • [ ] KServe: Paddle Server: blocked by https://github.com/pyca/cryptography/issues/7723
  • [ ] KServe: SKLearn Server: blocked by https://github.com/pyca/cryptography/issues/7723
  • [ ] KServe: XGB Server: blocked by https://github.com/pyca/cryptography/issues/7723
  • [ ] Katib: controller, db-manager, ui
  • [ ] Katib: file-metrics-collector
  • [ ] Katib: tfevent-metrics-collector
  • [ ] Katib: suggestion-hyperopt: https://github.com/kubeflow/katib/pull/2262, https://github.com/kubeflow/katib/pull/2290
  • [ ] Katib: suggestion-chocolate
  • [ ] Katib: suggestion-hyperband: https://github.com/kubeflow/katib/pull/2262, https://github.com/kubeflow/katib/pull/2290
  • [ ] Katib: suggestion-skopt: https://github.com/kubeflow/katib/pull/2262, https://github.com/kubeflow/katib/pull/2290
  • [ ] Katib: suggestion-goptuna
  • [ ] Katib: suggestion-optuna: https://github.com/kubeflow/katib/pull/2262, https://github.com/kubeflow/katib/pull/2290
  • [ ] Katib: suggestion-enas
  • [ ] Katib: suggestion-darts: https://github.com/kubeflow/katib/pull/2262, https://github.com/kubeflow/katib/pull/2290
  • [ ] Katib: suggestion-pbt: https://github.com/kubeflow/katib/pull/2262, https://github.com/kubeflow/katib/pull/2290
  • [ ] Katib: earlystopping-medianstop: https://github.com/kubeflow/katib/pull/2290

Phase 4 Integrations (Post Kubeflow 1.11 scope)

  • [ ] KFP: Api Server
  • [ ] KFP: Metadata Writer
  • [ ] KFP: Visualization Server
  • [ ] ml-metadata (KFP wheel dep.): https://github.com/google/ml-metadata/pull/171
  • [ ] KServe: Storage Initializer: blocked by https://github.com/pyca/cryptography/issues/7723
  • [ ] ~~OIDC Auth (external): https://github.com/arrikto/oidc-authservice/issues/104; on-hold as potentially irrelevant as of Kubeflow v1.8 (https://github.com/kubeflow/manifests/issues/2469)~~

lehrig, Oct 25 '22 18:10