mpi-operator
Implement v2 controller that sets up SSH for communication
Implementation for https://github.com/kubeflow/mpi-operator/blob/master/proposals/scalable-robust-operator.md
- Core features:
  - [x] Replace kubectl and support roles with SSH and secrets #375
  - [x] Add support for running as non-root #383
  - [x] Ensure retries in case of startup delays #386
  - [x] Ensure all container image examples support the SSH setup
  - [x] Support Intel MPI #389
- Production readiness:
  - [x] Upgrade k8s dependencies #370
  - [x] Add integration tests #375
  - [x] Add E2E tests #399 #403
  - [x] Add API defaulting and validation #376
  - [x] Update API to apiextensions.k8s.io/v1 for support in k8s 1.22 #379
  - [x] Use fully-qualified label names #409
  - [ ] Developer documentation #414
  - [ ] End-user documentation
  - [ ] Graduate API from v2beta1 to v2 once matured
- Controller simplifications: delegating pod management to core k8s APIs
  - [ ] Replace the plain pod workers with StatefulSet or Indexed Job
  - [x] Replace the plain pod launcher with a Job
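To illustrate the "fully-qualified label names" item, here is a minimal Go sketch of prefixing label keys with a domain. The prefix `training.kubeflow.org/`, the key names, and the helper `qualifiedLabels` are assumptions for illustration only; the keys actually used come from the common library and may differ.

```go
package main

import "fmt"

// labelPrefix is an assumed domain prefix, used here only for illustration.
const labelPrefix = "training.kubeflow.org/"

// qualifiedLabels (hypothetical helper) returns pod labels whose keys carry
// a domain prefix, so they cannot collide with labels set by other tools.
func qualifiedLabels(jobName, role string) map[string]string {
	return map[string]string{
		labelPrefix + "job-name": jobName,
		labelPrefix + "job-role": role,
	}
}

func main() {
	labels := qualifiedLabels("pi", "worker")
	fmt.Println(labels[labelPrefix+"job-name"], labels[labelPrefix+"job-role"])
}
```

Unprefixed keys such as `job-name` are easy to clash with; a domain prefix scopes them to the operator.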
Hi, could you please join the kubeflow org? See https://www.kubeflow.org/docs/about/contributing/#joining-the-kubeflow-github-org
Then we won't need to trigger CI/CD for your PRs manually.
Sent PR kubeflow/internal-acls#473
Thanks for the suggestion
I verified that the images docker.io/kubeflow/mpi-horovod-mnist and docker.io/mpioperator/tensorflow-benchmarks work with the new controller as-is. Marking that item as done.
@alculquicondor Has the community discussed the tradeoffs of a Job vs. a plain pod for the launcher, and of StatefulSets vs. plain pods for the workers?
Yes for the launcher; see the discussion in #386.
For the workers, it's still open for discussion. We could use StatefulSets, but I think plain pods are fine for now. We might migrate to Indexed Jobs at some point, but since they are only available starting in k8s 1.22, it's too early to discuss.
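Whichever option is chosen, the launcher needs stable, predictable worker hostnames. A rough Go sketch of generating an Open MPI style hostfile from a replica count follows; the `<job>-worker-<index>.<job>-worker.<namespace>.svc` naming scheme and the helper `workerHostfile` are assumptions for illustration, not necessarily what the controller emits.

```go
package main

import (
	"fmt"
	"strings"
)

// workerHostfile (hypothetical helper) builds one "host slots=N" line per
// worker. The hostname pattern is an assumed naming scheme: a pod named
// <job>-worker-<i> behind a headless service named <job>-worker.
func workerHostfile(job, namespace string, replicas, slots int) string {
	var b strings.Builder
	for i := 0; i < replicas; i++ {
		fmt.Fprintf(&b, "%s-worker-%d.%s-worker.%s.svc slots=%d\n",
			job, i, job, namespace, slots)
	}
	return b.String()
}

func main() {
	fmt.Print(workerHostfile("pi", "default", 2, 1))
}
```

With plain pods the controller has to assign these indexed names itself; a StatefulSet or Indexed Job would provide the per-pod index natively.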
I think this is pretty much ready. The last things I would like to do are:
- Update the labels to use fully-qualified names. For that, I'm waiting for a release of the common library kubeflow/common#153
- Add documentation (is there a website, or should I just put it in the READMEs?)
> Add documentation (is there a website, or should I just do it on readmes)?
There's this page https://www.kubeflow.org/docs/components/training/mpi/
Maybe we can introduce Indexed Job to mpi-operator v2 once https://github.com/kubernetes/enhancements/issues/3715 graduates to beta.
Consider introducing JobSet instead of managing raw pods for the workers: https://github.com/kubernetes-sigs/jobset