mpi-operator
[feature request] Support elastic
https://github.com/horovod/horovod/blob/master/docs/elastic.rst
It would be good if mpi-operator supported Horovod elastic training.
Issue-Label Bot is automatically applying the labels:
Label | Probability |
---|---|
kind/feature | 0.98 |
Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback! Links: app homepage, dashboard and code for this bot.
/cc @zw0610 @carmark
That's interesting. Have any of you tried it out yet? We'll need to refactor the launcher logic and then support horovodrun, which seems pretty similar to mpirun:
horovodrun -np 8 --host-discovery-script discover_hosts.sh --slots 4 python train.py
I tried it locally, but not on k8s. If we want to support it, we need to provide discover_hosts.sh.
A simple idea for discover_hosts.sh would be to run a server in mpi-operator that exposes the status of the corresponding MPIJob and lets pods query that status. However, I'm not sure whether this pattern is common in Kubeflow or Kubernetes.
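To make the idea concrete, here is a rough sketch of what such a discover_hosts.sh could look like. Everything about the endpoint is an assumption: the `STATUS_URL` variable, the plain-text response format (`<pod-hostname> <phase>` per line) and the `fetch_status` helper are hypothetical and do not correspond to anything mpi-operator provides. The only fixed part is the output, since Horovod expects the discovery script to print one `hostname:slots` entry per line.

```shell
#!/bin/sh
# Sketch only: assumes a hypothetical mpi-operator status endpoint that
# returns plain-text lines of the form "<pod-hostname> <phase>".

# Fetch the raw status text. Kept as its own function so the transport
# (curl here) can be swapped out.
fetch_status() {
    curl -fsS --max-time 5 "$1"
}

# Print "<host>:<slots>" for every pod reported as Running.
discover_hosts() {
    url=$1
    slots=$2
    fetch_status "$url" |
        awk -v slots="$slots" '$2 == "Running" { print $1 ":" slots }'
}

# Only query when an endpoint is configured (STATUS_URL is an assumed name).
if [ -n "${STATUS_URL:-}" ]; then
    discover_hosts "$STATUS_URL" "${SLOTS:-4}"
fi
```

horovodrun calls the discovery script repeatedly, so the script should stay cheap and return quickly; that is why the sketch sets a short curl timeout.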
Are there any shortcuts we can exploit from the StatefulSet features so no pod-operator communication is needed?
I believe we discussed using a ConfigMap to store and update the status of all pods in a StatefulSet. The concern is the latency of ConfigMap updates.
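For comparison, a ConfigMap-based variant could be as small as the sketch below. It assumes the operator writes the ready worker hostnames, one per line, into a ConfigMap that is mounted into the launcher pod as a file. The path `/etc/mpi/discover_hosts` and the `HOSTS_FILE` and `SLOTS` variables are made up for illustration. As far as I know, mounted ConfigMaps are refreshed by the kubelet periodically rather than instantly, which is where the latency concern would show up.

```shell
#!/bin/sh
# Sketch only: reads worker hostnames from a file that is assumed to be a
# mounted ConfigMap key maintained by the operator (hypothetical path).

# Print "<host>:<slots>" for each non-empty, non-comment line in the file.
discover_hosts() {
    hosts_file=$1
    slots=$2
    # No file yet means no hosts are known; print nothing and succeed.
    [ -r "$hosts_file" ] || return 0
    while IFS= read -r host || [ -n "$host" ]; do
        case "$host" in
            ''|'#'*) continue ;;
        esac
        printf '%s:%s\n' "$host" "$slots"
    done < "$hosts_file"
}

discover_hosts "${HOSTS_FILE:-/etc/mpi/discover_hosts}" "${SLOTS:-4}"
```

This needs no pod-to-operator communication at all, at the cost of the hosts list lagging behind the real pod state by however long the volume takes to sync.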
Issue-Label Bot is automatically applying the labels:
Label | Probability |
---|---|
area/front-end | 0.72 |
area/operator | 0.54 |
How can I pass horovodrun's parameters, such as --host-discovery-script and --min-np, when using the mpirun command?
Do you want to use it in mpijob or just in horovod?
@gaocegege I want to use it in MPIJob. I tried this:
mpirun --allow-run-as-root -np 1 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python tensorflow2_mnist_elastic.py --host-discovery-script ./discover_hosts.sh --min-np 1
It did not fail, but --host-discovery-script and --min-np had no effect. BTW, does MPIJob v1/v1alpha2 support only Horovod jobs?
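A likely explanation, going by the command above: the two flags come after the script name, so mpirun passes them to tensorflow2_mnist_elastic.py as its own arguments, and nothing in the launch path interprets them. They are horovodrun options and have to appear before the training command. The sketch below only prints the command line it would run; `build_elastic_cmd` is a hypothetical helper, and whether horovodrun can be used from an MPIJob launcher at all is not something this thread establishes.

```shell
#!/bin/sh
# Sketch only: assemble an elastic horovodrun command line and print it.
# Launcher flags go first; everything after them is the training command.
build_elastic_cmd() {
    np=$1
    min_np=$2
    discovery=$3
    shift 3
    echo "horovodrun -np $np --min-np $min_np --host-discovery-script $discovery $*"
}

build_elastic_cmd 1 1 ./discover_hosts.sh python tensorflow2_mnist_elastic.py
```

Running it prints `horovodrun -np 1 --min-np 1 --host-discovery-script ./discover_hosts.sh python tensorflow2_mnist_elastic.py`, which is the ordering horovodrun expects.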
#332 is working on this issue.
TODO list:
- [ ] Unit test for elastic