mpi-operator

[feature request] Support elastic

Open gaocegege opened this issue 4 years ago • 11 comments

https://github.com/horovod/horovod/blob/master/docs/elastic.rst

It would be better if we supported elastic training.

gaocegege avatar Jul 07 '20 05:07 gaocegege

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/feature 0.98

Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback! Links: app homepage, dashboard and code for this bot.

issue-label-bot[bot] avatar Jul 07 '20 05:07 issue-label-bot[bot]

/cc @zw0610 @carmark

gaocegege avatar Jul 07 '20 06:07 gaocegege

That's interesting. Have any of you tried it out yet? We'll need to refactor the launcher logic and then support horovodrun, which seems pretty similar to mpirun:

horovodrun -np 8 --host-discovery-script discover_hosts.sh --slots 4 python train.py

terrytangyuan avatar Jul 07 '20 11:07 terrytangyuan

I tried it locally, but not on k8s. If we want to support it, we will need to provide discover_hosts.sh for it.
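
For illustration, here is a minimal sketch of what a host discovery script could look like, written in Python instead of shell. Horovod's elastic docs describe the script's output as one host per line, optionally in `<hostname>:<slots>` form. Everything else here is an assumption made for the example: the `HOSTFILE` environment variable, the default path, and the `hostname slots` file layout are hypothetical and not something mpi-operator provides.

```python
#!/usr/bin/env python3
"""Hypothetical stand-in for discover_hosts.sh, written in Python."""
import os


def format_hosts(hosts):
    """Render {hostname: slots} as one 'hostname:slots' line per host."""
    return "\n".join(f"{name}:{slots}" for name, slots in sorted(hosts.items()))


def read_hostfile(path):
    """Parse lines of the form 'hostname slots'.

    Blank lines and '#' comments are skipped. The file layout is an
    assumption made for this sketch.
    """
    hosts = {}
    with open(path) as f:
        for line in f:
            line = line.split("#", 1)[0].strip()
            if not line:
                continue
            name, slots = line.split()
            hosts[name] = int(slots)
    return hosts


if __name__ == "__main__":
    # HOSTFILE and the default path are assumed names, not set by mpi-operator.
    path = os.environ.get("HOSTFILE", "/etc/mpi/discover_hosts.txt")
    hosts = read_hostfile(path) if os.path.exists(path) else {}
    output = format_hosts(hosts)
    if output:
        print(output)
```

Whatever keeps the host file current (an operator, a sidecar, a mounted volume) is outside the scope of this sketch; the script only turns the current contents into the lines horovodrun reads.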

gaocegege avatar Jul 13 '20 03:07 gaocegege

A simple idea for discover_hosts.sh would be to open a server in mpi-operator that exposes the status of the corresponding MPIJob and lets pods query that status. However, I'm not sure whether this approach is common in Kubeflow or Kubernetes.
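
To make the server idea concrete, here is a rough sketch using only the Python standard library. It is not how mpi-operator works: the `JobStatus` class, the `/jobs/<name>/hosts` route, and the `discover` helper are all hypothetical names invented for this example, and a real operator would populate the status from the Kubernetes API.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.request import ProxyHandler, build_opener


class JobStatus:
    """Thread-safe in-memory map of job name -> {hostname: slots}."""

    def __init__(self):
        self._lock = threading.Lock()
        self._jobs = {}

    def set_hosts(self, job, hosts):
        with self._lock:
            self._jobs[job] = dict(hosts)

    def get_hosts(self, job):
        with self._lock:
            hosts = self._jobs.get(job)
            return None if hosts is None else dict(hosts)


def make_server(status, port=0):
    """Build an HTTP server that serves the host list of each known job."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Hypothetical route: GET /jobs/<name>/hosts
            parts = self.path.strip("/").split("/")
            hosts = None
            if len(parts) == 3 and parts[0] == "jobs" and parts[2] == "hosts":
                hosts = status.get_hosts(parts[1])
            if hosts is None:
                self.send_error(404)
                return
            body = json.dumps(hosts).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep the sketch quiet

    return ThreadingHTTPServer(("127.0.0.1", port), Handler)


def discover(base_url, job):
    """Pod side: fetch the host list and format it as 'hostname:slots' lines."""
    opener = build_opener(ProxyHandler({}))  # ignore proxies set in the environment
    with opener.open(f"{base_url}/jobs/{job}/hosts") as resp:
        hosts = json.load(resp)
    return "\n".join(f"{h}:{s}" for h, s in sorted(hosts.items()))
```

A discovery script inside a pod would then only need to print the result of `discover(...)` for its own job. The trade-off is that every pod needs network access to the operator, and the operator becomes a dependency of the training job.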

Are there any shortcuts we can exploit from StatefulSet features so that no pod-to-operator communication is needed?

I believe we discussed using a ConfigMap to store and update the status of all pods in a StatefulSet. The concern is the latency of ConfigMap updates.

zw0610 avatar Jul 14 '20 01:07 zw0610

Issue-Label Bot is automatically applying the labels:

Label Probability
area/front-end 0.72
area/operator 0.54

issue-label-bot[bot] avatar Jul 14 '20 01:07 issue-label-bot[bot]

How can I pass horovodrun's parameters, such as --host-discovery-script and --min-np, when using the mpirun command?

qifengz avatar Jan 18 '21 10:01 qifengz

Do you want to use it in mpijob or just in horovod?

gaocegege avatar Jan 19 '21 06:01 gaocegege

@gaocegege I want to use it in MPIJob. I tried it like this:

mpirun --allow-run-as-root -np 1 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python tensorflow2_mnist_elastic.py --host-discovery-script ./discover_hosts.sh --min-np 1

It did not fail, but --host-discovery-script and --min-np had no effect. BTW, does MPIJob v1/v1alpha2 support only Horovod jobs?

qifengz avatar Jan 19 '21 08:01 qifengz

#332 is working on this issue.

gaocegege avatar Mar 02 '21 10:03 gaocegege

TODO list:

  • [ ] Unit test for elastic

gaocegege avatar Mar 02 '21 10:03 gaocegege