mpi-operator
Launcher and workers should live on the same nodes
Currently, when two jobs are training across 2 GPU nodes, each launcher pod does not necessarily land on the same node as its associated workers. This makes scaling the cluster up and down quite difficult: if one job completes, only 1 GPU node is still needed, so you should be able to terminate one of the GPU nodes. However, terminating it might kill the launcher of the job that is still running.
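One possible way to express "launcher follows its workers" is a pod affinity rule on the launcher pod, keyed on the job's labels. This is only a sketch of the idea, not what mpi-operator generates today; the label keys (`training.kubeflow.org/job-name`, `training.kubeflow.org/job-role`), job name, and image below are assumptions for illustration:

```yaml
# Sketch: force the launcher onto a node that already runs a worker
# of the same job, so draining an idle GPU node cannot kill it.
# Label keys and values here are hypothetical placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: myjob-launcher
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              training.kubeflow.org/job-name: myjob
              training.kubeflow.org/job-role: worker
          topologyKey: kubernetes.io/hostname  # co-locate per node
  containers:
    - name: launcher
      image: example/launcher:latest  # placeholder image
```

With a rule like this, the autoscaler could only reclaim nodes that host neither workers nor the launcher of a still-running job, though a hard `required` affinity can also make the launcher unschedulable if no worker node has spare capacity.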
There is an open PR to change how launcher + workers are created. Would it be possible to solve this problem at the same time?
Sorry, I don't quite follow. Could you explain more? Do you mean we need gang scheduling or bin-packing scheduling to allow users to scale down the cluster?