mpi-operator

Launcher and workers should live on the same nodes

Open: armandmcqueen opened this issue 4 years ago • 1 comment

Currently, when you train two jobs on two GPU nodes, a launcher pod will not necessarily live on the same node as its associated workers. This makes scaling the cluster up and down quite difficult. If one job completes, you only need one GPU node, so you should be able to terminate the other. However, doing so might kill the launcher of the job that is still running.
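
To make the scale-down problem concrete, here is a minimal Python sketch. It does not use the mpi-operator or Kubernetes APIs. The node names, job names, and the `removable_nodes` helper are all hypothetical, chosen only to show which nodes can be terminated once one job has finished.

```python
from typing import Dict, List, Set


def removable_nodes(
    placements: Dict[str, Dict[str, str]],
    finished: Set[str],
    nodes: List[str],
) -> List[str]:
    """Return the nodes that host no pods of any still-running job.

    placements maps job name -> {pod name -> node name} (hypothetical layout).
    """
    busy = {
        node
        for job, pods in placements.items()
        if job not in finished
        for node in pods.values()
    }
    return [n for n in nodes if n not in busy]


nodes = ["gpu-node-1", "gpu-node-2"]

# Launchers scheduled independently of their workers (the situation described above).
scattered = {
    "job-a": {"launcher": "gpu-node-2", "worker-0": "gpu-node-1"},
    "job-b": {"launcher": "gpu-node-1", "worker-0": "gpu-node-2"},
}

# Each launcher placed on the same node as its worker (the requested behavior).
colocated = {
    "job-a": {"launcher": "gpu-node-1", "worker-0": "gpu-node-1"},
    "job-b": {"launcher": "gpu-node-2", "worker-0": "gpu-node-2"},
}

# Suppose job-a has completed and job-b is still running.
scattered_free = removable_nodes(scattered, {"job-a"}, nodes)
colocated_free = removable_nodes(colocated, {"job-a"}, nodes)
print(scattered_free, colocated_free)
```

In the scattered layout no node can be removed, because job-b's launcher and worker sit on different nodes. In the colocated layout the node that only hosted job-a becomes free.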

There is an open PR to change how launcher + workers are created. Would it be possible to solve this problem at the same time?

armandmcqueen, Jul 17 '19 21:07

Sorry, I don't quite follow. Could you explain more? Do you mean we need gang scheduling or binpack scheduling to allow users to scale down the cluster?

gaocegege, Jul 18 '19 01:07