kubedl
[M5/Feature Request] Orchestrating Job Roles in DAG Scheduling Scheme.
In our production environment, we found that a portion of scenarios relies on scheduling job replicas in stages; otherwise severe exceptions occur, for example:

- for PyTorchJob, if `Worker`s step into `Running` phase before `Master` (the `Master` pod may hang on pulling images or for other reasons), the job crashes immediately because the `Worker`s cannot ping `Master` successfully, and the Job goes to `Failed`.
- for MPIJob, if `Launcher` steps into `Running` phase before the `Worker`s (`Worker` pods may hang on pulling images), `Launcher` exits unexpectedly because the `kubectl exec` command cannot reach the target container in each `Worker`; similarly, the Job finally fails.
In addition, DAG scheduling can also improve efficiency in certain scenarios, for example: scheduling `PS` before `Worker` for `TFJob` reduces the duration of worker stall.
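The core idea above can be sketched as a small gating function: model each job role's upstream dependencies as edges in a DAG, and only release a role for scheduling once all of its upstream roles are `Running`. This is an illustrative sketch only (the type and function names here are hypothetical, not KubeDL's actual API):

```go
package main

import "fmt"

// dag maps a role name to the upstream roles it depends on.
// An edge "Master" -> "Worker" means Worker replicas are only
// scheduled once all Master replicas are Running.
type dag map[string][]string

// schedulable returns the roles that are not yet running and
// whose upstream dependencies are all in the Running phase.
func schedulable(deps dag, running map[string]bool) []string {
	var out []string
	for role, ups := range deps {
		if running[role] {
			continue // already running, nothing to schedule
		}
		ready := true
		for _, up := range ups {
			if !running[up] {
				ready = false
				break
			}
		}
		if ready {
			out = append(out, role)
		}
	}
	return out
}

func main() {
	// PyTorchJob case from the issue: Worker depends on Master.
	deps := dag{"Master": nil, "Worker": {"Master"}}

	// Initially only Master is eligible; Worker is gated.
	fmt.Println(schedulable(deps, map[string]bool{}))
	// Once Master is Running, Worker becomes eligible.
	fmt.Println(schedulable(deps, map[string]bool{"Master": true}))
}
```

The same structure covers the other examples: `Launcher` depending on `Worker` for MPIJob, or `Worker` depending on `PS` for TFJob.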
@SimonCqk Good idea! 👍 But for MPIJob, the init container of the launcher pod waits for the worker pods to be running, which makes sure the `Launcher` will not step into `Running` phase before the `Worker`s. So will you use DAG scheduling as a uniform method to solve the problems you described, and use the DAG scheduling method to replace the init container of the launcher pod in the MPIJob case?
> But for MPIJob, the init container of the launcher pod waits for the worker pods to be running, which makes sure the `Launcher` will not step into `Running` phase before the `Worker`s.
Hi @HeGaoYuan, what you described is correct: the init container of the MPI `Launcher` stalls while waiting for the `Worker`s to be running, and DAG scheduling can be a generic method to eliminate it. However, you have also surfaced a bug in this area; I'll fix it ASAP.