
[M5/Feature Request] Orchestrating Job Roles in DAG Scheduling Scheme.

Open SimonCqk opened this issue 3 years ago • 2 comments

In our production environment, we found that a portion of scenarios rely on scheduling job replicas in stages; otherwise there will be severe failures, for example:

  1. For PyTorchJob, if the Workers step into the Running phase before the Master (the Master pod may hang on pulling images or for other reasons), they crash immediately because the Workers cannot ping the Master successfully, and the Job goes to Failed.
  2. For MPIJob, if the Launcher steps into the Running phase before the Workers (the Worker pods may hang on pulling images), the Launcher exits unexpectedly because the kubectl exec command cannot reach the target container in each Worker; similarly, the Job finally fails.

In addition, DAG scheduling can also improve efficiency in certain scenarios. For example, scheduling the PS before the Workers for a TFJob reduces the duration of worker stalls.
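To make the idea concrete, here is a minimal sketch of how a controller could gate replica creation on a role DAG. All names here (`ReplicaType`, `dagEdges`, `readyToCreate`) are hypothetical illustrations, not kubedl's actual API:

```go
package main

import "fmt"

// ReplicaType identifies a role within a distributed training job.
type ReplicaType string

const (
	Master ReplicaType = "Master"
	Worker ReplicaType = "Worker"
)

// dagEdges maps each replica type to the upstream types that must be
// Running before it is created (hypothetical layout for illustration).
var dagEdges = map[ReplicaType][]ReplicaType{
	Worker: {Master}, // PyTorchJob: Workers wait for the Master.
}

// jobStatus records which replica types have all pods Running.
type jobStatus map[ReplicaType]bool

// readyToCreate returns true when every upstream dependency of rt is
// Running, so the controller may create pods of type rt.
func readyToCreate(rt ReplicaType, status jobStatus) bool {
	for _, dep := range dagEdges[rt] {
		if !status[dep] {
			return false
		}
	}
	return true
}

func main() {
	status := jobStatus{Master: false}
	fmt.Println("create Workers?", readyToCreate(Worker, status)) // false: Master not Running yet
	status[Master] = true
	fmt.Println("create Workers?", readyToCreate(Worker, status)) // true: Master is Running
}
```

The same edge map would cover the other cases by adding, e.g., Launcher → Workers for MPIJob or Worker → PS for TFJob.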

SimonCqk avatar Mar 22 '21 09:03 SimonCqk

@SimonCqk Good idea! 👍 But for MPIJob, the init container of the launcher pod waits for the worker pods to be running, which makes sure the Launcher will not step into the Running phase before the Workers. So would you use DAG scheduling as a uniform method to solve the problems you described, and use it to replace the launcher pod's init container in the MPIJob case?

HeGaoYuan avatar Jul 21 '21 14:07 HeGaoYuan

> But for MPIJob, the init container of the launcher pod waits for the worker pods to be running, which makes sure the Launcher will not step into the Running phase before the Workers.

Hi @HeGaoYuan, what you described is correct: the init container of the MPI Launcher stalls to wait for the Workers to be running, and DAG scheduling can be a generic method to eliminate it. You also raised a bug related to this issue; I'll fix it ASAP.
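For reference, this is a minimal sketch of the kind of wait loop the Launcher's init container effectively performs today, and which DAG scheduling would make unnecessary by delaying Launcher creation in the controller instead. The polling condition is a hypothetical stand-in for a real Kubernetes API query:

```go
package main

import (
	"fmt"
	"time"
)

// waitFor polls the given condition until it holds, which is roughly what
// the Launcher's init container does while waiting for the Workers. Under
// the DAG scheme the controller would not create the Launcher until the
// condition held, so no in-pod waiting would be needed.
func waitFor(allWorkersRunning func() bool, interval time.Duration) {
	for !allWorkersRunning() {
		time.Sleep(interval)
	}
}

func main() {
	// Hypothetical stand-in for querying Worker pod phases via the
	// Kubernetes API: here the Workers "become Running" on the third poll.
	polls := 0
	condition := func() bool {
		polls++
		return polls >= 3
	}
	waitFor(condition, 10*time.Millisecond)
	fmt.Println("all Workers Running; Launcher may start")
}
```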

SimonCqk avatar Aug 15 '21 12:08 SimonCqk