kubedl icon indicating copy to clipboard operation
kubedl copied to clipboard

refactor(mpi): remove global launcherRunsWorkload flag.

Open SimonCqk opened this issue 3 years ago • 9 comments

Signed-off-by: SimonCqk [email protected]

Ⅰ. Describe what this PR does

remove launcherRunsWorkload global startup flag, which can be inferred by whether has Launcher role in job spec.

II. Does this pull request fix one issue?

fix #194

III. Special notes for reviewers if any.

SimonCqk avatar Oct 21 '21 12:10 SimonCqk

Codecov Report

Merging #198 (d1cbc96) into master (0eb96cd) will not change coverage. The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #198   +/-   ##
=======================================
  Coverage   23.18%   23.18%           
=======================================
  Files          75       75           
  Lines        4502     4502           
=======================================
  Hits         1044     1044           
  Misses       3319     3319           
  Partials      139      139           
Flag Coverage Δ
unittests 23.18% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.


Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 0eb96cd...d1cbc96. Read the comment docs.

codecov-commenter avatar Oct 21 '21 12:10 codecov-commenter

does this mean the launcher pod always run the workload ? is this expected ?

jian-he avatar Oct 22 '21 00:10 jian-he

does this mean the launcher pod always run the workload ? is this expected ?

if launcher role is in job spec, then it is, otherwise workers should tigger workload running.

SimonCqk avatar Oct 25 '21 16:10 SimonCqk

does this mean the launcher pod always run the workload ? is this expected ?

if launcher role is in job spec, then it is, otherwise workers should tigger workload running.

then, why do we need the launcher pod in the first place, if it runs the same as worker..

jian-he avatar Oct 26 '21 05:10 jian-he

anyway, the patch looks ok

jian-he avatar Oct 26 '21 05:10 jian-he

then, why do we need the launcher pod in the first place, if it runs the same as worker..

Launcher pod execution scripts are generated by controller and it triggers training by hooking scripts in OpenMPI framework extension points, if only Worker roles in job spec, user has to handle it manually.

SimonCqk avatar Oct 26 '21 12:10 SimonCqk

does this mean the launcher pod always run the workload ? is this expected ?

if launcher role is in job spec, then it is, otherwise workers should tigger workload running.

I don't think it is ok. Because for the mpijob, we always have the launcher role in job spec.

HeGaoYuan avatar Oct 26 '21 15:10 HeGaoYuan

does this mean the launcher pod always run the workload ? is this expected ?

if launcher role is in job spec, then it is, otherwise workers should tigger workload running.

I don't think it is ok. Because for the mpijob, we always have the launcher role in job spec.

yes, that's what I mean, normally mpijob will be driven by launcher commands. In that case, should launcherRunsWorkload always be true?

SimonCqk avatar Oct 26 '21 15:10 SimonCqk

does this mean the launcher pod always run the workload ? is this expected ?

if launcher role is in job spec, then it is, otherwise workers should tigger workload running.

I don't think it is ok. Because for the mpijob, we always have the launcher role in job spec.

yes, that's what I mean, normally mpijob will be driven by launcher commands. In that case, should launcherRunsWorkload always be true?

In my opinion, launcherRunsWorkload should not always be true.

HeGaoYuan avatar Nov 09 '21 05:11 HeGaoYuan