
launcher has gone but not workers


[screenshot: worker pods still running after the launcher was deleted]

After deleting the MPIJob, the launcher has been deleted but the workers have not. How could this happen?

dragonofac avatar Jul 18 '19 12:07 dragonofac

Yes, I ran into the same problem. Actually, if you wait 5-6 minutes after the launcher finishes the job, you can see the workers get terminated automatically.
I'm just wondering where this delay is configured; I couldn't find a clue in the mpi-operator (MPI controller) code. Could someone point out the exact location of this setting? Thanks!

Winowang avatar Jul 19 '19 13:07 Winowang

You need to set cleanPodPolicy to clean up the worker pods. See https://github.com/kubeflow/mpi-operator/blob/master/examples/v1alpha2/tensorflow-benchmarks.yaml#L7.
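For reference, a minimal sketch of where that field sits in a v1alpha2 MPIJob spec (values here are illustrative; see the linked example for the full manifest):

```yaml
apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  name: tensorflow-benchmarks
spec:
  slotsPerWorker: 1
  # Delete worker pods that are still Running once the job finishes.
  # Other accepted values are None and All.
  cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      # ... pod template ...
    Worker:
      replicas: 2
      # ... pod template ...
```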

rongou avatar Jul 19 '19 16:07 rongou

You need to set cleanPodPolicy to clean up the worker pods. See https://github.com/kubeflow/mpi-operator/blob/master/examples/v1alpha2/tensorflow-benchmarks.yaml#L7.

Thanks for the advice, but the issue is that the workers don't stop immediately when the launcher is completed; there is about a 5 minute delay. Is that a Kubernetes default, or can we change it in the operator? Thanks.

Winowang avatar Jul 22 '19 01:07 Winowang

@Winowang What's the status of the pods when the launcher is completed? Is it in terminating or running?

If it is in terminating and takes about 5 mins, then we can do nothing.

gaocegege avatar Jul 22 '19 01:07 gaocegege

@Winowang What's the status of the pods when the launcher is completed? Is it in terminating or running?

If it is in terminating and takes about 5 mins, then we can do nothing.

[screenshot: launcher pod Completed, worker pods still Running]

Hi, thanks for the reply. As the screenshot above shows, the launcher is already in the Completed status but the workers are still Running (for about 5 minutes).
Actually it's the slow MPIJob status update: only after the MPIJob is updated to Succeeded are the workers terminated.

Winowang avatar Jul 23 '19 02:07 Winowang

Then I will take a look

/cc @terrytangyuan @wackxu

gaocegege avatar Jul 23 '19 02:07 gaocegege

/assign

gaocegege avatar Jul 23 '19 02:07 gaocegege

/assign @wuchunghsuan

gaocegege avatar Jul 23 '19 02:07 gaocegege

Actually it's the slow MPIJob status update: only after the MPIJob is updated to Succeeded are the workers terminated.

@Winowang I have tested some examples in my local environment and cannot reproduce it. When the launcher is completed, what is the status of the MPIJob? Running? And after about 5 minutes it becomes Succeeded? Also, what cleanPodPolicy have you set?

wackxu avatar Jul 23 '19 04:07 wackxu

Hi all, just an addition: the situation I mentioned above only happened when the launcher and a worker were on the same node (e.g. the launcher and worker-1 were both on node1; in that case the MPIJob status update was very slow). Thanks.

Actually it's the slow MPIJob status update: only after the MPIJob is updated to Succeeded are the workers terminated.

@Winowang I have tested some examples in my local environment and cannot reproduce it. When the launcher is completed, what is the status of the MPIJob? Running? And after about 5 minutes it becomes Succeeded? Also, what cleanPodPolicy have you set?

When the launcher is completed, the status of the MPIJob is still Running, and after about 5 minutes it becomes Succeeded. I tested several times and it happens randomly... Thanks.

Winowang avatar Jul 23 '19 07:07 Winowang

Hi all, I think I found the reason.
It is caused by the defaultControllerRateLimiter. The default setting is NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second), so if the job takes a long time to finish, the period between checks of the MPIJob status becomes longer and longer (i.e. the frequency drops).
Thanks.
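To make the backoff concrete, here is a small self-contained sketch, assuming client-go's workqueue package; the job key is made up and this is not the operator's actual reconcile loop. It builds the same per-item exponential rate limiter and prints how the requeue delay grows:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Per-item exponential backoff with the defaults quoted above:
	// 5 ms base delay, capped at 1000 s. workqueue.DefaultControllerRateLimiter()
	// combines this limiter with an overall token bucket.
	rl := workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second)

	// Each successive requeue of the same key doubles the delay:
	// 5ms, 10ms, 20ms, ... until it hits the 1000 s cap.
	key := "default/tensorflow-benchmarks" // hypothetical MPIJob key
	for i := 0; i < 20; i++ {
		fmt.Printf("requeue %2d -> next delay %v\n", i+1, rl.When(key))
	}
}
```

After enough requeues of the same key the delay reaches many minutes (it is capped at 1000 s), which would line up with the multi-minute lag in the MPIJob status update described above.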

Winowang avatar Jul 24 '19 00:07 Winowang

Hi, I got the same issue in training-operator. It seems like it took the workers hours to be completed.

Same situation here: if the workers are not on the same node as the launcher, the job completes as soon as the launcher is completed. Otherwise the workers take about 40-50 minutes to complete.

Hi all, I think I found the reason. It is caused by the defaultControllerRateLimiter. The default setting is NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second), so if the job takes a long time to finish, the period between checks of the MPIJob status becomes longer and longer (i.e. the frequency drops). Thanks.

cheimu avatar Mar 12 '22 13:03 cheimu