mpi-operator
launcher has gone but not workers
After deleting the MPIJob, the launcher was deleted but not the workers. How could this happen?
Yes, I met the same problem. Actually, if you wait 5-6 minutes after the launcher finishes the job, you can see the workers get terminated automatically.
I just wonder where this timing comes from and couldn't find a clue in the mpi-operator (MPI controller) code. Could someone point out the exact location of this setting? Thanks!
You need to set cleanPodPolicy to clean up the worker pods. See https://github.com/kubeflow/mpi-operator/blob/master/examples/v1alpha2/tensorflow-benchmarks.yaml#L7.
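For reference, a minimal sketch of the relevant part of that linked example (v1alpha2 API per the URL; the empty mpiReplicaSpecs is a placeholder, and per the common Kubeflow job types cleanPodPolicy accepts Running, All, or None):

```yaml
apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  name: tensorflow-benchmarks
spec:
  slotsPerWorker: 1
  cleanPodPolicy: Running  # terminate worker pods still running once the job finishes
  mpiReplicaSpecs: {}      # launcher/worker pod templates go here, as in the linked example
```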
Thanks for the advice, but the issue is that the workers can't stop immediately when the launcher is completed; there is about a 5 minute delay. Is this a Kubernetes default, or can we change it in the operator? Thanks.
@Winowang What's the status of the pods when the launcher is completed? Are they Terminating or Running?
If they are Terminating and it takes about 5 minutes, then there is nothing we can do.
Hi, thanks for the reply. As the image above shows, the launcher is already in Completed status, but the workers are still Running (for about 5 minutes).
Actually, it's the slow MPIJob status update: only after the MPIJob is updated to the Succeeded status are the workers terminated.
Then I will take a look
/cc @terrytangyuan @wackxu
/assign
/assign @wuchunghsuan
@Winowang I have tested some examples in my local environment and cannot reproduce it. When the launcher is completed, what is the status of the MPIJob? Running? And does it become Succeeded after about 5 minutes? Also, what cleanPodPolicy have you set?
Hi all, just an addition: the situation I mentioned above only happened when the launcher and a worker were on the same node (e.g. the launcher and worker-1 were both on node1; then the MPIJob status update was very slow). Thanks.
When the launcher is completed, the status of the MPIJob is still Running, and after about 5 minutes it becomes Succeeded. I tested several times and it happened randomly... Thanks.
Hi all,
I think I found the reason.
The defaultControllerRateLimiter causes this. The default setting is NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second).
So if the job takes a long time to finish, the interval between queries of the MPIJob status grows longer (the frequency drops).
Thanks.
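To make the mechanism concrete, here is a minimal sketch of the limiter in question, assuming client-go's workqueue package; the capped variant at the end illustrates one possible mitigation and is not the operator's actual code:

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// DefaultControllerRateLimiter combines a per-item exponential backoff
	// (base 5ms, capped at 1000s) with an overall 10 qps / burst-100 token
	// bucket. Every requeue of the same MPIJob key doubles its delay.
	def := workqueue.DefaultControllerRateLimiter()
	_ = def

	// Hypothetical variant: same shape, but the per-item backoff is capped
	// at 10s, so a completed launcher would be re-synced promptly.
	capped := workqueue.NewMaxOfRateLimiter(
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 10*time.Second),
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)
	queue := workqueue.NewRateLimitingQueue(capped)
	fmt.Println(queue.Len()) // the queue is a drop-in controller work queue
}
```

The arithmetic matches the delays reported in this thread: after about 16 consecutive requeues of the same key, the per-item delay is 5ms x 2^16 ≈ 328s (over 5 minutes), and after 19 requeues it is roughly 44 minutes.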
Hi. I got the same issue in training-operator. It seems it took the workers hours to complete.
Same situation. If the workers are not on the same node as the launcher, they complete as soon as the launcher completes; otherwise the workers take about 40-50 minutes to complete.