mpi-operator

mpijob restarts a few hours after launcher completed.

Open ustcyue opened this issue 6 years ago • 6 comments

My job definition is like this.

apiVersion: "kubeflow.org/v1alpha1"
kind: "MPIJob"
metadata:
  name: {{ job_name }}
  labels:
    exp_name: {{ exp_name }}
    user: {{ user_name }}

spec:
  backoffLimit: 0
  ......
      restartPolicy: Never

After the launcher job finished, with either a Failed or Succeeded status, all worker pods terminated normally. However, after around three hours, the whole job automatically restarted. Is this expected? Should I delete the mpijob after each run?

ustcyue • Jan 31 '19 03:01

That's strange. Did you do anything with the cluster during those 3 hours?

rongou • Jan 31 '19 16:01

@rongou No, I didn't do anything. Actually the delay is not fixed at 3 hours; sometimes the job restarts after 1 day, so it seems to happen at random. My Kubernetes version is 1.10.5.

ustcyue • Feb 01 '19 06:02

Do you have the logs from mpi-operator?

rongou • Feb 01 '19 06:02

@rongou I cannot find any related logs from mpi-operator, maybe because the log level is set too high. How can I set the log level to INFO?

ustcyue • Feb 12 '19 02:02

@ustcyue You probably need to pass -alsologtostderr here: https://github.com/kubeflow/mpi-operator/blob/071a9bcfbad15b86c62fd0c418478ba408371867/deploy/3-mpi-operator.yaml#L23
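As a rough sketch of where that flag would go (not copied from the linked file): it is added to the `args` of the operator container in the Deployment. The container name and image below are placeholders, and any arguments already present in your copy of the manifest should be kept; `-alsologtostderr` is the only value taken from this thread.

```yaml
# Sketch only: name and image are placeholders, not the values
# from deploy/3-mpi-operator.yaml.
spec:
  template:
    spec:
      containers:
      - name: mpi-operator              # placeholder container name
        image: <mpi-operator-image>     # placeholder, keep your existing image
        args:
        # ...keep any existing arguments here...
        - "-alsologtostderr"            # flag suggested above
```

After re-applying the manifest, the operator pod has to be recreated for the new argument to take effect.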

terrytangyuan • Feb 12 '19 02:02

@terrytangyuan I added the option to mpi-operator, but I still cannot find any meaningful log entries around the time of the automatic restart.

ustcyue • Feb 18 '19 02:02