mpi-operator
MPIJob restarts a few hours after the launcher completed.
My job definition looks like this:
apiVersion: "kubeflow.org/v1alpha1"
kind: "MPIJob"
metadata:
  name: {{ job_name }}
  labels:
    exp_name: {{ exp_name }}
    user: {{ user_name }}
spec:
  backoffLimit: 0
......
restartPolicy: Never
After the launcher job finishes, whether it Failed or Succeeded, all worker pods terminate normally. However, about three hours later, the whole job restarts automatically. Is this expected? Should I delete the MPIJob after each run?
That's strange. Did you do anything with the cluster during those 3 hours?
@rongou No, I didn't do anything. Actually, the delay is not fixed at 3 hours; sometimes the job restarts after 1 day, so it seems to be random. My Kubernetes version is 1.10.5.
Do you have the logs from mpi-operator?
@rongou I cannot find any related logs from mpi-operator, maybe because the log level is too high. How can I set the log level to INFO?
@ustcyue You probably need to pass -alsologtostderr here: https://github.com/kubeflow/mpi-operator/blob/071a9bcfbad15b86c62fd0c418478ba408371867/deploy/3-mpi-operator.yaml#L23
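For illustration, the change could look roughly like the fragment below. This is only a sketch of the container section of the operator's Deployment: the container name and image are placeholders, any other arguments in the real file are omitted, and the -v flag is an assumption that only applies if the operator binary registers the standard glog verbosity flag.

```yaml
# Hypothetical fragment of the mpi-operator Deployment (not the actual file).
spec:
  template:
    spec:
      containers:
      - name: mpi-operator            # assumed container name
        image: "PLACEHOLDER_IMAGE"    # placeholder; keep the image you already deploy
        args:
        - -alsologtostderr            # glog flag: also write logs to stderr, so kubectl logs can show them
        - -v=4                        # assumption: glog verbosity level; drop it if the binary rejects the flag
```

After re-applying the Deployment, the operator pod's stderr output should be readable with kubectl logs on that pod.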
@terrytangyuan I added the option to mpi-operator, but I still cannot find any meaningful logs from the time of the automatic restart.