czheng94
Any update on this issue? I have experienced the same problem. Some workers went OOM, and the launcher failed in the end with its pod deleted by kube batch Job....
@terrytangyuan Yeah this is a separate issue. According to my understanding, if MPI doesn't report OOM from workers to launcher (which is probably the current status), StatefulSets will just directly...
Any updates on this issue? From my side, it seems like the workers are not even failing. An example failure case can be generated with this jobspec: ``` apiVersion: kubeflow.org/v1alpha2...
@terrytangyuan I meant **launcher** logs. mpirun doesn't emit any logs in the workers. The key is that the failed launcher pod will be deleted by the kube batch job after it's done...
Indeed, this is the expected behavior of a kube batch job. If you set `restartPolicy = "OnFailure"` in the launcher pod template, all pods will be terminated and deleted if backoff limit...
Any thoughts on this issue? Some updates supporting the necessity of the Error ConditionType: Similar "Error" conditions do exist in other Kubernetes components, e.g. in Deployment and ReplicaSet. `ReplicaSet` has...
@terrytangyuan > but my main concern is that it's hard to define the types of errors that's retriable because this depends on the specific controller, I don't think we need...
@gaocegege elastic training is really cool! I'm not very familiar with torch elastic (but interested to learn more). Are we thinking about supporting a separate operator for pytorch elastic like...