czheng94 comments

Results: 8 comments by czheng94

Launcher and worker statuses do not correctly indicate the underlying states

Any update on this issue? I have experienced the same problem. Some workers went OOM, and the launcher failed in the end with its pod deleted by kube batch Job....

Launcher and worker statuses do not correctly indicate the underlying states

@terrytangyuan Yeah this is a separate issue. According to my understanding, if MPI doesn't report OOM from workers to launcher (which is probably the current status), StatefulSets will just directly...

Launcher and worker statuses do not correctly indicate the underlying states

Any updates on this issue? From my side, it seems like the workers are not even failing. An example failure case can be generated with this jobspec: ``` apiVersion: kubeflow.org/v1alpha2...

Can't access launcher logs after MpiJob fails

@terrytangyuan I meant **launcher** logs. mpirun doesn't emit any logs in the workers. The key is that the failed launcher pod will be deleted by the kube batch job after it's done...

Can't access launcher logs after MpiJob fails

Indeed, this is the expected behavior of a kube batch job. If you set `restartPolicy = "OnFailure"` in the launcher pod template, all pods will be terminated and deleted if backoff limit...

Proposal: Add Error JobConditionType to reflect controller error (Resource Quota Error) into the status

Any thoughts on this issue? Some updates supporting the necessity of the Error ConditionType: Similar "Error" conditions do exist in other Kubernetes components, e.g. in Deployment and ReplicaSet. `ReplicaSet` has...

Proposal: Add Error JobConditionType to reflect controller error (Resource Quota Error) into the status

@terrytangyuan > but my main concern is that it's hard to define the types of errors that's retriable because this depends on the specific controller, I don't think we need...

[feature] Rethink distributed Pytorch backoff retry

@gaocegege elastic training is really cool! I'm not very familiar with torch elastic (but interested to learn more). Are we thinking about supporting a separate operator for pytorch elastic like...