kundan kumar

Results: 22 comments by kundan kumar

Hello @RasButAss, are you still working on this issue? If not, I would love to take it up.

Hi @andreyvelich, due to other commitments, I’m currently unable to continue working on this issue. I’d be happy for @kris-gaudel to take it over. Some initial work has been...

I would like to work on this issue. /assign

@andreyvelich @Electronic-Waste

> Training Operator version:
>
> ```
> $ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"
> ***.azurecr.io/kubeflow/training-operator:v1-5a5f92d
> ```
>
> ```
> # Start PyTorchJob...
> ```

Tried with the following configuration and hit a similar issue: the behavior of `training_operator_jobs_successful_total` is still not reliable. When the same jobs are run multiple times, the increment comes out differently each time. @andreyvelich could you clarify...

The increment in the `training_operator_jobs_successful_total` metric is unpredictable because of the condition used to decide whether a replica is the master. The `expected == 0` condition is insufficient on its own. Ideally we...
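To illustrate the point about the master check, here is a minimal, self-contained sketch of a stricter gate: only count a successful job for the Master replica type, instead of inferring "master" from `expected == 0` alone. The type and constant names below are hypothetical placeholders, not the operator's real API:

```go
package main

import "fmt"

// replicaType stands in for the operator's ReplicaType; these names are
// illustrative only, not the actual kubeflow API.
type replicaType string

const (
	replicaMaster replicaType = "Master"
	replicaWorker replicaType = "Worker"
)

// shouldIncrementSuccess gates the jobs-successful counter on the replica
// actually being the Master. Checking expected == 0 by itself can also
// fire for other fully-succeeded replica groups, so the same job may be
// counted a varying number of times.
func shouldIncrementSuccess(rtype replicaType, expected int32) bool {
	return rtype == replicaMaster && expected == 0
}

func main() {
	fmt.Println(shouldIncrementSuccess(replicaMaster, 0)) // true: count the job once
	fmt.Println(shouldIncrementSuccess(replicaWorker, 0)) // false: workers never count
}
```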

There is also a race condition here: https://github.com/kubeflow/trainer/blob/5840e816e2cc1ef9b65064fa3e245add4cf9be25/pkg/controller.v1/pytorch/pytorchjob_controller.go#L475

Alternative code:

```go
patch := client.MergeFrom(pytorchjob.DeepCopy())
// ...apply the status mutations here, between the snapshot and the patch...
err := r.Status().Patch(context.Background(), pytorchjob, patch)
```
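For context, a fuller self-contained sketch of that read-modify-patch flow with controller-runtime; the helper name, the condition argument, and the kubeflow import path are assumptions for illustration, not the controller's actual code:

```go
package controller

import (
	"context"

	kubeflowv1 "github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1" // assumed import path
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// updateStatusWithPatch is a hypothetical helper showing the merge-patch
// pattern: snapshot the object, mutate it, then send only the status diff.
func updateStatusWithPatch(ctx context.Context, c client.Client, job *kubeflowv1.PyTorchJob, cond kubeflowv1.JobCondition) (ctrl.Result, error) {
	base := job.DeepCopy()                                      // snapshot before mutating
	job.Status.Conditions = append(job.Status.Conditions, cond) // the actual status change
	if err := c.Status().Patch(ctx, job, client.MergeFrom(base)); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```

The advantage over `Update` is that a merge patch sends only the diff against the snapshot, so it does not fail on a stale `resourceVersion` when another writer touches the object between the read and the write.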

@zzmao @HumairAK Is this issue still being worked on, or can I take it up?

Is this issue resolved or still open? (The latest attempt to run the failing scenario completed successfully.) If the issue is still not resolved, please mention the steps to reproduce it...