kundan kumar

Results: 22 comments by kundan kumar

Hello @RasButAss, are you still working on this issue? If not, I would love to take it up.

Hi @andreyvelich, due to other commitments, I’m currently unable to continue working on this issue. I’d be happy for @kris-gaudel to take it over. Some initial work has been...

I would like to work on this issue. /assign

@andreyvelich @Electronic-Waste

> Training Operator version:
>
> ```
> $ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"
> ***.azurecr.io/kubeflow/training-operator:v1-5a5f92d
> ```
>
> ```
> # Start PyTorchJob...
> ```

Tried with the following configuration and hit a similar issue: the behavior of `training_operator_jobs_successful_total` is still not reliable. When the same jobs are run multiple times, the increment comes out differently each time. @andreyvelich could you clarify...

The increment in the `training_operator_jobs_successful_total` metric is unpredictable because of the condition used to decide whether a replica is the master. The `expected == 0` condition is insufficient on its own. Ideally we...
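To illustrate the point about the master check, here is a minimal, self-contained sketch of a stricter gate: only count a successful job for the Master replica type, instead of inferring "master" from `expected == 0` alone. The type and constant names below are hypothetical placeholders, not the operator's real API:

```go
package main

import "fmt"

// replicaType stands in for the operator's ReplicaType; these names are
// illustrative only, not the actual kubeflow API.
type replicaType string

const (
	replicaMaster replicaType = "Master"
	replicaWorker replicaType = "Worker"
)

// shouldIncrementSuccess gates the jobs-successful counter on the replica
// actually being the Master. Checking expected == 0 by itself can also
// fire for other fully-succeeded replica groups, so the same job may be
// counted a varying number of times.
func shouldIncrementSuccess(rtype replicaType, expected int32) bool {
	return rtype == replicaMaster && expected == 0
}

func main() {
	fmt.Println(shouldIncrementSuccess(replicaMaster, 0)) // true: count the job once
	fmt.Println(shouldIncrementSuccess(replicaWorker, 0)) // false: workers never count
}
```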

There is also a race condition here: https://github.com/kubeflow/trainer/blob/5840e816e2cc1ef9b65064fa3e245add4cf9be25/pkg/controller.v1/pytorch/pytorchjob_controller.go#L475

Alternative code:

```go
patch := client.MergeFrom(pytorchjob.DeepCopy())
// ...apply the status mutations here, between the snapshot and the patch...
err := r.Status().Patch(context.Background(), pytorchjob, patch)
```
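For context, a fuller self-contained sketch of that read-modify-patch flow with controller-runtime; the helper name, the condition argument, and the kubeflow import path are assumptions for illustration, not the controller's actual code:

```go
package controller

import (
	"context"

	kubeflowv1 "github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1" // assumed import path
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// updateStatusWithPatch is a hypothetical helper showing the merge-patch
// pattern: snapshot the object, mutate it, then send only the status diff.
func updateStatusWithPatch(ctx context.Context, c client.Client, job *kubeflowv1.PyTorchJob, cond kubeflowv1.JobCondition) (ctrl.Result, error) {
	base := job.DeepCopy()                                      // snapshot before mutating
	job.Status.Conditions = append(job.Status.Conditions, cond) // the actual status change
	if err := c.Status().Patch(ctx, job, client.MergeFrom(base)); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```

The advantage over `Update` is that a merge patch sends only the diff against the snapshot, so it does not fail on a stale `resourceVersion` when another writer touches the object between the read and the write.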

@zzmao @HumairAK Is this issue still being worked on, or can I take it up?

Is this issue resolved or still open? (The latest attempt to run the failing scenario completed successfully.) If the issue is still not resolved, please mention the steps to reproduce it...