Fix Prometheus metrics counter
What this PR does / why we need it:
Updates training_operator_jobs_successful_total metric properly.
Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #2408
Checklist:
- [ ] Docs included if any changes are user facing
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign jeffwan for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Please review. Thankyou. /cc @andreyvelich @tenzen-y
Hi @izuku-sds — first of all, thanks for working on this! Just curious, does this fix the issue with training_operator_jobs_deleted_total (described in https://github.com/kubeflow/trainer/issues/2408) as well? Wanted to know if I should create a separate issue for it.
Hello @ishaan-mehta, This pr fixes:
training_operator_jobs_successful_totalmetric update- Earlier parameters similar to above were not visible from starting in
/metrics. They will be visible now.
Feel free to create separate issue for training_operator_jobs_deleted_total
currently I am trying to fix the tests which are not passed with this new update.
If In future following error causes some problem then given updated code could be used. Not changing it now.
Error:
2025-03-21T19:47:53Z ERROR Reconcile PyTorchJob error {"pytorchjob": {"name":"pytorch-ddp","namespace":"default"}, "error": "Operation cannot be fulfilled on pytorchjobs.kubeflow.org \"pytorch-ddp\": the object has been modified; please apply your changes to the latest version and try again"}
2025-03-21T19:47:53Z ERROR Reconciler error {"controller": "pytorchjob-controller", "object": {"name":"pytorch-ddp","namespace":"default"}, "namespace": "default", "name": "pytorch-ddp", "reconcileID": "1fbb4bf9-e2a3-4ba6-bce6-2e19efcc98a4", "error": "Operation cannot be fulfilled on pytorchjobs.kubeflow.org \"pytorch-ddp\": the object has been modified; please apply your changes to the latest version and try again"}
Updated code:
// UpdateJobStatusInApiServer updates the job status in to cluster.
func (r *PyTorchJobReconciler) UpdateJobStatusInApiServer(job interface{}, jobStatus *kubeflowv1.JobStatus) error {
if jobStatus.ReplicaStatuses == nil {
jobStatus.ReplicaStatuses = map[kubeflowv1.ReplicaType]*kubeflowv1.ReplicaStatus{}
}
pytorchjob, ok := job.(*kubeflowv1.PyTorchJob)
trainingoperatorcommon.ClearGeneratedFields(&pytorchjob.ObjectMeta)
if !ok {
return fmt.Errorf("%+v is not a type of PyTorchJob", job)
}
// Retry to update in case of conflict
err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
// Get the latest version
latest := &kubeflowv1.PyTorchJob{}
if err := r.Get(context.Background(), types.NamespacedName{
Namespace: pytorchjob.Namespace,
Name: pytorchjob.Name,
}, latest); err != nil {
return err
}
// Check if the status needs to be updated.
if equality.Semantic.DeepEqual(&latest.Status, jobStatus) {
return nil
}
// Copy the updated status over.
latest.Status = *jobStatus.DeepCopy()
return r.Status().Update(context.Background(), latest)
})
if err != nil {
r.Log.WithValues("pytorchjob", types.NamespacedName{
Namespace: pytorchjob.GetNamespace(),
Name: pytorchjob.GetName(),
})
return err
}
return nil
}
/retest
/cc @Electronic-Waste
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This pull request has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.