training-operator Fix Prometheus metrics counter

What this PR does / why we need it: Updates training_operator_jobs_successful_total metric properly.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged): Fixes #2408

Checklist:

[ ] Docs included if any changes are user facing

Mar 21 '25 04:03 izuku-sds

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign jeffwan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

Mar 21 '25 04:03 google-oss-prow[bot]

Please review. Thankyou. /cc @andreyvelich @tenzen-y

Mar 21 '25 04:03 izuku-sds

Hi @izuku-sds — first of all, thanks for working on this! Just curious, does this fix the issue with training_operator_jobs_deleted_total (described in https://github.com/kubeflow/trainer/issues/2408) as well? Wanted to know if I should create a separate issue for it.

Mar 21 '25 15:03 ishaan-mehta

Hello @ishaan-mehta, This pr fixes:

training_operator_jobs_successful_total metric update
Earlier parameters similar to above were not visible from starting in /metrics. They will be visible now.

Feel free to create separate issue for training_operator_jobs_deleted_total currently I am trying to fix the tests which are not passed with this new update.

Mar 21 '25 16:03 izuku-sds

If In future following error causes some problem then given updated code could be used. Not changing it now.

Error:

2025-03-21T19:47:53Z	ERROR	Reconcile PyTorchJob error	{"pytorchjob": {"name":"pytorch-ddp","namespace":"default"}, "error": "Operation cannot be fulfilled on pytorchjobs.kubeflow.org \"pytorch-ddp\": the object has been modified; please apply your changes to the latest version and try again"}
2025-03-21T19:47:53Z	ERROR	Reconciler error	{"controller": "pytorchjob-controller", "object": {"name":"pytorch-ddp","namespace":"default"}, "namespace": "default", "name": "pytorch-ddp", "reconcileID": "1fbb4bf9-e2a3-4ba6-bce6-2e19efcc98a4", "error": "Operation cannot be fulfilled on pytorchjobs.kubeflow.org \"pytorch-ddp\": the object has been modified; please apply your changes to the latest version and try again"}

Updated code:

// UpdateJobStatusInApiServer updates the job status in to cluster.
func (r *PyTorchJobReconciler) UpdateJobStatusInApiServer(job interface{}, jobStatus *kubeflowv1.JobStatus) error {
	if jobStatus.ReplicaStatuses == nil {
		jobStatus.ReplicaStatuses = map[kubeflowv1.ReplicaType]*kubeflowv1.ReplicaStatus{}
	}

	pytorchjob, ok := job.(*kubeflowv1.PyTorchJob)
	trainingoperatorcommon.ClearGeneratedFields(&pytorchjob.ObjectMeta)
	if !ok {
		return fmt.Errorf("%+v is not a type of PyTorchJob", job)
	}

	// Retry to update in case of conflict
	err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Get the latest version
		latest := &kubeflowv1.PyTorchJob{}
		if err := r.Get(context.Background(), types.NamespacedName{
			Namespace: pytorchjob.Namespace,
			Name:      pytorchjob.Name,
		}, latest); err != nil {
			return err
		}

		// Check if the status needs to be updated.
		if equality.Semantic.DeepEqual(&latest.Status, jobStatus) {
			return nil
		}
		// Copy the updated status over.
		latest.Status = *jobStatus.DeepCopy()
		return r.Status().Update(context.Background(), latest)
	})

	if err != nil {
		r.Log.WithValues("pytorchjob", types.NamespacedName{
			Namespace: pytorchjob.GetNamespace(),
			Name:      pytorchjob.GetName(),
		})
		return err
	}

	return nil
}

Mar 21 '25 19:03 izuku-sds

/retest

Mar 22 '25 00:03 izuku-sds

/cc @Electronic-Waste

Mar 24 '25 03:03 izuku-sds

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Jun 22 '25 05:06 github-actions[bot]

This pull request has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

Jul 12 '25 05:07 github-actions[bot]