
pytorchjob status.completionTime may be nil when backofflimit is nil

Open ycfnana opened this issue 3 years ago • 5 comments

When I start a PyTorchJob like this:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorchjob-2022-07-28t20-49-16-v6
  namespace: xxx
  labels:
    test: test
spec:
  runPolicy:
    backoffLimit: 0
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
          - command:
            - /bin/bash
            - /start.sh
            image: xxx
            name: pytorch
            resources:
              limits:
                cpu: "0.1"
                memory: 0.1Gi
                nvidia.com/gpu: 0
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
          - command:
            - /bin/bash
            - /start.sh
            image: xxx
            name: pytorch
            resources:
              limits:
                cpu: "0.1"
                memory: 0.1Gi
                nvidia.com/gpu: 0

Its status is this:

status:
  completionTime: "2022-07-29T11:41:19Z"
  replicaStatuses:
    Master:
      xxx
    Worker:
      xxx

But when backoffLimit is nil, completionTime also ends up nil. Looking at the code in the common library:

if runPolicy.BackoffLimit != nil {
	jobHasNewFailure := failed > prevReplicasFailedNum
	// new failures happen when status does not reflect the failures and active
	// is different than parallelism, otherwise the previous controller loop
	// failed updating status so even if we pick up failure it is not a new one
	exceedsBackoffLimit = jobHasNewFailure && (active != totalReplicas) &&
		(int32(previousRetry)+1 > *runPolicy.BackoffLimit)

	pastBackoffLimit, err = jc.PastBackoffLimit(jobName, runPolicy, replicas, pods)
	if err != nil {
		return err
	}
}

if exceedsBackoffLimit || pastBackoffLimit {
	// check if the number of pod restart exceeds backoff (for restart OnFailure only)
	// OR if the number of failed jobs increased since the last syncJob
	jobExceedsLimit = true
	failureMessage = fmt.Sprintf("Job %s has failed because it has reached the specified backoff limit", jobName)
} else if jc.PastActiveDeadline(runPolicy, jobStatus) {
	failureMessage = fmt.Sprintf("Job %s has failed because it was active longer than specified deadline", jobName)
	jobExceedsLimit = true
}

So jobExceedsLimit stays false when backoffLimit is nil and the PyTorchJob has not passed its active deadline. Is that a bug? Also, the exceedsBackoffLimit check uses the number of requeues (previousRetry) to decide whether the limit is exceeded, but the job is also requeued when reconcile fails for other reasons, such as a resource version conflict on a status update. Is that a bug too?
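To make the first question concrete, here is a minimal standalone sketch (plain local variables, not the controller's real types) of the control flow quoted above: when backoffLimit is nil the whole block is skipped, neither flag is ever set, and jobExceedsLimit can only become true via the active-deadline branch.

package main

import "fmt"

func main() {
	// backoffLimit stands in for runPolicy.BackoffLimit; it is nil here,
	// just as when the field is omitted from the PyTorchJob spec.
	var backoffLimit *int32

	exceedsBackoffLimit := false
	pastBackoffLimit := false

	if backoffLimit != nil {
		// Never reached when backoffLimit is nil, so both flags keep their
		// zero value even if replicas have already failed.
		exceedsBackoffLimit = true
		pastBackoffLimit = true
	}

	jobExceedsLimit := exceedsBackoffLimit || pastBackoffLimit
	fmt.Println(jobExceedsLimit) // prints "false"
}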

ycfnana · Jul 29 '22 11:07

I searched the code of the PyTorch controller and found this:

msg := fmt.Sprintf("PyTorchJob %s is failed because %d %s replica(s) failed.", pytorchjob.Name, failed, rtype)
r.Recorder.Event(pytorchjob, corev1.EventTypeNormal, commonutil.JobFailedReason, msg)
if pytorchjob.Status.CompletionTime == nil {
	now := metav1.Now()
	pytorchjob.Status.CompletionTime = &now
}
err := commonutil.UpdateJobConditions(jobStatus, commonv1.JobFailed, commonutil.JobFailedReason, msg)
if err != nil {
	commonutil.LoggerForJob(pytorchjob).Infof("Append job condition error: %v", err)
	return err
}

You set CompletionTime on pytorchjob.Status and then set the condition, so I printed the addresses like this:

if pytorchjob.Status.CompletionTime == nil {
	logrus.Error("now need to record completion time")
	now := metav1.Now()
	pytorchjob.Status.CompletionTime = &now
}
logrus.Errorf("jobstatus is %p and pytorchjobs.status is %p", jobStatus, &pytorchjob.Status)
logrus.Errorf("jobstatus is %v and pytorchjobs.status is %v", *jobStatus, pytorchjob.Status)
err := commonutil.UpdateJobConditions(jobStatus, commonv1.JobFailed, commonutil.JobFailedReason, msg)

The log output is this:

ERRO[0094] jobstatus is 0xc0002c73c0 and pytorchjobs.status is 0xc00088db50 
ERRO[0094] jobstatus is {[{Created True PyTorchJobCreated PyTorchJob pytorchjob-2022-07-28t20-49-16-v10 is created. 2022-08-01 14:50:28 +0800 CST 2022-08-01 14:50:28 +0800 CST} {Running True JobRunning PyTorchJob pytorchjob-2022-07-28t20-49-16-v10 is running. 2022-08-01 14:50:30 +0800 CST 2022-08-01 14:50:30 +0800 CST}] map[Master:0xc00171f8a8 Worker:0xc00171fa58] <nil> <nil> <nil>} and pytorchjobs.status is {[{Created True PyTorchJobCreated PyTorchJob pytorchjob-2022-07-28t20-49-16-v10 is created. 2022-08-01 14:50:28 +0800 CST 2022-08-01 14:50:28 +0800 CST} {Running True JobRunning PyTorchJob pytorchjob-2022-07-28t20-49-16-v10 is running. 2022-08-01 14:50:30 +0800 CST 2022-08-01 14:50:30 +0800 CST}] map[Master:0xc00171f8a8 Worker:0xc00171fa58] <nil> 2022-08-01 14:52:30.728314 +0800 CST m=+94.364590187 <nil>} 

Maybe this is the reason there is no completionTime when the job fails: the log shows that jobStatus and pytorchjob.Status are two different structs, and CompletionTime was set only on pytorchjob.Status, so if the controller later persists jobStatus, the completion time it just set is lost.
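Here is a minimal self-contained sketch of what the log above suggests is happening (it uses a hypothetical JobStatus struct, not the real API types): the completion time is written to one copy of the status, while the failed condition goes to a different copy, so the copy that keeps the conditions never carries the completion time.

package main

import (
	"fmt"
	"time"
)

// Hypothetical stand-in for the job status type.
type JobStatus struct {
	CompletionTime *time.Time
	Conditions     []string
}

func main() {
	pytorchStatus := JobStatus{} // plays the role of pytorchjob.Status
	jobStatus := pytorchStatus   // a separate copy, matching the two different addresses in the log

	// CompletionTime is set on the original...
	now := time.Now()
	pytorchStatus.CompletionTime = &now

	// ...but the JobFailed condition is appended to the copy.
	jobStatus.Conditions = append(jobStatus.Conditions, "Failed")

	fmt.Println(pytorchStatus.CompletionTime != nil) // true
	fmt.Println(jobStatus.CompletionTime != nil)     // false: this copy never gets the completion time
}

If that is indeed the cause, one possible direction for a fix (a guess, not something confirmed here by the maintainers) would be to set CompletionTime on the same jobStatus struct that UpdateJobConditions receives, or to copy it back before the status is written to the API server.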

ycfnana · Aug 01 '22 06:08

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] · Aug 25 '23 00:08

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

github-actions[bot] · Sep 14 '23 00:09

We should investigate the reasons and should fix this bug.

/reopen
/lifecycle frozen

tenzen-y · Sep 14 '23 08:09

@tenzen-y: Reopened this issue.

In response to this:

We should investigate the reasons and should fix this bug.

/reopen
/lifecycle frozen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow[bot] · Sep 14 '23 08:09