training-operator
PyTorchJob status.completionTime may be nil when backoffLimit is nil
When I start a PyTorchJob like this:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorchjob-2022-07-28t20-49-16-v6
  namespace: xxx
  labels:
    test: test
spec:
  runPolicy:
    backoffLimit: 0
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - command:
                - /bin/bash
                - /start.sh
              image: xxx
              name: pytorch
              resources:
                limits:
                  cpu: "0.1"
                  memory: 0.1Gi
                  nvidia.com/gpu: 0
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - command:
                - /bin/bash
                - /start.sh
              image: xxx
              name: pytorch
              resources:
                limits:
                  cpu: "0.1"
                  memory: 0.1Gi
                  nvidia.com/gpu: 0
its status includes a completionTime:
status:
  completionTime: "2022-07-29T11:41:19Z"
  replicaStatuses:
    Master:
      xxx
    Worker:
      xxx
But when backoffLimit is nil, completionTime will be nil. Looking at the relevant code in common:
if runPolicy.BackoffLimit != nil {
    jobHasNewFailure := failed > prevReplicasFailedNum
    // new failures happen when status does not reflect the failures and active
    // is different than parallelism, otherwise the previous controller loop
    // failed updating status so even if we pick up failure it is not a new one
    exceedsBackoffLimit = jobHasNewFailure && (active != totalReplicas) &&
        (int32(previousRetry)+1 > *runPolicy.BackoffLimit)
    pastBackoffLimit, err = jc.PastBackoffLimit(jobName, runPolicy, replicas, pods)
    if err != nil {
        return err
    }
}
if exceedsBackoffLimit || pastBackoffLimit {
    // check if the number of pod restart exceeds backoff (for restart OnFailure only)
    // OR if the number of failed jobs increased since the last syncJob
    jobExceedsLimit = true
    failureMessage = fmt.Sprintf("Job %s has failed because it has reached the specified backoff limit", jobName)
} else if jc.PastActiveDeadline(runPolicy, jobStatus) {
    failureMessage = fmt.Sprintf("Job %s has failed because it was active longer than specified deadline", jobName)
    jobExceedsLimit = true
}
So jobExceedsLimit stays false when backoffLimit is nil and the PyTorchJob has not passed its active deadline. Is that a bug? Also, the exceedsBackoffLimit check uses the number of requeues to decide whether the limit is exceeded, but the job is also requeued when the reconcile fails for other reasons, such as a resource-version conflict on update. Is that a bug too?
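To make the nil case concrete, here is a minimal, self-contained sketch of that control flow (the names only mimic the snippet above and the counting is simplified; this is not the real reconciler):

package main

import "fmt"

// jobExceedsLimit mimics the guard above: when backoffLimit is nil the whole
// block is skipped, so neither exceedsBackoffLimit nor pastBackoffLimit can
// ever become true, no matter how many replicas have failed.
func jobExceedsLimit(backoffLimit *int32, failedReplicas, previousRetry int32) bool {
    exceedsBackoffLimit := false
    pastBackoffLimit := false
    if backoffLimit != nil {
        exceedsBackoffLimit = failedReplicas > 0 && previousRetry+1 > *backoffLimit
        pastBackoffLimit = failedReplicas > *backoffLimit
    }
    return exceedsBackoffLimit || pastBackoffLimit
}

func main() {
    limit := int32(0)
    fmt.Println(jobExceedsLimit(&limit, 1, 0)) // true: backoffLimit=0 and a replica failed
    fmt.Println(jobExceedsLimit(nil, 1, 0))    // false: backoffLimit nil, the failure is never counted
}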
I searched the code of the PyTorch controller and found this:
msg := fmt.Sprintf("PyTorchJob %s is failed because %d %s replica(s) failed.", pytorchjob.Name, failed, rtype)
r.Recorder.Event(pytorchjob, corev1.EventTypeNormal, commonutil.JobFailedReason, msg)
if pytorchjob.Status.CompletionTime == nil {
    now := metav1.Now()
    pytorchjob.Status.CompletionTime = &now
}
err := commonutil.UpdateJobConditions(jobStatus, commonv1.JobFailed, commonutil.JobFailedReason, msg)
if err != nil {
    commonutil.LoggerForJob(pytorchjob).Infof("Append job condition error: %v", err)
    return err
}
The code sets CompletionTime on pytorchjob.Status and then appends the condition to jobStatus, so I printed the addresses of both, like this:
if pytorchjob.Status.CompletionTime == nil {
    logrus.Error("now need to record completion time")
    now := metav1.Now()
    pytorchjob.Status.CompletionTime = &now
}
logrus.Errorf("jobstatus is %p and pytorchjobs.status is %p", jobStatus, &pytorchjob.Status)
logrus.Errorf("jobstatus is %v and pytorchjobs.status is %v", *jobStatus, pytorchjob.Status)
err := commonutil.UpdateJobConditions(jobStatus, commonv1.JobFailed, commonutil.JobFailedReason, msg)
The log output is:
ERRO[0094] jobstatus is 0xc0002c73c0 and pytorchjobs.status is 0xc00088db50
ERRO[0094] jobstatus is {[{Created True PyTorchJobCreated PyTorchJob pytorchjob-2022-07-28t20-49-16-v10 is created. 2022-08-01 14:50:28 +0800 CST 2022-08-01 14:50:28 +0800 CST} {Running True JobRunning PyTorchJob pytorchjob-2022-07-28t20-49-16-v10 is running. 2022-08-01 14:50:30 +0800 CST 2022-08-01 14:50:30 +0800 CST}] map[Master:0xc00171f8a8 Worker:0xc00171fa58] <nil> <nil> <nil>} and pytorchjobs.status is {[{Created True PyTorchJobCreated PyTorchJob pytorchjob-2022-07-28t20-49-16-v10 is created. 2022-08-01 14:50:28 +0800 CST 2022-08-01 14:50:28 +0800 CST} {Running True JobRunning PyTorchJob pytorchjob-2022-07-28t20-49-16-v10 is running. 2022-08-01 14:50:30 +0800 CST 2022-08-01 14:50:30 +0800 CST}] map[Master:0xc00171f8a8 Worker:0xc00171fa58] <nil> 2022-08-01 14:52:30.728314 +0800 CST m=+94.364590187 <nil>}
Maybe that is why there is no completionTime when the job fails: jobStatus and pytorchjob.Status are two different objects, so the CompletionTime written to pytorchjob.Status never reaches the jobStatus that is updated afterwards.
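A tiny illustration of what those two different addresses imply (the type and variable names are made up for this example; it is not the controller code):

package main

import (
    "fmt"
    "time"
)

// JobStatus stands in for the real status type; only the relevant field is kept.
type JobStatus struct {
    CompletionTime *time.Time
}

func main() {
    // Two separate copies of the status, as the differing %p addresses in the log suggest.
    jobObjectStatus := JobStatus{} // status embedded in the job object
    jobStatus := jobObjectStatus   // a value copy, not an alias

    // CompletionTime is written to the job object's copy...
    now := time.Now()
    jobObjectStatus.CompletionTime = &now

    // ...but the copy that is handed to the status update never sees it.
    fmt.Printf("job object status: %v\n", jobObjectStatus.CompletionTime) // set
    fmt.Printf("jobStatus copy:    %v\n", jobStatus.CompletionTime)       // <nil>
}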
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
We should investigate the reasons and should fix this bug.
/reopen
/lifecycle frozen
@tenzen-y: Reopened this issue.
In response to this:
We should investigate the reasons and should fix this bug.
/reopen
/lifecycle frozen