activeDeadlineSeconds not working in release v0.2.3
spec.runPolicy.activeDeadlineSeconds was set to 300 (seconds), but the MPIJob ran for over 22 minutes and was not terminated.
Reading the code, I found that activeDeadlineSeconds is not processed in mpi-operator. Is this a bug?
# mpijob:
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: mpijob-v1-3
  namespace: default
spec:
  ...
  runPolicy:
    activeDeadlineSeconds: 300
    backoffLimit: 2
# mpijob status
status:
  conditions:
  - lastTransitionTime: "2020-07-06T05:15:20Z"
    lastUpdateTime: "2020-07-06T05:15:20Z"
    message: MPIJob default/mpijob-v1-3 is created.
    reason: MPIJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2020-07-06T05:59:47Z"
    lastUpdateTime: "2020-07-06T05:59:47Z"
    message: MPIJob default/mpijob-v1-3 is running.
    reason: MPIJobRunning
    status: "True"
    type: Running
  replicaStatuses:
    Launcher:
      active: 1
    Worker:
      active: 1
# running pod
NAME                   READY   STATUS    RESTARTS   AGE
mpijob-v1-3-launcher   1/1     Running   0          22m
mpijob-v1-3-worker-0   1/1     Running   0          22m
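For reference, the behavior expected from activeDeadlineSeconds can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not mpi-operator code: the function name deadline_exceeded is hypothetical, and the start timestamp is an assumption, since the report only gives the pod age (22m).

```python
from datetime import datetime, timedelta, timezone


def deadline_exceeded(start_time: datetime, now: datetime,
                      active_deadline_seconds: int) -> bool:
    """Return True once the job has been active for at least the deadline."""
    return (now - start_time).total_seconds() >= active_deadline_seconds


# Assumed start time; only the 22-minute pod age comes from the report.
start = datetime(2020, 7, 6, 5, 59, 47, tzinfo=timezone.utc)
now = start + timedelta(minutes=22)

# 22 minutes is 1320 s, well past the 300 s deadline.
print(deadline_exceeded(start, now, 300))  # True
```

With these values a controller honoring the field would have been expected to terminate the job about 17 minutes earlier.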
The v1 controller no longer uses this field (it was removed in https://github.com/kubeflow/mpi-operator/pull/203). cc @carmark
@merryzhou Yes, this field was removed in v1, since there is no launcher job in the v1 controller. Could you please describe why you need it?
Actually, what I want is a field that indicates "the maximum time for an MPIJob to progress to Running status before it is considered failed", somewhat like ProgressDeadlineSeconds in Deployment.Spec.
Currently, an unexpected error such as a pod in ImagePullBackOff leaves the MPIJob stuck in Created status, so I think it would be better to have a mechanism that moves an MPIJob that cannot run because of such conditions to a failed state.
# pod
NAME                   READY   STATUS             RESTARTS   AGE
mpijob-v1-2-launcher   0/1     Init:0/1           0          11m
mpijob-v1-2-worker-0   0/1     ImagePullBackOff   0          11m
# mpijob status:
status:
  conditions:
  - lastTransitionTime: "2020-07-07T07:20:57Z"
    lastUpdateTime: "2020-07-07T07:20:57Z"
    message: MPIJob default/mpijob-v1-2 is created.
    reason: MPIJobCreated
    status: "True"
    type: Created
  replicaStatuses:
    Launcher: {}
    Worker: {}
  startTime: "2020-07-07T07:20:57Z"
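To make the request concrete, here is a rough Python sketch of the check I have in mind, applied to the status above. It is not part of mpi-operator: the function progress_deadline_exceeded, the helper parse_time, the progress_deadline_seconds parameter, and the 600 s value are all hypothetical, and treating Succeeded and Failed as terminal condition types is an assumption.

```python
from datetime import datetime, timezone


def parse_time(value: str) -> datetime:
    """Parse timestamps written like 2020-07-07T07:20:57Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def progress_deadline_exceeded(status: dict, now: datetime,
                               progress_deadline_seconds: int) -> bool:
    """Return True if the job never reached Running within the deadline."""
    for cond in status.get("conditions", []):
        # A job that is running or already finished has made progress.
        if cond.get("type") in ("Running", "Succeeded", "Failed") \
                and cond.get("status") == "True":
            return False
    start = status.get("startTime")
    if start is None:
        return False  # nothing to measure against yet
    elapsed = (now - parse_time(start)).total_seconds()
    return elapsed >= progress_deadline_seconds


# Status from the stuck job above, reduced to the fields the check reads.
status = {
    "conditions": [{"type": "Created", "status": "True", "reason": "MPIJobCreated"}],
    "replicaStatuses": {"Launcher": {}, "Worker": {}},
    "startTime": "2020-07-07T07:20:57Z",
}
now = parse_time("2020-07-07T07:31:57Z")  # 11 minutes after startTime

print(progress_deadline_exceeded(status, now, 600))  # True
```

With a hypothetical 600 s deadline, the job stuck on ImagePullBackOff for 11 minutes would be flagged, and the controller could then set a Failed condition instead of leaving it in Created.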
@merryzhou Thanks for the information. I will file a PR to support it.
What's the status of this issue?
Please upgrade to v0.3 or v0.4
/close
@alculquicondor: Closing this issue.